demo of domain corpus bootstrapping using language model perplexity

SNUDerek/lm_perplexity_bootstrapping

Corpus Bootstrapping with LM Perplexity

Purpose

use language model perplexity to augment a small domain-specific corpus by selecting 'similar' sentences from an unlabeled corpus (e.g. web-crawled data)

based on Ramaswamy, Printz, Gopalakrishnan: A Bootstrap Technique for Building Domain-Dependent Language Models, available here: http://mirlab.org/conference_papers/International_Conference/ICSLP%201998/PDF/SCAN/SL980611.PDF

Requirements

  • nltk
  • kenlm (LM in C++, install python extensions with setup.py)

Procedure

build a seed corpus of in-domain data, then:

iterate:

  1. build language model
  2. evaluate perplexity of unlabeled sentences under this model
  3. add n sentences under the perplexity threshold to the corpus

terminate when no new sentences are under the threshold
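
the loop above can be sketched roughly as follows. this is a minimal illustration, not the code from the notebooks: to stay self-contained it swaps in an add-one smoothed unigram model as a stand-in for KenLM, and the names `train_unigram`, `bootstrap`, `threshold` and `n` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Stand-in LM (assumption): add-one smoothed unigram model over
    lowercased whitespace tokens. Returns a perplexity function."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def perplexity(sentence):
        toks = sentence.lower().split()
        if not toks:
            return float("inf")  # never select empty sentences
        log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
        return math.exp(-log_prob / len(toks))

    return perplexity


def bootstrap(seed, unlabeled, threshold, n=100):
    """Grow the seed corpus with low-perplexity sentences from the pool."""
    corpus = list(seed)
    pool = list(unlabeled)
    while pool:
        perplexity = train_unigram(corpus)                 # 1. build language model
        scored = sorted((perplexity(s), s) for s in pool)  # 2. score unlabeled sentences
        # 3. keep at most n sentences under the threshold, lowest perplexity first
        selected = [s for ppl, s in scored if ppl < threshold][:n]
        if not selected:
            break  # terminate: nothing left under the threshold
        corpus.extend(selected)
        chosen = set(selected)
        pool = [s for s in pool if s not in chosen]
    return corpus
```

with KenLM, the `train_unigram` step would instead mean re-estimating an n-gram model on the current corpus and loading it in Python, where `kenlm.Model` provides a `perplexity(sentence)` method. note that the model is retrained after every batch, so a sentence that falls under the threshold in one round may no longer do so in the next.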

Results

see the jupyter notebooks for demos of selecting Jane Austen sentences from a mixture of sentences from Austen, Lewis Carroll and Herman Melville

Resources

For KenLM: