SNUDerek / lm_perplexity_bootstrapping Public

demo of domain corpus bootstrapping using language model perplexity

README.md
bootstrapping_demo-KenLM.ipynb
bootstrapping_demo.ipynb
process.py

Corpus Bootstrapping with LM Perplexity

Purpose

use language model perplexity to augment a small domain-specific sentence corpus by selecting 'similar' sentences from an unlabeled corpus (e.g. web-crawled data)

based on Ramaswamy, Printz, Gopalakrishnan: A Bootstrap Technique for Building Domain-Dependent Language Models, available here: http://mirlab.org/conference_papers/International_Conference/ICSLP%201998/PDF/SCAN/SL980611.PDF
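
for reference, the usual definition of perplexity for a sentence s = w_1 ... w_N under a language model P is sketched below. The README does not state the formula, so the notation here is an assumption, and whether sentence-boundary tokens count toward N depends on the toolkit. Lower perplexity means the model finds the sentence less surprising, which is what 'similar' means above.

```latex
\mathrm{PP}(s)
  = P(w_1, \dots, w_N)^{-1/N}
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_1, \dots, w_{i-1}) \right)
```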

Requirements

nltk
kenlm (language model toolkit in C++; install the Python extension with its setup.py)

Procedure

build a seed corpus of in-domain data, then:

iterate:

build a language model on the current corpus
evaluate perplexity of unlabeled sentences under this model
add n sentences under the perplexity threshold to the corpus

terminate when no new sentences are under the threshold
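
the loop above can be sketched in plain Python as follows. This is a minimal illustration, not the code from the notebooks: it swaps in an add-one smoothed unigram model for the nltk and KenLM models the repository uses, and the names `train_unigram`, `perplexity`, `bootstrap`, `n_per_round` and `max_rounds` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens over a list of tokenized sentences."""
    return Counter(tok for sent in sentences for tok in sent)


def perplexity(sent, counts, vocab_size):
    """Perplexity of one tokenized sentence under an add-one smoothed unigram model."""
    if not sent:
        return float("inf")  # empty sentences are never selected
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in sent
    )
    return math.exp(-log_prob / len(sent))


def bootstrap(seed, pool, threshold, n_per_round, max_rounds=50):
    """Grow the seed corpus with low-perplexity sentences from the pool."""
    corpus = list(seed)
    remaining = list(pool)
    # Assumption: the vocabulary is fixed up front from seed plus pool.
    vocab_size = len({tok for sent in corpus + remaining for tok in sent})
    for _ in range(max_rounds):
        counts = train_unigram(corpus)  # rebuild the model each round
        scored = sorted(
            (perplexity(sent, counts, vocab_size), i)
            for i, sent in enumerate(remaining)
        )
        # Take at most n_per_round sentences, lowest perplexity first.
        picked = [i for ppl, i in scored if ppl < threshold][:n_per_round]
        if not picked:
            break  # nothing left under the threshold
        picked_set = set(picked)
        corpus.extend(remaining[i] for i in picked)
        remaining = [s for i, s in enumerate(remaining) if i not in picked_set]
    return corpus, remaining


seed = [["the", "cat", "sat"], ["the", "dog", "sat"]]
pool = [["the", "dog", "sat", "down"], ["quantum", "flux", "capacitor"]]
corpus, remaining = bootstrap(seed, pool, threshold=10.0, n_per_round=1)
```

with this toy data the first pool sentence is added in the first round and the second never falls under the threshold, so the loop stops after two rounds. The threshold and n are tuning choices; the values here are arbitrary.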

Results

see the Jupyter notebooks for demos of selecting Jane Austen sentences from a mixture of sentences from Austen, Lewis Carroll, and Herman Melville
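
since the author of every sentence in the mixture is known, one way to judge a run is precision and recall of the selected sentences against the target author. The sketch below is an assumption about how that could be measured, not a description of what the notebooks report; the function name and the toy labels are hypothetical.

```python
def selection_precision_recall(selected_labels, pool_labels, target):
    """Precision and recall of the selected sentences for one target author.

    selected_labels: author label of each sentence the loop selected
    pool_labels: author label of every sentence in the unlabeled pool
    """
    true_pos = sum(1 for lab in selected_labels if lab == target)
    total_target = sum(1 for lab in pool_labels if lab == target)
    precision = true_pos / len(selected_labels) if selected_labels else 0.0
    recall = true_pos / total_target if total_target else 0.0
    return precision, recall


# Toy labels for illustration only, not results from the notebooks.
pool_labels = ["austen"] * 4 + ["carroll"] * 2 + ["melville"] * 2
selected_labels = ["austen", "austen", "melville"]
precision, recall = selection_precision_recall(
    selected_labels, pool_labels, target="austen"
)
```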

Resources

For KenLM:
