k9luo/Punctuation-Restoration

Punctuation Restoration

Requirements

Imagine that you are building software for transcribing speech to text. The speech transcription part works perfectly but cannot produce punctuation. The task is to train a predictive model that ingests a sequence of text and adds punctuation (period, comma or question mark) in the appropriate locations. This matters because downstream data processing jobs depend on punctuated text.

Example input:

this is a string of text with no punctuation this is a new sentence

Example output:

this is a string of text with no punctuation <period> this is a new sentence <period>
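One common way to frame this task is per-word sequence labeling: each word is tagged with the punctuation mark that follows it, or with a "no punctuation" label. The sketch below shows that framing under stated assumptions. The label names, the `<comma>` and `<question>` tags, and the helper functions `to_labels` and `render` are hypothetical and not taken from this repository; only the `<period>` tag appears in the example above.

```python
import re
from typing import List, Tuple

# Hypothetical label names (assumption, not from the repository).
PUNCT_TO_LABEL = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}
# Only "<period>" appears in the README example; the other tags are assumed.
LABEL_TO_TAG = {"PERIOD": "<period>", "COMMA": "<comma>", "QUESTION": "<question>"}


def to_labels(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (word, label) training pairs.

    Each word gets the label of the punctuation mark that follows it,
    or "O" when no punctuation follows.
    """
    pairs: List[Tuple[str, str]] = []
    for token in re.findall(r"[\w']+|[.,?]", text.lower()):
        if token in PUNCT_TO_LABEL:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_TO_LABEL[token])
        else:
            pairs.append((token, "O"))
    return pairs


def render(pairs: List[Tuple[str, str]]) -> str:
    """Write words and predicted labels in the tagged output format."""
    out: List[str] = []
    for word, label in pairs:
        out.append(word)
        if label != "O":
            out.append(LABEL_TO_TAG[label])
    return " ".join(out)


pairs = to_labels(
    "This is a string of text with no punctuation. This is a new sentence."
)
print(render(pairs))
```

At inference time the model would see only the words and predict the labels; `render` then produces the tagged string.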

Solution

My solution is largely based on the paper "Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration".

The architecture is defined as follows:

  1. Obtain word embeddings from GloVe.
  2. The word embeddings are then processed by densely connected Bi-LSTM layers.
  3. These Bi-LSTM layers are followed by an RNN with an attention mechanism and a conditional random field (CRF) log-likelihood loss.
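To give some intuition for why a CRF layer helps, here is a minimal pure-Python sketch of Viterbi decoding, the usual inference step for a linear-chain CRF. It is an illustration only, not the repository's implementation: the function name `viterbi_decode` and all scores are made up, and in the real model the emission scores would come from the network while the transition scores would be learned.

```python
from typing import List


def viterbi_decode(
    emissions: List[List[float]], transitions: List[List[float]]
) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : score of label j at word position t
    transitions[i][j] : score of label i being followed by label j
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    scores = list(emissions[0])  # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_scores: List[float] = []
        pointers: List[int] = []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            best_i = max(
                range(n_labels), key=lambda i: scores[i] + transitions[i][j]
            )
            pointers.append(best_i)
            new_scores.append(
                scores[best_i] + transitions[best_i][j] + emissions[t][j]
            )
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final label to recover the path.
    path = [max(range(n_labels), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with two labels: 0 = no punctuation, 1 = period.
emissions = [[2.0, 0.0], [1.0, 1.5], [0.0, 2.0]]
# Made-up transition scores that penalize two periods in a row.
transitions = [[0.0, 0.0], [0.0, -3.0]]
print(viterbi_decode(emissions, transitions))  # [0, 0, 1]
```

Picking the best label independently at each position would give `[0, 1, 1]` here. The transition penalty lets the decoder prefer a globally consistent sequence instead.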

The experiments are performed on the IWSLT dataset, which consists of TED Talk transcripts.

The detailed analysis can be found in this notebook.

Setup and Installation

First step, clone the repo:

git clone https://github.com/k9luo/Punctuation-Restoration.git

Second step, download the pretrained GloVe word embeddings and create a new conda virtual environment with setup.sh, or do these steps manually. Note that running setup.sh installs the GPU version of TensorFlow:

sh setup.sh

Third step, activate the virtual environment:

conda activate restore_punct

Fourth step, add the new virtual environment to Jupyter Notebook:

python -m ipykernel install --user --name=restore_punct

Training and Inference

Please run python main.py.