
Keyword spotting task for audio files using attention (KWS attention)

The model follows Shan et al., 2018: Attention-based End-to-End Models for Small-Footprint Keyword Spotting (https://arxiv.org/abs/1803.10916).

The model has ~142k parameters, though its size can be changed via the config in main.py:

BATCH_SIZE = 256        (size of batch for learning)
NUM_EPOCHS = 35         (number of epochs to train model)
N_MELS     = 40         (number of mels for melspectrogram)

IN_SIZE = 40            (size of input)
HIDDEN_SIZE = 64        (size of hidden representation in the GRU)
KERNEL_SIZE = (20, 5)   (size of kernel for convolution layer in CRNN)
STRIDE = (8, 2)         (size of stride for convolution layer in CRNN)
GRU_NUM_LAYERS = 2      (number of GRU layers in CRNN)
NUM_DIRS = 2            (number of directions in GRU (2 if bidirectional))
NUM_CLASSES = 2         (number of classes (2 for "no word" vs "Sheila is in the audio"))
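
For reference, here is a minimal sketch of how these parameters might wire together into the paper's CRNN-with-attention architecture (the class and layer names are assumptions, not the actual code in main.py):

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, in_size=40, hidden=64, kernel=(20, 5), stride=(8, 2),
                 gru_layers=2, num_dirs=2, num_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=kernel, stride=stride)
        # frequency bins remaining after convolving 40 mels with kernel 20, stride 8
        conv_freq = (in_size - kernel[0]) // stride[0] + 1
        self.gru = nn.GRU(conv_freq, hidden, num_layers=gru_layers,
                          bidirectional=(num_dirs == 2), batch_first=True)
        self.attn = nn.Linear(hidden * num_dirs, 1)    # one attention score per frame
        self.head = nn.Linear(hidden * num_dirs, num_classes)

    def forward(self, x):                       # x: (batch, n_mels, time)
        x = self.conv(x.unsqueeze(1))           # -> (batch, 1, freq', time')
        x = x.squeeze(1).transpose(1, 2)        # -> (batch, time', freq')
        h, _ = self.gru(x)                      # -> (batch, time', hidden * dirs)
        w = torch.softmax(self.attn(h), dim=1)  # soft attention weights over time
        c = (w * h).sum(dim=1)                  # attention-weighted context vector
        return self.head(c)                     # logits: (batch, num_classes)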

Training data can be downloaded here: http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz (see the notebook for more details on downloading).
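
A minimal download-and-extract sketch (the target directory name is an arbitrary choice):

import tarfile
import urllib.request

URL = "http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz"

# fetch the archive and unpack it next to the training scripts
urllib.request.urlretrieve(URL, "speech_commands_v0.01.tar.gz")
with tarfile.open("speech_commands_v0.01.tar.gz") as tar:
    tar.extractall("speech_commands")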

To train the model, run "python main.py".
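
With NUM_CLASSES = 2, training presumably boils down to binary cross-entropy classification; a minimal sketch of such a loop, with a dummy batch standing in for the real Speech Commands DataLoader (hypothetical, not the actual main.py logic; CRNN is the sketch above):

import torch
import torch.nn as nn

model = CRNN()
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# dummy batch: BATCH_SIZE melspectrograms of 40 mels x 101 frames, binary labels
specs = torch.randn(256, 40, 101)
labels = torch.randint(0, 2, (256,))

for epoch in range(35):                   # NUM_EPOCHS
    opt.zero_grad()
    loss = loss_fn(model(specs), labels)
    loss.backward()
    opt.step()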

The code supports inference as described in the paper. Install the requirements and run "python inference.py path/to/YOUR_AUDIO". It is fast and easy to run even on CPU. The script generates "path/to/YOUR_AUDIO.pdf" with a plot of the probability that the word "Sheila" is being said over the course of the audio.
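
A rough sketch of how such a probability plot could be produced, sliding a fixed window over the melspectrogram (the window and hop sizes are assumptions, and the real inference.py may implement the paper's streaming scheme differently):

import sys

import matplotlib
matplotlib.use("Agg")                      # render to file; no display required
import matplotlib.pyplot as plt
import torch
import torchaudio

audio_path = sys.argv[1]
wav, sr = torchaudio.load(audio_path)      # (channels, samples)
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=40)(wav.mean(0))

model = CRNN()                             # the sketch above; load trained weights here
model.eval()

probs, win, hop = [], 101, 10              # hypothetical window/hop in frames
with torch.no_grad():
    for t in range(0, melspec.shape[1] - win, hop):
        logits = model(melspec[:, t:t + win].unsqueeze(0))
        probs.append(torch.softmax(logits, dim=1)[0, 1].item())

plt.plot(probs)
plt.xlabel("window index")
plt.ylabel('P("Sheila")')
plt.savefig(audio_path + ".pdf")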