Skip to content

PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech

License

Notifications You must be signed in to change notification settings

keonlee9420/PortaSpeech

Repository files navigation

PortaSpeech - PyTorch Implementation

PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech.

Audio Samples

Audio samples are available at /demo.

Model Size

Module Normal Small Normal (paper) Small (paper)
Total 24M 7.6M 21.8M 6.7M
LinguisticEncoder 3.7M 1.4M - -
VariationalGenerator 11M 2.8M - -
FlowPostNet 9.3M 3.4M - -

Quickstart

DATASET refers to the names of datasets such as LJSpeech in the following documents.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Also, Dockerfile is provided for Docker users.

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported, try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The speaking rate of the synthesized utterances can be controlled by specifying the desired duration ratios. For example, one can increase the speaking rate by 20 by

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8

Please note that the controllability is originated from FastSpeech2 and not a vital interest of PortaSpeech.

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

Preprocessing

Run

python3 prepare_align.py --dataset DATASET

for some preparations.

For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternately, you can run the aligner by yourself.

After that, run the preprocessing script by

python3 preprocess.py --dataset DATASET

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

  • To use Automatic Mixed Precision, append --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.

Normal Model

Small Model Loss

Notes

  • For vocoder, HiFi-GAN and MelGAN are supported.
  • No ReLU activation and LayerNorm in VariationalGenerator to avoid mashed output.
  • Speed ​​up the convergence of word-to-phoneme alignment in LinguisticEncoder by dividing long words into subwords and sorting the dataset by mel-spectrogram frame length.
  • There are two kinds of helper loss to improve word-to-phoneme alignment: "ctc" and "dga". You can toggle them as follows:
    # In the train.yaml
    aligner:
        helper_type: "dga" # ["dga", "ctc", "none"]
    • "dga": Diagonal Guided Attention (DGA) Loss
    • "ctc": Connectionist Temporal Classification (CTC) Loss with forward-sum algorithm
    • If you set "none", no helper loss will be applied during training.
    • The alignments comparision of three methods ("dga", "ctc", and "none" from top to bottom):
    • The default setting is "dga". Although "ctc" makes the strongest alignment, the output quality and the accuracy are worse than "dga".
    • But still, there is a room for the improvement of output quality. The audio quality and the alingment (accuracy) seem to be a trade-off.
  • Will be extended to a multi-speaker TTS.

Citation

Please cite this repository by the "Cite this repository" of About section (top right of the main page).

References