# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism


## DiffSinger (SVS)

In PART 1, we focus only on spectrum modeling (the acoustic model) and assume the ground-truth (GT) F0 is given as the pitch information, following these papers [1][2][3]. If you want to conduct experiments with F0 prediction, please move on to PART 2.

Thus, the pipeline of this part can be summarized as:

```
[lyrics] -> [linguistic representation]  (Frontend)
[linguistic representation] + [GT F0] + [GT phoneme duration] -> [mel-spectrogram]  (Acoustic model)
[mel-spectrogram] + [GT F0] -> [waveform]  (Vocoder)
```
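
For readers who prefer code, below is a minimal sketch of this pipeline as plain Python. The callables and argument names are illustrative placeholders, not the actual identifiers used in this repository; only the data flow is meant to match the summary above.

```python
# Minimal sketch of the PART 1 pipeline. All callables are placeholders for the
# real components in this repository; only the data flow is meant to be accurate.
from typing import Callable, Sequence

import numpy as np


def synthesize_part1(
    lyrics: str,
    gt_f0: np.ndarray,            # ground-truth frame-level F0 curve
    gt_durations: Sequence[int],  # ground-truth phoneme durations (in frames)
    frontend: Callable,           # lyrics -> linguistic representation (phonemes)
    acoustic_model: Callable,     # (phonemes, f0, durations) -> mel-spectrogram
    vocoder: Callable,            # (mel, f0) -> waveform
) -> np.ndarray:
    phonemes = frontend(lyrics)
    mel = acoustic_model(phonemes, gt_f0, gt_durations)
    wav = vocoder(mel, gt_f0)
    return wav
```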

[1] Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer. Interspeech 2020.

[2] Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer. ICASSP 2020.

[3] DeepSinger: Singing Voice Synthesis with Data Mined From the Web. KDD 2020.

Click here for detailed instructions: link.

Thanks to the Opencpop team for releasing their SVS dataset with MIDI labels on Jan. 20, 2022 (after we published our paper).

Since the dataset comes with carefully annotated MIDI labels, we can extend the pipeline of PART 1 by adding a naive melody frontend.

### 2.A

Thus, the pipeline of 2.A can be summarized as:

```
[lyrics] + [MIDI] -> [linguistic representation (with MIDI information)] + [predicted F0] + [predicted phoneme duration]  (Melody frontend)
[linguistic representation] + [predicted F0] + [predicted phoneme duration] -> [mel-spectrogram]  (Acoustic model)
[mel-spectrogram] + [predicted F0] -> [waveform]  (Vocoder)
```
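
As a rough code sketch (again with placeholder names, not this repository's actual identifiers), the only change from PART 1 is that the melody frontend now supplies the F0 and phoneme durations instead of the ground truth:

```python
# Minimal sketch of the 2.A pipeline with placeholder callables: the melody
# frontend predicts F0 and phoneme durations from lyrics + MIDI.
from typing import Callable, Sequence

import numpy as np


def synthesize_2a(
    lyrics: str,
    midi: Sequence[int],        # note sequence from the MIDI annotation
    melody_frontend: Callable,  # (lyrics, midi) -> (phonemes_with_midi, pred_f0, pred_durations)
    acoustic_model: Callable,   # (phonemes, f0, durations) -> mel-spectrogram
    vocoder: Callable,          # (mel, f0) -> waveform
) -> np.ndarray:
    phonemes, pred_f0, pred_durations = melody_frontend(lyrics, midi)
    mel = acoustic_model(phonemes, pred_f0, pred_durations)
    wav = vocoder(mel, pred_f0)
    return wav
```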

Click here for detailed instructions: link.

### 2.B

In 2.A, we find that if we predict F0 explicitly in the melody frontend, there are many bad cases in unvoiced/voiced (UV/V) prediction. We therefore abandon the explicit prediction of the F0 curve in the melody frontend and instead predict F0 jointly with the spectrogram.

Thus, the pipeline of 2.B can be summarized as:

```
[lyrics] + [MIDI] -> [linguistic representation] + [predicted phoneme duration]  (Melody frontend)
[linguistic representation (with MIDI information)] + [predicted phoneme duration] -> [mel-spectrogram]  (Acoustic model)
[mel-spectrogram] -> [predicted F0]  (Pitch extractor)
[mel-spectrogram] + [predicted F0] -> [waveform]  (Vocoder)
```
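
In code-sketch form (placeholder names as before), the difference from 2.A is that the frontend no longer predicts F0; a separate pitch extractor recovers it from the predicted mel-spectrogram before vocoding:

```python
# Minimal sketch of the 2.B pipeline with placeholder callables: F0 is no longer
# predicted by the frontend but recovered from the predicted mel-spectrogram.
from typing import Callable, Sequence

import numpy as np


def synthesize_2b(
    lyrics: str,
    midi: Sequence[int],
    melody_frontend: Callable,  # (lyrics, midi) -> (phonemes_with_midi, pred_durations)
    acoustic_model: Callable,   # (phonemes_with_midi, durations) -> mel (pitch modelled jointly)
    pitch_extractor: Callable,  # mel -> pred_f0
    vocoder: Callable,          # (mel, f0) -> waveform
) -> np.ndarray:
    phonemes, pred_durations = melody_frontend(lyrics, midi)
    mel = acoustic_model(phonemes, pred_durations)
    pred_f0 = pitch_extractor(mel)
    wav = vocoder(mel, pred_f0)
    return wav
```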

Click here for detailed instructions: link.

## FAQ

Q1: Why do I need F0 in vocoders?

A1: See the vocoder parts of HiFiSinger, DiffSinger, or SingGAN. This is common practice now.

Q2: Why not run the MIDI version of SVS on the PopCS dataset? Or why not release MIDI labels for the PopCS dataset?

A2: Our laboratory has no funds to label the PopCS dataset, but there are funds for labeling another singing dataset, which is coming soon.

Q3: Why do I get "'HifiGAN' object has no attribute 'model'"?

A3: Please put the pretrained vocoders in your checkpoints directory.
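
This error usually means the vocoder checkpoint cannot be found. As a rough illustration only (the actual folder and file names depend on the pretrained vocoder release you downloaded; follow the detailed instructions linked above), the layout should look roughly like:

```
checkpoints/
├── <pretrained_vocoder_dir>/      # the downloaded vocoder folder, unzipped as-is
│   ├── config.yaml                # illustrative file names
│   └── model_ckpt_steps_*.ckpt
└── <your_exp_name>/               # your own experiment checkpoints
```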

Q4: How can I check whether GT information or predicted information is used during inference from the packed test set?

A4: Please see the code here.
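
As a purely hypothetical illustration (the flag and key names below are placeholders, not this repository's actual config keys; the linked code is authoritative), the logic to look for is a per-feature switch between GT and predicted inputs at inference time:

```python
# Hypothetical sketch only: flag and key names are placeholders, not the actual
# identifiers used in this repository.
def pick_inference_inputs(sample: dict, hparams: dict) -> dict:
    use_gt_dur = hparams.get("use_gt_dur", False)  # placeholder flag name
    use_gt_f0 = hparams.get("use_gt_f0", False)    # placeholder flag name
    return {
        "durations": sample["gt_durations"] if use_gt_dur else sample["pred_durations"],
        "f0": sample["gt_f0"] if use_gt_f0 else sample["pred_f0"],
    }
```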

...