E3-VITS

Samples are available on GitHub Pages.

Title: E3-VITS: Emotional End-to-End TTS with Cross-speaker Style Transfer (Paper link)

Abstract: Because previous emotional TTS models rely on a two-stage pipeline or additional labels, their training process is complex and incurs a high labeling cost. To deal with this problem, this paper presents E3-VITS, an end-to-end emotional TTS model that addresses the limitations of existing models. E3-VITS synthesizes high-quality speech under multi-speaker conditions, supports emotional speech synthesis from either a reference speech sample or a textual description, and enables cross-speaker emotion transfer with a disjoint dataset. To implement E3-VITS, we propose batch-permuted style perturbation, which generates audio samples with unpaired emotion to increase the quality of cross-speaker emotion transfer. Results show that E3-VITS outperforms the baseline model in terms of naturalness, speaker and emotion similarity, and inference speed.
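
The abstract does not describe how batch-permuted style perturbation is implemented, so the following is only a rough sketch of the general idea under stated assumptions. It assumes that each item in a training batch has its own style (emotion) embedding, and that "batch-permuted" means reassigning those embeddings across items so that each utterance is paired with the style of a different item in the same batch. The function name `permute_styles` and the string labels standing in for embedding vectors are hypothetical and are not taken from the E3-VITS code.

```python
import random


def permute_styles(style_embeddings, seed=None):
    """Reassign style embeddings across one batch (illustrative sketch).

    Assumption: item i should receive the style of a *different* item j
    from the same batch, so that content/speaker and emotion are unpaired.
    Returns the permuted styles and the source index used for each item.
    """
    n = len(style_embeddings)
    if n < 2:
        # Nothing to swap with; return the batch unchanged.
        return list(style_embeddings), list(range(n))

    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)

    # Walk the shuffled order as a single cycle: each item takes the style
    # of the next item in the cycle, so no item keeps its own style.
    source = [0] * n
    for pos, idx in enumerate(order):
        source[idx] = order[(pos + 1) % n]

    permuted = [style_embeddings[j] for j in source]
    return permuted, source


# Placeholder labels stand in for per-utterance style embedding vectors.
batch = ["happy_A", "sad_B", "angry_C", "neutral_D"]
permuted, source = permute_styles(batch, seed=0)
```

In a real training loop the list would presumably be a tensor of embeddings indexed with the permutation, and the model would be conditioned on the original speaker together with the permuted style. That wiring is an assumption here, not something the README states.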

Wonbin-Jung/e3-vits

Folders and files

Latest commit

History

Repository files navigation

E3-VITS

About

Topics

Resources

Stars

Watchers

Forks

Languages