
[Help]: What is the difference between TTA and TTM? #190

Open
rainbowjack opened this issue Apr 24, 2024 · 2 comments
@rainbowjack

Is TTA included in TTM?

@viewfinder-annn viewfinder-annn self-assigned this Apr 25, 2024
@viewfinder-annn
Collaborator

viewfinder-annn commented Apr 25, 2024

Hi @rainbowjack, nice question!

tl;dr: text-to-audio (TTA) includes text-to-music (TTM). You can train a TTA model on text-music pairs, which effectively turns it into a TTM model, but this may require large amounts of data because of the internal structure (tempo, harmony, melody, etc.) of music pieces.

Theoretically, audio includes music. AudioLDM [1] describes audio as "sound effects, music, or speech"; AudioLM [2] speaks of "audio signals, be they speech, music or environmental"; and AudioGen [3] treats audio as ranging from soundscapes to music or speech. So it is not that "TTA is included in TTM"; rather, TTM is a subclass of TTA, where music is a more structured form of audio, carrying additional internal structure (tempo, harmony, melody, etc.).

Currently, the Amphion framework includes a TTA model based on latent diffusion. If you obtain text-music data pairs, you can use them directly to train a TTM model. However, it is important to note that music generation models generally require vast amounts of data (340k hours for Noise2Music [4], 280k hours for MusicLM [5], 20k hours for MusicGen [6], 46k hours for SingSong [7]), so if the results are unsatisfactory, the cause is more likely the data than the model itself.
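
As a minimal sketch of preparing such text-music pairs, here is how you might write them out as a JSONL manifest, one pair per line. The field names (`"text"`, `"audio"`) and file paths are illustrative assumptions, not Amphion's actual schema; adapt them to whatever format the recipe you use expects.

```python
import json
from pathlib import Path

def build_manifest(pairs, out_path):
    """Write (caption, audio_path) pairs as a JSONL manifest.

    Each line is one text-music pair. The keys "text" and "audio"
    are hypothetical placeholders, not a specific framework's schema.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for caption, audio in pairs:
            f.write(json.dumps({"text": caption, "audio": str(audio)}) + "\n")
    return out_path

# Example usage with placeholder captions and paths:
pairs = [
    ("upbeat jazz trio with walking bass", "data/clip_0001.wav"),
    ("slow ambient pad in a minor key", "data/clip_0002.wav"),
]
manifest = build_manifest(pairs, "train_manifest.jsonl")
```

A flat manifest like this keeps the caption next to its audio path, which makes it easy to filter or deduplicate pairs before training.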

Furthermore, we are also developing music generation frameworks, so stay tuned if you are interested :)

[1] AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
[2] AudioLM: a Language Modeling Approach to Audio Generation
[3] AudioGen: Textually Guided Audio Generation
[4] Noise2Music: Text-conditioned Music Generation with Diffusion Models
[5] MusicLM: Generating Music From Text
[6] Simple and Controllable Music Generation
[7] SingSong: Generating musical accompaniments from singing

@rainbowjack
Author

Thank you very much. I just need to do some research on music synthesis and genres.
