Awesome Audio Gallery


Detection

Retrieval-Augmented Audio Deepfake Detection, arXiv, 2404.13892, arxiv, pdf, citation: -1

Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang
Cross-Domain Audio Deepfake Detection: Dataset and Analysis, arXiv, 2404.04904, arxiv, pdf, citation: -1

Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang
Detecting Multimedia Generated by Large AI Models: A Survey, arXiv, 2402.00045, arxiv, pdf, citation: -1

Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, Shu Hu · (Detect-LAIM-generated-Multimedia-Survey - Purdue-M2)
Proactive Detection of Voice Cloning with Localized Watermarking, arXiv, 2401.17264, arxiv, pdf, citation: -1

Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar · (audioseal - facebookresearch)

Courses and Tutorials

DLHLP 2020 Spring

[DLHLP 2020] Hung-yi Lee's Spring 2020 course: speech recognition, speech synthesis, speech separation - bilibili
[DLHLP 2020] Vocoder (taught by teaching assistant 許博竣) - bilibili
TTS Intro: [DLHLP 2020] Speech Synthesis (1/2) - Tacotron - YouTube

Large Audio Model

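[Machine Learning 2023] Speech Foundation Models (taught by teaching assistant 張凱為) (1/2) - YouTube
[Machine Learning 2023] Speech Foundation Models (taught by teaching assistant 張凱為) (2/2) - YouTube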
https://speech.ee.ntu.edu.tw/~hylee/ml/ml2023-course-data/張凱爲-x-機器學習-x-語音基石模型.pdf

Other

scaling law for multimodality
awesome-speech-recognition-speech-synthesis-papers - zzw922cn
Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion - guan-yuan
survey - tts-tutorial

A Survey on Neural Speech Synthesis
interspeech2022 - tts-tutorial
INTERSPEECH_Tutorial_TTS.pdf
INTERSPEECH_Tutorial_VC.pdf
https://www.microsoft.com/en-us/research/uploads/prod/2022/12/Generative-Models-for-TTS.pdf
AudioGPT - AIGC-Audio

Speech Translation

gentranslate - yuchen005

Code for paper "GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators"
PolyVoice: Language Models for Speech to Speech Translation, arXiv, 2306.02982, arxiv, pdf, citation: -1

Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng · (speechtranslation.github)
fairseq - facebookresearch

Toolkits

speechbrain - speechbrain

A PyTorch-based Speech Toolkit · (huggingface)
demucs - facebookresearch

Code for the paper Hybrid Spectrogram and Waveform Source Separation
Resemblyzer - resemble-ai

A python package to analyze and compare voices with deep learning
FRCRN - alibabasglab
seewav - adefossez

Audio waveform visualisation, converts any audio to a nice video · (huggingface)
NeMo-text-processing - NVIDIA

NeMo text processing for ASR and TTS
OpenPhonemizer - NeuralVox

Permissively licensed, open sourced, local IPA Phonemizer (G2P) powered by deep learning.
fullstop-deep-punctuation-prediction - oliverguhr

A model that predicts the punctuation of English, Italian, French and German texts. · (huggingface)
nendo - okio-ai

The Nendo AI Audio Tool Suite
deepfilternet - rikorose

Noise suppression using deep filtering
CharsiuG2P - lingjzhu

Multilingual G2P in 100 languages
(Inverse) Text Normalization — NVIDIA NeMo
audio-preprocess - fishaudio

Preprocess Audio for training
pyloudnorm - csteinmetz1

Flexible audio loudness meter in Python with implementation of ITU-R BS.1770-4 loudness algorithm
ultimatevocalremovergui - Anjok07

GUI for a Vocal Remover that uses Deep Neural Networks.
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit, arXiv, 2312.09911, arxiv, pdf, citation: -1

Xueyao Zhang, Liumeng Xue, Yuancheng Wang, Yicheng Gu, Xi Chen, Zihao Fang, Haopeng Chen, Lexiao Zou, Chaoren Wang, Jun Han · (huggingface) · (Amphion - open-mmlab)
resemble-enhance - resemble-ai

AI powered speech denoising and enhancement
awesome-python - vinta

A curated list of awesome Python frameworks, libraries, software and resources
pyannote-audio - pyannote

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
audiocraft - facebookresearch

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
audio-slicer - openvpi

Python script that slices audio with silence detection
autocut - mli
phonemizer - bootphon

Simple text to phones converter for multiple languages
g2pM - kakaobrain

A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset
g2pC - Kyubyong

g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
AcademiCodec - yangdongchao

AcademiCodec: An Open Source Audio Codec Model for Academic Research
Python-Wrapper-for-World-Vocoder - JeremyCCHsu

A Python wrapper for the high-quality vocoder "World"
g2p-kd - sigmeta

Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion
ffmpeg-normalize - slhck

Audio Normalization for Python/ffmpeg
Montreal-Forced-Aligner - MontrealCorpusTools

Command line utility for forced alignment using Kaldi
WeTextProcessing - wenet-e2e

Text Normalization & Inverse Text Normalization
python-pinyin - mozillazg

Convert Chinese characters to pinyin (pypinyin)
pypinyin-g2pW - mozillazg

Improve the accuracy of pypinyin using g2pW
MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra, arXiv, 2305.13686, arxiv, pdf, citation: -1

Ye-Xin Lu, Yang Ai, Zhen-Hua Ling · (MP-SENet - yxlu-0102)

Dataset

links_to_pocasts_lecture_and_shows_for_tts - laion 🤗
Audio Dialogues: Dialogues dataset for audio and music understanding, arXiv, 2404.07616, arxiv, pdf, citation: -1

Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro · (audiodialogues.github)
yodas - espnet 🤗
An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation, arXiv, 2402.16380, arxiv, pdf, citation: -1

Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio Minazzi, Nicola Sobieski, Sebastien Bratieres
common_voice_17_0 - mozilla-foundation 🤗
common_voice_16_0 - mozilla-foundation 🤗
GigaSpeech - SpeechColab

Large, modern dataset for speech recognition
Chinese and English data collection
voice_datasets - jim-schwoebel

🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).

TTS

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis, arXiv, 2308.05725, arxiv, pdf, citation: -1

Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid · (huggingface) · (textlesslib - facebookresearch)
StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations, arXiv, 2404.14946, arxiv, pdf, citation: -1

Sen Liu, Yiwei Guo, Xie Chen, Kai Yu · (goarsenal.github)
mls_eng_10k - parler-tts 🤗
LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus, arXiv, 2305.18802, arxiv, pdf, citation: -1

Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna · (openslr) · (google.github)
Libri-Light: A Benchmark for ASR with Limited or No Supervision, ICASSP 2020-2020 IEEE International Conference on Acoustics …, 2020, arxiv, pdf, citation: 483

Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen · (libri-light - facebookresearch)
MLS: A Large-Scale Multilingual Dataset for Speech Research, arXiv, 2012.03411, arxiv, pdf, citation: -1

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert · (openslr)
- The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- The audio files are segmented into 10-20 second segments.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech, arXiv, 1904.02882, arxiv, pdf, citation: -1

Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu
Hi-Fi Multi-Speaker English TTS Dataset, arXiv, 2104.01497, arxiv, pdf, citation: -1

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang · (openslr)

Audio Techs

EfficientSpeech: An On-Device Text to Speech Model, arXiv, 2305.13905, arxiv, pdf, citation: -1

Rowel Atienza
Visual-Aware Text-to-Speech, arXiv, 2306.12020, arxiv, pdf, citation: -1

Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei
Proactive Detection of Voice Cloning with Localized Watermarking, arXiv, 2401.17264, arxiv, pdf, citation: -1

Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar
FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation, arXiv, 2401.04283, arxiv, pdf, citation: -1

Yang Liu, Li Wan, Yun Li, Yiteng Huang, Ming Sun, James Luan, Yangyang Shi, Xin Lei
AudioSep - Audio-AGI

Official implementation of "Separate Anything You Describe"
AudioSR: Versatile Audio Super-resolution at Scale, arXiv, 2309.07314, arxiv, pdf, citation: -1

Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley

Audio Visual

AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue, arXiv, 2403.16276, arxiv, pdf, citation: -1

Yunlong Tang, Daiki Shimada, Jing Bi, Chenliang Xu
M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset, arXiv, 2403.14168, arxiv, pdf, citation: -1

Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang
Text-to-Audio Generation Synchronized with Videos, arXiv, 2403.07938, arxiv, pdf, citation: -1

Shentong Mo, Jing Shi, Yapeng Tian
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners, arXiv, 2402.17723, arxiv, pdf, citation: -1

Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen · (yzxing87.github)
vsp-llm - sally-sh
RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation, arXiv, 2309.17189, arxiv, pdf, citation: -1

Samuel Pegg, Kai Li, Xiaolin Hu · (jiqizhixin) · (cslikai) · (RTFS-Net - spkgyk)

Emotion Recognition

Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition, arXiv, 2403.19224, arxiv, pdf, citation: -1

Siyuan Shen, Yu Gao, Feng Liu, Hanyang Wang, Aimin Zhou · (ENT - ECNU-Cross-Innovation-Lab)

Speech Separation

SPMamba: State-space model is all you need in speech separation, arXiv, 2404.02063, arxiv, pdf, citation: -1

Kai Li, Guo Chen · (SPMamba - JusperLee)

Products

PlayAI
Meet Hume’s Empathic Voice Interface (EVI), the first conversational AI with emotional intelligence. · (twitter) · (mp.weixin.qq)

Extra Reference

speech-trident - ga642381

Awesome speech/audio LLMs, representation learning, and codec models
Large-Audio-Models - liusongxiang

Keep track of big models in audio domain, including speech, singing, music etc.
Awesome-Speech-Generation - kuan2jiu99

Survey on speech generation work.
Speech-Prompts-Adapters - ga642381

This Repository surveys the paper focusing on Prompting and Adapters for Speech Processing.
