- Retrieval-Augmented Audio Deepfake Detection, arXiv, 2404.13892
  - Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang
- Cross-Domain Audio Deepfake Detection: Dataset and Analysis, arXiv, 2404.04904
  - Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang
- Detecting Multimedia Generated by Large AI Models: A Survey, arXiv, 2402.00045
  - Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, Shu Hu · (Detect-LAIM-generated-Multimedia-Survey - Purdue-M2)
- Proactive Detection of Voice Cloning with Localized Watermarking, arXiv, 2401.17264
  - Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar · (audioseal - facebookresearch)
- DLHLP 2020 Spring
  - [DLHLP 2020] Prof. Hung-yi Lee's Spring 2020 course: speech recognition, speech synthesis, speech separation (Bilibili)
  - [DLHLP 2020] Vocoder (lectured by TA Po-Chun Hsu) (Bilibili)
  - TTS Intro: [DLHLP 2020] Speech Synthesis (1/2) - Tacotron - YouTube
  - [Machine Learning 2023] Speech Foundation Models (lectured by TA Kai-Wei Chang) (1/2) - YouTube
  - [Machine Learning 2023] Speech Foundation Models (lectured by TA Kai-Wei Chang) (2/2) - YouTube
  - https://speech.ee.ntu.edu.tw/~hylee/ml/ml2023-course-data/張凱爲-x-機器學習-x-語音基石模型.pdf
- awesome-speech-recognition-speech-synthesis-papers - zzw922cn
- Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion - guan-yuan
- survey - tts-tutorial
  - A Survey on Neural Speech Synthesis
- interspeech2022 - tts-tutorial
- https://www.microsoft.com/en-us/research/uploads/prod/2022/12/Generative-Models-for-TTS.pdf
- AudioGPT - AIGC-Audio
- gentranslate - yuchen005
  - Code for the paper "GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators"
- PolyVoice: Language Models for Speech to Speech Translation, arXiv, 2306.02982
  - Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng · (speechtranslation.github)
- fairseq - facebookresearch
- speechbrain - speechbrain
  - A PyTorch-based Speech Toolkit · (huggingface)
- demucs - facebookresearch
  - Code for the paper "Hybrid Spectrogram and Waveform Source Separation"
- Resemblyzer - resemble-ai
  - A Python package to analyze and compare voices with deep learning
- FRCRN - alibabasglab
- seewav - adefossez
  - Audio waveform visualisation; converts any audio to a nice video · (huggingface)
- NeMo-text-processing - NVIDIA
  - NeMo text processing for ASR and TTS
- OpenPhonemizer - NeuralVox
  - Permissively licensed, open-source, local IPA phonemizer (G2P) powered by deep learning
- fullstop-deep-punctuation-prediction - oliverguhr
  - A model that predicts the punctuation of English, Italian, French, and German texts · (huggingface)
- nendo - okio-ai
  - The Nendo AI Audio Tool Suite
- deepfilternet - rikorose
  - Noise suppression using deep filtering
- CharsiuG2P - lingjzhu
  - Multilingual G2P in 100 languages
- audio-preprocess - fishaudio
  - Preprocess audio for training
- pyloudnorm - csteinmetz1
  - Flexible audio loudness meter in Python with an implementation of the ITU-R BS.1770-4 loudness algorithm
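  A minimal sketch of the gain step in loudness normalization: the BS.1770-4 measurement itself is what pyloudnorm implements (via its `Meter` class); the pure-Python snippet below only illustrates the dB arithmetic, assuming the integrated loudness has already been measured.

  ```python
  # Sketch of the gain step in loudness normalization (illustrative only;
  # the BS.1770-4 measurement itself is what pyloudnorm provides).
  # A signal measured at `measured_lufs` is moved to `target_lufs` by
  # scaling with 10^((target - measured) / 20).

  def loudness_gain(measured_lufs: float, target_lufs: float) -> float:
      """Linear gain factor that moves `measured_lufs` to `target_lufs`."""
      return 10.0 ** ((target_lufs - measured_lufs) / 20.0)

  def normalize(samples, measured_lufs, target_lufs=-23.0):
      """Apply the loudness gain to a sequence of samples."""
      g = loudness_gain(measured_lufs, target_lufs)
      return [s * g for s in samples]

  # A clip measured at -17 LUFS needs -6 dB of gain to hit -23 LUFS:
  print(round(loudness_gain(-17.0, -23.0), 3))  # 0.501
  ```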
- ultimatevocalremovergui - Anjok07
  - GUI for a vocal remover that uses deep neural networks
- Amphion: An Open-Source Audio, Music and Speech Generation Toolkit, arXiv, 2312.09911
  - Xueyao Zhang, Liumeng Xue, Yuancheng Wang, Yicheng Gu, Xi Chen, Zihao Fang, Haopeng Chen, Lexiao Zou, Chaoren Wang, Jun Han · (huggingface) · (Amphion - open-mmlab)
- resemble-enhance - resemble-ai
  - AI-powered speech denoising and enhancement
- awesome-python - vinta
  - A curated list of awesome Python frameworks, libraries, software and resources
- pyannote-audio - pyannote
  - Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
- audiocraft - facebookresearch
  - A library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor/tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
- audio-slicer - openvpi
  - Python script that slices audio with silence detection
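  A toy sketch of the idea behind silence-based slicing (illustrative only, not audio-slicer's actual algorithm): mark frames whose RMS falls below a threshold and cut the signal at sufficiently long silent runs.

  ```python
  # Toy sketch of silence-based slicing (not the actual audio-slicer
  # algorithm): a frame is "silent" when its RMS falls below a threshold,
  # and the clip is cut at silent runs of at least `min_silence_frames`.

  def frame_rms(frame):
      """Root-mean-square energy of one frame of samples."""
      return (sum(s * s for s in frame) / len(frame)) ** 0.5

  def slice_on_silence(frames, threshold=0.01, min_silence_frames=2):
      """Split a list of frames into segments separated by silent runs."""
      segments, current, silent_run = [], [], 0
      for frame in frames:
          if frame_rms(frame) < threshold:
              silent_run += 1
              current.append(frame)
          else:
              if silent_run >= min_silence_frames and current:
                  # Long enough silence: close the previous segment,
                  # dropping its trailing silent frames, and start fresh.
                  segments.append(current[:-silent_run])
                  current = []
              silent_run = 0
              current.append(frame)
      if current:
          # Trim trailing silence from the final segment.
          segments.append(current[:len(current) - silent_run] or current)
      return segments
  ```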
- autocut - mli
- phonemizer - bootphon
  - Simple text-to-phones converter for multiple languages
- g2pM - kakaobrain
  - A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset
- g2pC - Kyubyong
  - g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
- AcademiCodec - yangdongchao
  - AcademiCodec: An Open Source Audio Codec Model for Academic Research
- Python-Wrapper-for-World-Vocoder - JeremyCCHsu
  - A Python wrapper for the high-quality vocoder "World"
- g2p-kd - sigmeta
  - Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion
- ffmpeg-normalize - slhck
  - Audio normalization for Python/ffmpeg
- Montreal-Forced-Aligner - MontrealCorpusTools
  - Command line utility for forced alignment using Kaldi
- WeTextProcessing - wenet-e2e
  - Text Normalization & Inverse Text Normalization
- python-pinyin - mozillazg
  - Chinese characters to pinyin conversion (pypinyin)
- pypinyin-g2pW - mozillazg
  - Improves pypinyin's accuracy using g2pW
- MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra, arXiv, 2305.13686
  - Ye-Xin Lu, Yang Ai, Zhen-Hua Ling · (MP-SENet - yxlu-0102)
- Audio Dialogues: Dialogues dataset for audio and music understanding, arXiv, 2404.07616
  - Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro · (audiodialogues.github)
- yodas - espnet 🤗
- An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation, arXiv, 2402.16380
  - Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio Minazzi, Nicola Sobieski, Sebastien Bratieres
- common_voice_17_0 - mozilla-foundation 🤗
- common_voice_16_0 - mozilla-foundation 🤗
- GigaSpeech - SpeechColab
  - Large, modern dataset for speech recognition
- voice_datasets - jim-schwoebel
  - 🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis, arXiv, 2308.05725
  - Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid · (huggingface) · (textlesslib - facebookresearch)
- StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations, arXiv, 2404.14946
  - Sen Liu, Yiwei Guo, Xie Chen, Kai Yu · (goarsenal.github)
- mls_eng_10k - parler-tts 🤗
- LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus, arXiv, 2305.18802
  - Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna · (openslr) · (google.github)
- Libri-Light: A Benchmark for ASR with Limited or No Supervision, ICASSP 2020, citations: 483
  - Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen · (libri-light - facebookresearch)
- MLS: A Large-Scale Multilingual Dataset for Speech Research, arXiv, 2012.03411
  - Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert · (openslr)
  - The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read LibriVox audiobooks and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
  - The audio files are segmented into 10-20 second segments.
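  The 10-20 second segmentation can be sketched as a greedy split over per-utterance durations (a toy illustration under assumed constraints, not MLS's actual pipeline):

  ```python
  # Toy sketch of greedy 10-20 s segmentation (illustrative only, not the
  # actual MLS pipeline): accumulate consecutive utterance durations, and
  # close a segment before it would exceed the 20 s maximum, provided it
  # has already reached the 10 s minimum.

  def segment_durations(durations, min_len=10.0, max_len=20.0):
      """Group per-utterance durations (seconds) into min/max-length segments."""
      segments, current, total = [], [], 0.0
      for d in durations:
          if current and total + d > max_len and total >= min_len:
              segments.append(current)
              current, total = [], 0.0
          current.append(d)
          total += d
      if current:
          segments.append(current)
      return segments

  print(segment_durations([6.0, 6.0, 6.0, 6.0]))  # [[6.0, 6.0, 6.0], [6.0]]
  ```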
- LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech, arXiv, 1904.02882
  - Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu
- Hi-Fi Multi-Speaker English TTS Dataset, arXiv, 2104.01497
  - Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang · (openslr)
- EfficientSpeech: An On-Device Text to Speech Model, arXiv, 2305.13905
  - Rowel Atienza
- Visual-Aware Text-to-Speech, arXiv, 2306.12020
  - Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei
- FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation, arXiv, 2401.04283
  - Yang Liu, Li Wan, Yun Li, Yiteng Huang, Ming Sun, James Luan, Yangyang Shi, Xin Lei
- AudioSep - Audio-AGI
  - Official implementation of "Separate Anything You Describe"
- AudioSR: Versatile Audio Super-resolution at Scale, arXiv, 2309.07314
  - Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley
- AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue, arXiv, 2403.16276
  - Yunlong Tang, Daiki Shimada, Jing Bi, Chenliang Xu
- M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset, arXiv, 2403.14168
  - Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang
- Text-to-Audio Generation Synchronized with Videos, arXiv, 2403.07938
  - Shentong Mo, Jing Shi, Yapeng Tian
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners, arXiv, 2402.17723
  - Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen · (yzxing87.github)
- vsp-llm - sally-sh
- RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation, arXiv, 2309.17189
  - Samuel Pegg, Kai Li, Xiaolin Hu · (jiqizhixin) · (cslikai) · (RTFS-Net - spkgyk)
- Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition, arXiv, 2403.19224
  - Siyuan Shen, Yu Gao, Feng Liu, Hanyang Wang, Aimin Zhou · (ENT - ECNU-Cross-Innovation-Lab)
- SPMamba: State-space model is all you need in speech separation, arXiv, 2404.02063
  - Kai Li, Guo Chen · (SPMamba - JusperLee)
-
Meet Hume’s Empathic Voice Interface (EVI), the first conversational AI with emotional intelligence.
· (twitter)
· (mp.weixin.qq)
- speech-trident - ga642381
  - Awesome speech/audio LLMs, representation learning, and codec models
- Large-Audio-Models - liusongxiang
  - Keeps track of big models in the audio domain, including speech, singing, music, etc.
- Awesome-Speech-Generation - kuan2jiu99
  - A survey of speech generation work
- Speech-Prompts-Adapters - ga642381
  - This repository surveys papers focusing on prompting and adapters for speech processing