Awesome Audio Gallery

Detection

  • Retrieval-Augmented Audio Deepfake Detection, arXiv, 2404.13892, arxiv, pdf, citation: -1

    Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

  • Cross-Domain Audio Deepfake Detection: Dataset and Analysis, arXiv, 2404.04904, arxiv, pdf, citation: -1

    Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang

  • Detecting Multimedia Generated by Large AI Models: A Survey, arXiv, 2402.00045, arxiv, pdf, citation: -1

    Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, Shu Hu · (Detect-LAIM-generated-Multimedia-Survey - Purdue-M2)

  • Proactive Detection of Voice Cloning with Localized Watermarking, arXiv, 2401.17264, arxiv, pdf, citation: -1

    Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar · (audioseal - facebookresearch)

Courses and Tutorials

DLHLP 2020 Spring

Large Audio Model

Other

Speech Translation

  • gentranslate - yuchen005

    Code for paper "GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators"

  • PolyVoice: Language Models for Speech to Speech Translation, arXiv, 2306.02982, arxiv, pdf, citation: -1

    Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng · (speechtranslation.github)

  • fairseq - facebookresearch

Toolkits

  • speechbrain - speechbrain

    A PyTorch-based Speech Toolkit · (huggingface)

  • demucs - facebookresearch

    Code for the paper Hybrid Spectrogram and Waveform Source Separation

  • Resemblyzer - resemble-ai

    A Python package to analyze and compare voices with deep learning

  • FRCRN - alibabasglab

  • seewav - adefossez

    Audio waveform visualisation, converts any audio to a nice video · (huggingface)

  • NeMo-text-processing - NVIDIA

    NeMo text processing for ASR and TTS

  • OpenPhonemizer - NeuralVox

    Permissively licensed, open sourced, local IPA Phonemizer (G2P) powered by deep learning.

  • fullstop-deep-punctuation-prediction - oliverguhr

    A model that predicts the punctuation of English, Italian, French and German texts. · (huggingface)

  • nendo - okio-ai

    The Nendo AI Audio Tool Suite

  • deepfilternet - rikorose

    Noise suppression using deep filtering

  • CharsiuG2P - lingjzhu

    Multilingual G2P in 100 languages

  • (Inverse) Text Normalization — NVIDIA NeMo

  • audio-preprocess - fishaudio

    Preprocess Audio for training

  • pyloudnorm - csteinmetz1

    Flexible audio loudness meter in Python with implementation of ITU-R BS.1770-4 loudness algorithm

  • ultimatevocalremovergui - Anjok07

    GUI for a Vocal Remover that uses Deep Neural Networks.

  • Amphion: An Open-Source Audio, Music and Speech Generation Toolkit, arXiv, 2312.09911, arxiv, pdf, citation: -1

    Xueyao Zhang, Liumeng Xue, Yuancheng Wang, Yicheng Gu, Xi Chen, Zihao Fang, Haopeng Chen, Lexiao Zou, Chaoren Wang, Jun Han · (huggingface) · (Amphion - open-mmlab)

  • resemble-enhance - resemble-ai

    AI powered speech denoising and enhancement

  • awesome-python - vinta

    A curated list of awesome Python frameworks, libraries, software and resources

  • pyannote-audio - pyannote

    Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

  • audiocraft - facebookresearch

    Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

  • audio-slicer - openvpi

    Python script that slices audio with silence detection

  • autocut - mli

  • phonemizer - bootphon

    Simple text to phones converter for multiple languages

  • g2pM - kakaobrain

    A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset

  • g2pC - Kyubyong

    g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese

  • AcademiCodec - yangdongchao

    AcademiCodec: An Open Source Audio Codec Model for Academic Research

  • Python-Wrapper-for-World-Vocoder - JeremyCCHsu

    A Python wrapper for the high-quality vocoder "World"

  • g2p-kd - sigmeta

    Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion

  • ffmpeg-normalize - slhck

    Audio Normalization for Python/ffmpeg

  • Montreal-Forced-Aligner - MontrealCorpusTools

    Command line utility for forced alignment using Kaldi

  • WeTextProcessing - wenet-e2e

    Text Normalization & Inverse Text Normalization

  • python-pinyin - mozillazg

    Converts Chinese characters to pinyin (pypinyin)

  • pypinyin-g2pW - mozillazg

    Improves the accuracy of pypinyin using g2pW

  • MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra, arXiv, 2305.13686, arxiv, pdf, citation: -1

    Ye-Xin Lu, Yang Ai, Zhen-Hua Ling · (MP-SENet - yxlu-0102)

Dataset

  • links_to_pocasts_lecture_and_shows_for_tts - laion 🤗

  • Audio Dialogues: Dialogues dataset for audio and music understanding, arXiv, 2404.07616, arxiv, pdf, citation: -1

    Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro · (audiodialogues.github)

  • yodas - espnet 🤗

  • An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation, arXiv, 2402.16380, arxiv, pdf, citation: -1

    Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio Minazzi, Nicola Sobieski, Sebastien Bratieres

  • common_voice_17_0 - mozilla-foundation 🤗

  • common_voice_16_0 - mozilla-foundation 🤗

  • GigaSpeech - SpeechColab

    Large, modern dataset for speech recognition

  • Chinese and English data collection

  • voice_datasets - jim-schwoebel

    🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).

TTS

  • EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis, arXiv, 2308.05725, arxiv, pdf, citation: -1

    Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid · (huggingface) · (textlesslib - facebookresearch)

  • StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations, arXiv, 2404.14946, arxiv, pdf, citation: -1

    Sen Liu, Yiwei Guo, Xie Chen, Kai Yu · (goarsenal.github)

  • mls_eng_10k - parler-tts 🤗

  • LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus, arXiv, 2305.18802, arxiv, pdf, citation: -1

    Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna · (openslr) · (google.github)

  • Libri-Light: A Benchmark for ASR with Limited or No Supervision, ICASSP 2020 - 2020 IEEE International Conference on Acoustics …, 2020, arxiv, pdf, citation: 483

    Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen · (libri-light - facebookresearch)

  • MLS: A Large-Scale Multilingual Dataset for Speech Research, arXiv, 2012.03411, arxiv, pdf, citation: -1

    Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert · (openslr)

    • The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
    • The audio files are segmented into 10-20 second segments.
  • LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech, arXiv, 1904.02882, arxiv, pdf, citation: -1

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu

  • Hi-Fi Multi-Speaker English TTS Dataset, arXiv, 2104.01497, arxiv, pdf, citation: -1

    Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang · (openslr)

Audio Techs

  • EfficientSpeech: An On-Device Text to Speech Model, arXiv, 2305.13905, arxiv, pdf, citation: -1

    Rowel Atienza

  • Visual-Aware Text-to-Speech, arXiv, 2306.12020, arxiv, pdf, citation: -1

    Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

  • Proactive Detection of Voice Cloning with Localized Watermarking, arXiv, 2401.17264, arxiv, pdf, citation: -1

    Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar

  • FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation, arXiv, 2401.04283, arxiv, pdf, citation: -1

    Yang Liu, Li Wan, Yun Li, Yiteng Huang, Ming Sun, James Luan, Yangyang Shi, Xin Lei

  • AudioSep - Audio-AGI

    Official implementation of "Separate Anything You Describe"

  • AudioSR: Versatile Audio Super-resolution at Scale, arXiv, 2309.07314, arxiv, pdf, citation: -1

    Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley

Audio Visual

  • AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue, arXiv, 2403.16276, arxiv, pdf, citation: -1

    Yunlong Tang, Daiki Shimada, Jing Bi, Chenliang Xu

  • M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset, arXiv, 2403.14168, arxiv, pdf, citation: -1

    Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang

  • Text-to-Audio Generation Synchronized with Videos, arXiv, 2403.07938, arxiv, pdf, citation: -1

    Shentong Mo, Jing Shi, Yapeng Tian

  • Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners, arXiv, 2402.17723, arxiv, pdf, citation: -1

    Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen · (yzxing87.github)

  • vsp-llm - sally-sh

  • RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation, arXiv, 2309.17189, arxiv, pdf, citation: -1

    Samuel Pegg, Kai Li, Xiaolin Hu · (jiqizhixin) · (cslikai) · (RTFS-Net - spkgyk)

Emotion Recognition

  • Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition, arXiv, 2403.19224, arxiv, pdf, citation: -1

    Siyuan Shen, Yu Gao, Feng Liu, Hanyang Wang, Aimin Zhou · (ENT - ECNU-Cross-Innovation-Lab)

Speech Separation

  • SPMamba: State-space model is all you need in speech separation, arXiv, 2404.02063, arxiv, pdf, citation: -1

    Kai Li, Guo Chen · (SPMamba - JusperLee)

Products

Extra Reference

  • speech-trident - ga642381

    Awesome speech/audio LLMs, representation learning, and codec models

  • Large-Audio-Models - liusongxiang

    Keep track of big models in audio domain, including speech, singing, music etc.

  • Awesome-Speech-Generation - kuan2jiu99

    Survey on speech generation work.

  • Speech-Prompts-Adapters - ga642381

    This repository surveys papers focusing on prompting and adapters for speech processing.