Skip to content

[NeurIPS 2023 - ML for Audio Workshop (Oral)] Zero-shot audio captioning with audio-language model guidance and audio context keywords

Notifications You must be signed in to change notification settings

ExplainableML/ZerAuCap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 

Repository files navigation

Zero-shot audio captioning with audio-language model guidance and audio context keywords

NeurIPS - ML for Audio Workshop
PWC PWC
python pytorch black license

Description

This repository is the official implementation of the NeurIPS 2023 - Machine Learning for Audio Workshop (Oral) Zero-shot audio captioning with audio-language model guidance and audio context keywords by Leonard Salewski, Stefan Fauth, A. Sophia Koepke, and Zeynep Akata from the University of Tübingen and the Tübingen AI Center. You can find an ArXiv pre-print here.

Abstract

Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose ZerAuCap, a novel framework for summarising such general audio signals in a text caption without requiring task-specific training. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.

Code

Code is coming soon.

Citation

Please cite our work with the following bibtex key.

@article{Salewski2023ZeroShotAudio,
  title   = {Zero-shot audio captioning with audio-language model guidance and audio context keywords},
  author  = {Leonard Salewski and Stefan Fauth and A. Sophia Koepke and Zeynep Akata},
  year    = {2023},
  journal = {arxiv:2311.08396},
}

You can also find our work on Google Scholar and Semantic Scholar.

Funding and Acknowledgments

The authors thank IMPRS-IS for supporting Leonard Salewski. This work was partially funded by the BMBF Tübingen AI Center (FKZ: 01IS18039A), DFG (EXC number 2064/1 – Project number 390727645), and ERC (853489-DEXIM).

License

This repository is licensed under the MIT License.

About

[NeurIPS 2023 - ML for Audio Workshop (Oral)] Zero-shot audio captioning with audio-language model guidance and audio context keywords

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published