Getting Started

Below we provide instructions for training and inference on audio and vision-language tasks. Pretrained and finetuned checkpoints are provided in checkpoints.md.

We recommend organizing your workspace directory like this:

ONE-PEACE/
├── assets/
├── fairseq/
├── one_peace/
│   ├── checkpoints
│   │   ├── one-peace.pt
│   ├── criterions
│   ├── data
│   ├── dataset
│   │   ├── esc50/
│   │   ├── flickr30k/
│   ├── metrics
│   └── ...
├── .gitignore
├── LICENSE
├── README.md
├── checkpoints.md
├── datasets.md
├── requirements.txt
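
Before launching a script, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: `check_layout` is a hypothetical helper, and the entries it checks are copied from the tree above. The `esc50` and `flickr30k` folders are only examples, so adjust the list to the datasets your task actually needs.

```shell
# Hypothetical helper: report which entries of the recommended layout
# are missing under a root directory (defaults to the current directory).
check_layout() {
  root="${1:-.}"
  missing=0
  for entry in \
    fairseq \
    one_peace/checkpoints/one-peace.pt \
    one_peace/dataset/esc50 \
    one_peace/dataset/flickr30k \
    requirements.txt
  do
    if [ ! -e "$root/$entry" ]; then
      echo "missing: $entry"
      missing=1
    fi
  done
  # Exit status 0 means every listed entry exists.
  return "$missing"
}
```

Running `check_layout ONE-PEACE` from the parent directory prints one `missing:` line per absent entry and returns a non-zero status if anything is missing.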

Please note that if your device does not support bf16 precision, you can switch to fp16 precision for fine-tuning or inference by changing the precision options under common in the config:

common:
  # # use bf16
  # fp16: false
  # memory_efficient_fp16: false
  # bf16: true
  # memory_efficient_bf16: true

  # use fp16
  fp16: true
  memory_efficient_fp16: true
  bf16: false
  memory_efficient_bf16: false
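
If you are unsure which variant applies, the choice can be derived from a bf16 capability check. The sketch below is an illustration only: `precision_flags` is a hypothetical helper that prints the four keys shown above, and it assumes you pass it the string `True` when the device supports bf16.

```shell
# Hypothetical helper: print the four precision keys for the common section.
# $1 is "True" if the device supports bf16; any other value selects fp16.
precision_flags() {
  if [ "$1" = "True" ]; then
    use_bf16=true; use_fp16=false
  else
    use_bf16=false; use_fp16=true
  fi
  printf '  fp16: %s\n  memory_efficient_fp16: %s\n  bf16: %s\n  memory_efficient_bf16: %s\n' \
    "$use_fp16" "$use_fp16" "$use_bf16" "$use_bf16"
}
```

One way to obtain the argument, assuming PyTorch is installed, is `precision_flags "$(python -c 'import torch; print(torch.cuda.is_bf16_supported())')"`. Paste the printed lines under `common:` in the config you are using.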

Pretraining

The overall pretraining process of ONE-PEACE is divided into two stages: vision-language pretraining and audio-language pretraining.

Vision-Language Pretraining (Stage1 Pretraining)

Here we provide an example of vision-language pretraining.

  1. Download COCO. You can also replace COCO with your own datasets.
  2. Pretraining
cd one_peace/run_scripts/pretrain
bash pretrain_vl_3B.sh

Audio-Language Pretraining (Stage2 Pretraining)

In the audio-language pretraining stage, we initialize the model with the checkpoint from vision-language pretraining and train it on audio-text pairs.

  1. Download AudioCaps, Clotho and MACS. You can also prepare your own datasets.
  2. Pretraining. Remember to load the pretrained checkpoint from vision-language pretraining.
cd one_peace/run_scripts/pretrain
bash pretrain_al_3B.sh

Finetuning and Inference

ESC-50

  1. Download ESC-50
  2. Inference
cd one_peace/run_scripts/esc50
bash zero_shot_evaluate.sh

Image-Text Retrieval

  1. Download COCO and Flickr30K
  2. Finetuning
cd one_peace/run_scripts/image_text_retrieval
bash finetune_coco.sh
bash finetune_flickr.sh
  3. Inference
cd one_peace/run_scripts/image_text_retrieval
bash zero_shot_evaluate_coco.sh  # zero-shot retrieval for COCO
bash zero_shot_evaluate_flickr.sh  # zero-shot retrieval for Flickr30K
bash evaluate_coco.sh  # evaluation for COCO
bash evaluate_flickr.sh  # evaluation for Flickr30K

NLVR2

  1. Download NLVR2
  2. Finetuning
cd one_peace/run_scripts/nlvr2
bash finetune.sh
  3. Inference
cd one_peace/run_scripts/nlvr2
bash evaluate.sh

Visual Grounding

  1. Download RefCOCO, RefCOCO+ and RefCOCOg
  2. Finetuning
cd one_peace/run_scripts/visual_grounding
bash finetune_refcoco.sh
bash finetune_refcoco+.sh
bash finetune_refcocog.sh
  3. Inference
cd one_peace/run_scripts/visual_grounding
bash evaluate_refcoco.sh  # evaluation for RefCOCO
bash evaluate_refcoco+.sh  # evaluation for RefCOCO+
bash evaluate_refcocog.sh  # evaluation for RefCOCOg

VQA

  1. Download VQAv2
  2. Finetuning
cd one_peace/run_scripts/vqa
bash finetune.sh
  3. Inference
cd one_peace/run_scripts/vqa
bash evaluate.sh

Audio-Text Retrieval

  1. Download AudioCaps, Clotho and MACS
  2. Finetuning
cd one_peace/run_scripts/audio_text_retrieval
bash finetune.sh
  3. Inference
cd one_peace/run_scripts/audio_text_retrieval
bash evaluate.sh

Audio Question Answering (AQA)

  1. Download AVQA
  2. Finetuning
cd one_peace/run_scripts/aqa
bash finetune.sh
  3. Inference
cd one_peace/run_scripts/aqa
bash evaluate.sh

FSD50K

  1. Download FSD50K
  2. Finetuning
cd one_peace/run_scripts/fsd50k
bash finetune.sh
  3. Inference
cd one_peace/run_scripts/fsd50k
bash evaluate.sh

VGGSound

  1. Download VGGSound
  2. Finetuning
cd one_peace/run_scripts/vggsound
bash finetune.sh
  3. Inference
cd one_peace/run_scripts/vggsound
bash evaluate.sh
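
Several tasks above (NLVR2, VQA, Audio-Text Retrieval, AQA, FSD50K and this one) share the same two steps: `finetune.sh` followed by `evaluate.sh` in the task's directory. As a convenience, those steps can be wrapped in a small function. This is a sketch under that assumption: `run_task` is a hypothetical helper, not a script shipped with the repository, and it expects to be called from the ONE-PEACE root.

```shell
# Hypothetical helper: fine-tune and then evaluate one task.
# $1 is a directory name under one_peace/run_scripts (e.g. vqa, nlvr2, fsd50k).
run_task() {
  task_dir="one_peace/run_scripts/$1"
  if [ ! -d "$task_dir" ]; then
    echo "no such task directory: $task_dir" >&2
    return 1
  fi
  # The subshell keeps the caller's working directory unchanged;
  # evaluation only starts if fine-tuning exits successfully.
  ( cd "$task_dir" && bash finetune.sh && bash evaluate.sh )
}
```

For example, `run_task vqa` runs both steps for VQA. Tasks with differently named scripts, such as image-text retrieval and visual grounding, are not covered by this helper.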