This repository includes the official implementation of our paper "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"


Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics

Haoqin Tu*, Bingchen Zhao*, Chen Wei, Cihang Xie (*Equal Contribution)


Our paper is now available online: https://arxiv.org/abs/2309.07120

Installation

Please follow the LLaVA instructions to set up the training environment.
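
For reference, a minimal setup sketch is shown below. It assumes this repository keeps LLaVA's packaging (an editable pip install from the repo root) and that deepspeed is installed for the training commands later in this README; the environment name is arbitrary, and the LLaVA instructions remain authoritative.
conda create -n sight-beyond-text python=3.10 -y   # environment name is arbitrary
conda activate sight-beyond-text
git clone https://github.com/UCSC-VLAA/Sight-Beyond-Text.git
cd Sight-Beyond-Text
pip install --upgrade pip
pip install -e .        # assumes this fork keeps LLaVA's setup files
pip install deepspeed   # needed for the deepspeed launcher used below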

Model Weights

We list all model and vision-text projector weights used in the paper below.

Model            Pretrain Weights   Instruction Tuned Weights
LLaMA-7B         ckpt               Finetune ckpt
Vicuna-7B        ckpt               Finetune ckpt
LLaMA-3B         ckpt               Finetune ckpt / LoRA ckpt
Alpaca-3B        ckpt               Finetune ckpt / LoRA ckpt
LLaMA2-7B        ckpt               Finetune ckpt / LoRA ckpt
LLaMA2-chat-7B   ckpt               Finetune ckpt / LoRA ckpt

Evaluations

For NLP and multi-modal data and evaluations, please see the instructions here.

Model Training

We follow the training paradigm of LLaVA, which consists of two stages: (1) feature alignment: use approximately 600K filtered CC3M image-text pairs to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning: use the filtered 80K GPT-generated visual instruction data (see here) to teach the model to follow multimodal instructions.

Feature Alignment Training

Please download the subset of the CC3M dataset we use in the paper here. You can check the pretraining script below.
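
If the subset is the LLaVA-CC3M-Pretrain-595K release on Hugging Face (an assumption; the download link above is authoritative), one way to fetch and unpack it is:
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K
unzip LLaVA-CC3M-Pretrain-595K/images.zip -d /path/to/cc3m_595k_images
# point --data_path at the caption/chat JSON in the cloned folder and --image_folder at the unzipped images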

Pretrain: LLaMA2-7B.
deepspeed llava/train/train.py --deepspeed scripts/zero3.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --version v0 \
    --data_path /path/to/cc3m_595k.json \
    --image_folder /path/to/cc3m_595k_images \
    --vision_tower openai/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/MM-LLaMA2-7B-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Visual Instruction Tuning

  1. Data preparation: Please download llava_instruct_80k.json and the COCO train2017 images here (a download sketch follows this list).
  2. Training: You can download our pretrained projector here, and check the finetuning script or LoRA tuning script.
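
A download sketch for step 1 is shown below; the COCO URL is the official one, while hosting of llava_instruct_80k.json in the LLaVA-Instruct-150K dataset repository is an assumption, so prefer the links above if they differ.
wget http://images.cocodataset.org/zips/train2017.zip   # COCO train2017 images (large download)
unzip train2017.zip -d /path/to/coco/                   # yields /path/to/coco/train2017/
# assumed hosting of the instruction file:
wget https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/llava_instruct_80k.json
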
Visual Instruction Tuning: MM-LLaMA2-7B-ft.
deepspeed llava/train/train.py --deepspeed scripts/zero2.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --version llava_llama_2 \
    --data_path path/to/llava_instruct_80k.json \
    --image_folder /path/to/coco/train2017/ \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/MM-LLaMA2-7B-ft \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
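
For the LoRA checkpoints listed in the table above, please refer to the linked LoRA tuning script. As a rough sketch only, assuming this fork inherits upstream LLaVA's LoRA arguments (--lora_enable, --lora_r, --lora_alpha) and keeping the other hyperparameters from the command above (the linked script is authoritative for both the flags and their values), the LoRA variant would look roughly like the following.
LoRA Tuning (sketch): MM-LLaMA2-7B-lora.
deepspeed llava/train/train.py --deepspeed scripts/zero2.json \
    --lora_enable True \
    --lora_r 64 \
    --lora_alpha 16 \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --version llava_llama_2 \
    --data_path path/to/llava_instruct_80k.json \
    --image_folder /path/to/coco/train2017/ \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/MM-LLaMA2-7B-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb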

Usage and License Notices

The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Citation

If you find this repo useful for your research and applications, please cite using this BibTeX:

@article{tu2023sight,
  title={Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics},
  author={Tu, Haoqin and Zhao, Bingchen and Wei, Chen and Xie, Cihang},
  journal={arXiv preprint arXiv:2309.07120},
  year={2023}
}

Acknowledgement

This work is partially supported by a gift from Open Philanthropy. We thank the Center for AI Safety for supporting our computing needs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

Related Projects

  • Our training code is largely borrowed from LLaVA, which is truly an amazing resource.
