LLaVA Full Pipeline

English | 简体中文

Configs

  • ./${LLM}_${ViT}/ contains configs that align with LLaVA-InternLM settings (i.e., using LoRA / QLoRA).
  • ./official/ contains configs that align with LLaVA official settings.

Results

XTuner primarily promotes the LLM-QLoRA / ViT-LoRA LLaVA architecture, and the evaluation results on various datasets are as follows:

| Model | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | MMMU Dev | MathVista MiniTest | HallusionBench aAcc | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaVA-v1.5-7B (XTuner) | 67.7 | 69.2 | 61.0 | 59.7 | 28.4 | 1716 | 66.4 | 32.2 | 33.7 | 24.2 | 46.2 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
| LLaVA-v1.5-13B (XTuner) | 68.8 | 69.5 | 64.7 | 63.1 | 32.9 | 1766 | 67.9 | 35.9 | 35.2 | 26.2 | 46.9 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
| LLaVA-InternLM-7B (XTuner) | 69.0 | 68.5 | 66.7 | 63.8 | 37.3 | 1637 | 65.7 | 32.4 | 36.9 | 26.3 | 49.1 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
| LLaVA-InternLM2-7B (XTuner) | 73.3 | 74.6 | 71.7 | 72.0 | 42.5 | 1700 | 71.2 | 35.9 | 40.1 | 25.5 | 46.8 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
| LLaVA-InternLM2-20B (XTuner) | 75.1 | 73.5 | 73.7 | 72.8 | 46.3 | 1868 | 70.2 | 37.2 | 39.4 | 24.6 | 47.7 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |

When aligned completely with the official training settings, the results are as follows:

| Model | Framework | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | Configs |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaVA-v1.5-7B | Official | 65.2 | 63.0 | 57.3 | 57.4 | 25.2 | 1775 | 65.6 | 32.7 | - |
| LLaVA-v1.5-7B | XTuner | 68.6 | 68.0 | 61.5 | 61.4 | 26.5 | 1786 | 65.8 | 31.4 | Pretrain / Fine-tune |

Data Preparation

Please refer to the docs.

Training

The training of LLaVA consists of two steps: alignment module (i.e., MLP) pretraining and instruction-following fine-tuning.

Note: this guide takes 8-GPU training of LLaVA-InternLM2-7B as an example. If GPU resources or memory are insufficient, you can reduce the batch size appropriately to decrease memory consumption (see the config-copy sketch after the two steps below). By default, the pretrained projector is saved to, and re-loaded from, ./work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth.

  1. Alignment module pretraining (saved by default in ./work_dirs/)
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2
  2. Instruction-following fine-tuning (saved by default in ./work_dirs/)
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
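
If you need to lower the batch size as mentioned in the note above, one convenient option is to copy the built-in config and edit it locally before training. A minimal sketch using xtuner's copy-cfg subcommand; the copied filename below (with the _copy suffix) is an assumption, so adjust it to whatever the command actually produces.

# Copy the fine-tune config into the current directory so batch size / gradient accumulation can be edited
xtuner copy-cfg llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune .
# Train from the local (edited) copy; the exact filename may differ on your machine
NPROC_PER_NODE=8 xtuner train ./llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py --deepspeed deepspeed_zero2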

Model Conversion (and Merge)

After training, we will obtain a set of weights (i.e., iter_xxx.pth), which are not in the universal HuggingFace format. We first need to convert them.

xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune ./iter_5198.pth ./iter_5198_hf

At this point, we have obtained the relevant model (LLM or the corresponding LoRA).

Afterwards, if you want to merge LoRA into LLM or CLIP-ViT, please use the following command:

(LLM) xtuner convert merge $LLM $LLM_ADAPTER $SAVE_PATH
(CLIP) xtuner convert merge $CLIP $CLIP_ADAPTER $SAVE_PATH --is-clip
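
For example, using the converted output from the previous step, the merge could be filled in as below. The llm_adapter and visual_encoder_adapter subdirectory names and the output paths are assumptions for illustration; adjust them to the actual contents of your converted folder.

# Merge the LLM QLoRA adapter into the base InternLM2 model
xtuner convert merge internlm/internlm2-chat-7b ./iter_5198_hf/llm_adapter ./iter_5198_merged_llm
# Merge the ViT LoRA adapter into the base CLIP-ViT model
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_5198_hf/visual_encoder_adapter ./iter_5198_merged_visual_encoder --is-clip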

Chat

You can download the released LLaVA-InternLM2-7B model from 🤗 HuggingFace or 🤖 ModelScope, and perform image-text question answering with the following command:

xtuner chat internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --image $IMAGE_PATH

Here, --llava points to the converted weights from the step above (in our example, ./iter_5198_hf).
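
Similarly, to chat with your own fine-tuned model instead of the released one, point --llava at the locally converted directory; a sketch reusing the example path from the conversion step:

xtuner chat internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./iter_5198_hf \
  --prompt-template internlm2_chat \
  --image $IMAGE_PATH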

Evaluation

XTuner's LLaVA models can be evaluated using VLMEvalKit.

For convenience, XTuner also integrates the MMBench evaluation.

Users can download the MMBench datasets with

wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
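
Equivalently, all five files above can be fetched in one scripted loop (same URLs, just iterated):

for f in MMBench_DEV_EN MMBench_TEST_EN MMBench_DEV_CN MMBench_TEST_CN CCBench; do
  wget "https://opencompass.openxlab.space/utils/VLMEval/${f}.tsv"
done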

After that, the evaluations can be run with

xtuner mmbench internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --data-path $DATA_PATH \
  --work-dir $RESULT_PATH

Here, $DATA_PATH refers to one of the datasets downloaded as mentioned above, such as MMBench_DEV_EN.tsv.
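
For example, a run on the English dev split downloaded above could look like this (the --work-dir value is just an illustrative choice):

xtuner mmbench internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --data-path ./MMBench_DEV_EN.tsv \
  --work-dir ./work_dirs/mmbench_dev_en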

After the evaluation is completed, the results of the dev sets are printed directly; for the test sets, you need to submit mmbench_result.xlsx to the official MMBench evaluation to obtain the final accuracy scores.

RefCOCO

To evaluate your model on RefCOCO, first download the evaluation data files from the link. Then, evaluate your model with the following command.

xtuner eval_refcoco $LLM \
  --visual-encoder $VISUAL_ENCODER \
  --llava $LLAVA_PATH \
  --prompt-template $PROMPT_TEMPLATE \
  --data-path $DATA_PATH \
  --work-dir $RESULT_PATH
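
As a concrete instance with the released LLaVA-InternLM2-7B weights, the command could be filled in as below; the --data-path and --work-dir values are placeholders for wherever you stored the downloaded RefCOCO evaluation files and want the results written.

xtuner eval_refcoco internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --data-path ./data/refcoco \
  --work-dir ./work_dirs/refcoco_eval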