cannot replicate DPO results of zephyr #124
Related to #45
I ran into a similar problem: my model's MT-Bench scores (first turn, second turn, and average) come out lower than the official ones. Models ending in '-ref' are the official checkpoints from Hugging Face; models ending in '-self' are my reproductions of the experiment.
Experiencing similar issues here. The replicated model scores about 0.3 lower than the published zephyr-7b-dpo-full reported in the blog post. My setup:

- Evaluation: FastChat's inference script with an empty system message
- Training: this repo

In addition, the training statistics when training Zephyr-7B with beta=0.01 are very different from what was published. I checked against the published DPO training statistics of zephyr-7b-dpo-full at epoch 0.84. Below I list the values from our run, with the reported values in parentheses.
The difference in rewards/accuracies looks alarming. Any idea what the cause could be, @lewtun? @AlexiaJM @xijiu9, let me know if you make any progress replicating! Best,

A reward accuracy of 0.33 doesn't seem reasonable at epoch 0.84.
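For context on why 0.33 is alarming: in DPO trainers such as TRL's `DPOTrainer`, the implicit reward is beta times the policy-vs-reference log-probability margin, and rewards/accuracies is the fraction of preference pairs where the chosen response's reward beats the rejected one's (so 0.5 is chance level, and well-trained runs should be clearly above it). A minimal sketch of that computation, with hypothetical function and argument names:

```python
def dpo_reward_accuracy(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.01):
    """Fraction of pairs where the chosen response's implicit DPO reward
    exceeds the rejected response's. Inputs are per-example sequence
    log-probabilities; names are illustrative, not the TRL API."""
    # Implicit DPO reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))
    chosen_rewards = [beta * (p - r)
                      for p, r in zip(policy_chosen_logps, ref_chosen_logps)]
    rejected_rewards = [beta * (p - r)
                        for p, r in zip(policy_rejected_logps, ref_rejected_logps)]
    wins = sum(c > rj for c, rj in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)
```

An accuracy stuck near 1/3 late in training would suggest the policy is preferring the rejected completion on most pairs, which points at a data-ordering, reference-model, or hyperparameter mismatch rather than noise.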
I cannot replicate the DPO results for zephyr.
I use a modified version of config_full.yaml; the only difference is that I set gradient_accumulation_steps: 4 instead of 2 because I use 4 GPUs. I'm using all the correct software versions from setup.py. I resumed twice during training, which is unavoidable on our cluster, but if resuming sets seeds properly, this should not be a problem.
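On the resumption point: deterministic resuming requires, at minimum, re-seeding every RNG the same way on each restart (full reproducibility also needs the data sampler and optimizer state, which Accelerate's checkpointing restores). A minimal sketch of that seeding, along the lines of what `transformers.set_seed` does (the helper name here is hypothetical):

```python
import random

def set_all_seeds(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch if they are installed,
    so repeated runs draw identical random streams."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)          # CPU RNG
        torch.cuda.manual_seed_all(seed) # all GPU RNGs (no-op without CUDA)
    except ImportError:
        pass
```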
Code:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
  --num_processes=4 \
  scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full4.yaml
```
The results are here: https://huggingface.co/AlexiaJM/zephyr-7b-dpo-full-repnew. As you can see, the numbers are slightly off from https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full, but not significantly.
These are the results from MT-Bench:

| model | first turn | second turn | average |
|---|---|---|---|
| zephyr-7b-dpo-full | 7.81250 | 7.322785 | 7.569182 |
| zephyr-7b-dpo-full-repnew | 7.5375 | 7.125 | 7.33125 |