cannot replicate DPO results of zephyr #124
Related to #45
I ran into a similar problem: my model's MT-Bench scores (first turn, second turn, and average) come out lower than the official ones. Models ending in '-ref' are the official checkpoints from Hugging Face; models ending in '-self' are my reproductions of the experiment.
Experiencing similar issues here. The replicated model scores about 0.3 lower than the published zephyr-7b-dpo-full reported in the blog post. My setup:

- Evaluation: FastChat's inference script with an empty system message
- Training: this repo

In addition, the training statistics when training Zephyr-7B with beta=0.01 are very different from what was published. I checked against the published DPO training statistics of zephyr-7b-dpo-full at epoch 0.84. Below I list the values from our run, with the reported values in parentheses.
The difference in rewards/accuracies looks alarming. Any idea what the cause could be, @lewtun? @AlexiaJM @xijiu9, let me know if you make any progress replicating! Best,

A reward accuracy of 0.33 doesn't seem reasonable at epoch 0.84.
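For context on why 0.33 is alarming: in DPO trainers such as TRL's `DPOTrainer`, the implicit reward is beta times the policy-vs-reference log-probability margin, and rewards/accuracies is the fraction of preference pairs where the chosen response's reward beats the rejected one's (so 0.5 is chance level, and well-trained runs should be clearly above it). A minimal sketch of that computation, with hypothetical function and argument names:

```python
def dpo_reward_accuracy(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.01):
    """Fraction of pairs where the chosen response's implicit DPO reward
    exceeds the rejected response's. Inputs are per-example sequence
    log-probabilities; names are illustrative, not the TRL API."""
    # Implicit DPO reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))
    chosen_rewards = [beta * (p - r)
                      for p, r in zip(policy_chosen_logps, ref_chosen_logps)]
    rejected_rewards = [beta * (p - r)
                        for p, r in zip(policy_rejected_logps, ref_rejected_logps)]
    wins = sum(c > rj for c, rj in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)
```

An accuracy stuck near 1/3 late in training would suggest the policy is preferring the rejected completion on most pairs, which points at a data-ordering, reference-model, or hyperparameter mismatch rather than noise.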
I cannot replicate the DPO results for zephyr.
I use a modified version of config_full.yaml; the only difference is that I set gradient_accumulation_steps: 4 instead of 2 because I use 4 GPUs. I'm using all the correct software versions from setup.py. I resumed twice during training, which is unavoidable on our cluster, but if resuming sets seeds properly, this should not be a problem.
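On the resumption point: deterministic resuming requires, at minimum, re-seeding every RNG the same way on each restart (full reproducibility also needs the data sampler and optimizer state, which Accelerate's checkpointing restores). A minimal sketch of that seeding, along the lines of what `transformers.set_seed` does (the helper name here is hypothetical):

```python
import random

def set_all_seeds(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch if they are installed,
    so repeated runs draw identical random streams."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)          # CPU RNG
        torch.cuda.manual_seed_all(seed) # all GPU RNGs (no-op without CUDA)
    except ImportError:
        pass
```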
Code:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
  --num_processes=4 \
  scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full4.yaml
```
The results are here: https://huggingface.co/AlexiaJM/zephyr-7b-dpo-full-repnew. As you can see, the numbers are slightly off from https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full, but not significantly.
These are the results from MT-Bench:

| model | first turn | second turn | average |
|---|---|---|---|
| zephyr-7b-dpo-full | 7.81250 | 7.322785 | 7.569182 |
| zephyr-7b-dpo-full-repnew | 7.5375 | 7.125 | 7.33125 |