
DPO training problems #41

Open
chanel111 opened this issue May 6, 2024 · 5 comments

@chanel111

DPO training newbie here, hoping for some advice. I tried DPO training with llama-3-8b-instruct, using Chinese and English DPO data found on HF. After 4 epochs the loss had dropped to around 0.1, but when I tested the model it not only failed to improve, it developed all kinds of problems; even on prompts taken straight from the DPO training set it produces repetitive, nonsensical answers.

Here is my training code; I'm not sure whether there is a bug somewhere.

import torch

from transformers import AutoTokenizer, TrainingArguments, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import load_dataset
from trl import DPOTrainer
from peft import LoraConfig

output_dir = "./llama3_dpo_lora_result/"
model_name = "./Meta-Llama-3-8B-Instruct/"

dataset = load_dataset("json", data_files="./dpo_train_data_sample.json", split="train")

# 4-bit NF4 quantization for the policy model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


def print_trainable_parameters(input_model):
    """Prints the number of trainable parameters in the model."""
    trainable_params = 0
    all_param = 0
    for _, param in input_model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


def return_prompt_and_responses(samples):
    # Wrap each prompt in the Llama 3 chat template; chosen/rejected keep only the answer text
    return {
        "prompt": [
            f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            for input in samples["prompt"]
        ],
        "chosen": [f"{chose}<|eot_id|>" for chose in samples["chosen"]],
        "rejected": [f"{reject}<|eot_id|>" for reject in samples["rejected"]],
    }


original_columns = dataset.column_names

dataset = dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=original_columns,
)

peft_config = LoraConfig(
    lora_alpha=256,
    lora_dropout=0.05,
    r=128,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    num_train_epochs=8,
    save_steps=500,
    learning_rate=2e-6,
    bf16=True,
    save_total_limit=6,
    logging_steps=10,
    output_dir=output_dir,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    remove_unused_columns=False,
)

print_trainable_parameters(model)

dpo_trainer = DPOTrainer(
    model,
    ref_model=None,  # with a PEFT adapter, TRL uses the frozen base model as the reference
    peft_config=peft_config,
    args=training_args,
    beta=0.5,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=1024,
    max_length=2048,
)

dpo_trainer.train()
dpo_trainer.save_model(output_dir)

dpo_trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
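
For reference, this is roughly how I test the trained adapter afterwards (the reload path and generation settings are just my local setup, and the question string is a placeholder for one taken from dpo_train_data_sample.json):

from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tuned = PeftModel.from_pretrained(base, output_dir)  # attach the saved LoRA adapter
tuned.eval()

question = "..."  # placeholder: a question copied from dpo_train_data_sample.json
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    out = tuned.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),  # stop at the turn terminator
    )
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))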

@CrazyBoyM
Owner

It may be a dataset quality issue. Try this dataset instead: https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji
Also, a reminder not to train for multiple epochs, and experimenting with the beta coefficient is also very important.
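
Swapping it into your script could look roughly like the sketch below; the split and column names are assumptions on my side, so check dataset.column_names first and adjust the mapping if they differ.

from datasets import load_dataset

# Split/column names are assumptions -- inspect them before training.
dpo_dataset = load_dataset("shareAI/DPO-zh-en-emoji", split="train")
print(dpo_dataset.column_names)

# If it already exposes "prompt"/"chosen"/"rejected" columns, the
# return_prompt_and_responses mapping from your script can be reused as-is:
dpo_dataset = dpo_dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=dpo_dataset.column_names,
)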

@chanel111
Author


Thanks for the reply! One question: I sampled a few items from the DPO training set to test my trained model, but the model did not answer with the chosen responses from training (nor with the rejected ones, of course). Its answers don't look very different from the untrained model's. Is that normal, or should a model whose loss has converged answer training-set questions exactly according to the chosen responses? (I've only done SFT before and don't know DPO well.)

Also, may I ask how many epochs of DPO training are generally good, what loss value usually corresponds to good results, whether rewards/margins can be used as a metric of model quality, and whether there are any tuning strategies for parameters like beta?

@CrazyBoyM
Owner

rewards/margins can serve as a reference but isn't essential. Sweeping beta between 0.05 and 0.5 is definitely worth experimenting with; in general, training for just 1 epoch is recommended.
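
Concretely, a single-epoch beta sweep on top of your script could look roughly like this (the beta grid below is only an example):

# Single-epoch runs over a few beta values; in practice reload a fresh base model
# for each run so LoRA adapters/optimizer state don't carry over between runs.
for beta in (0.05, 0.1, 0.3, 0.5):
    sweep_args = TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,               # one pass over the data
        learning_rate=2e-6,
        bf16=True,
        logging_steps=10,
        output_dir=f"./llama3_dpo_lora_result/beta_{beta}",
        optim="paged_adamw_32bit",
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        remove_unused_columns=False,
    )
    trainer = DPOTrainer(
        model,
        ref_model=None,
        peft_config=peft_config,
        args=sweep_args,
        beta=beta,                        # the knob being swept
        train_dataset=dataset,
        tokenizer=tokenizer,
        max_prompt_length=1024,
        max_length=2048,
    )
    trainer.train()                       # compare rewards/margins curves across runs
    trainer.save_model(sweep_args.output_dir)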

@chanel111
Author


Got it, thanks!

@james-yw

james-yw commented Jun 5, 2024

@CrazyBoyM @chanel111 Hi, may I ask: when running DPO directly on Chinese data with Llama3-Instruct, did you also find that the model's responses become repetitive after DPO training? Is there a general, stable solution for this? Thanks.
