Reminder
Reproduction
The script is as follows:
deepspeed --num_gpus 4 src/train_bash.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /data/applications/LMD-BF/backend/BaseModels/internlm2-chat-20b-sft/internlm2-chat-20b-sft/ \
    --dataset identity \
    --template intern2 \
    --finetuning_type lora \
    --lora_target wqkv \
    --output_dir saves/internlm2-chat-20b-sft/lora/sft \
    --overwrite_cache true \
    --overwrite_output_dir true \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 2 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 10.0 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
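DeepSpeed ZeRO stages 0, 1, 2, and 3 all fail with an error, which appears to be a GPU out-of-memory condition.
With offload enabled, training hangs at this step.
How can this be resolved?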
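For context, here is a rough estimate of the weight memory involved. It is only a sketch under stated assumptions: the parameter count of about 20 billion is inferred from the model name, each fp16 parameter is taken to occupy 2 bytes, and the helper `fp16_weight_gib` is a hypothetical name used for illustration. LoRA adapter weights, activations, optimizer state, and framework overhead are all ignored, so real usage will be higher.

```python
# Rough estimate only: base model weights in fp16, ignoring LoRA parameters,
# activations, optimizer state, and framework overhead.
GIB = 1024 ** 3


def fp16_weight_gib(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GiB) needed to hold fp16 weights.

    ZeRO stages 0-2 keep a full copy of the parameters on every GPU;
    stage 3 partitions the parameters across the GPUs.
    """
    bytes_total = num_params * 2  # assumption: 2 bytes per fp16 parameter
    if zero_stage == 3:
        bytes_total /= num_gpus
    return bytes_total / GIB


if __name__ == "__main__":
    # Assumption: ~20e9 parameters (from the model name), 4 GPUs (--num_gpus 4).
    for stage in (0, 1, 2, 3):
        print(f"stage {stage}: ~{fp16_weight_gib(20e9, 4, stage):.1f} GiB per GPU")
```

Under these assumptions, stages 0 to 2 would need every GPU to hold the full set of fp16 weights, which may by itself explain the out-of-memory errors on those stages, depending on the memory available per GPU.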
Expected behavior
No response
System Info
No response
Others
No response