Multi-card training problem on Ascend #3810

Open
1 task done
1737686924 opened this issue May 19, 2024 · 0 comments
Labels
pending This problem is yet to be addressed.

Comments

@1737686924

Reminder

  • I have read the README and searched the existing issues.

Reproduction

The script is as follows:
deepspeed --num_gpus 4 src/train_bash.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /data/applications/LMD-BF/backend/BaseModels/internlm2-chat-20b-sft/internlm2-chat-20b-sft/ \
    --dataset identity \
    --template intern2 \
    --finetuning_type lora \
    --lora_target wqkv \
    --output_dir saves/internlm2-chat-20b-sft/lora/sft \
    --overwrite_cache true \
    --overwrite_output_dir true \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 2 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 10.0 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
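With ZeRO stage 0, 1, 2, or 3, every run fails with an error that looks like the devices running out of memory:
(screenshot: 2ef85aebbc28dedf3a2b03c28471e72)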
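
For reference, here is a rough back-of-the-envelope sketch of the weight memory alone, which may explain part of the out-of-memory errors. It assumes about 20 billion parameters (read off the "20b" in the model name), 2 bytes per parameter because of `--fp16`, and 4 devices as in `--num_gpus 4`. The helper name `weight_memory_gib` is made up for this illustration. Activations, LoRA adapters, gradients, and optimizer state are ignored, and the per-device memory of the Ascend cards is not stated in this issue, so this is only an estimate.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params, bytes_per_param=2, shards=1):
    """Approximate per-device memory (GiB) needed just to hold the model weights.

    shards=1 models a full replica on every device (ZeRO stages 0, 1, 2).
    shards=N models parameters partitioned across N devices (ZeRO stage 3).
    """
    return num_params * bytes_per_param / shards / GIB


# Assumptions, not measurements:
NUM_PARAMS = 20e9   # "20b" taken from the model name
NUM_DEVICES = 4     # from --num_gpus 4

replicated = weight_memory_gib(NUM_PARAMS)                    # ZeRO 0/1/2
sharded = weight_memory_gib(NUM_PARAMS, shards=NUM_DEVICES)   # ZeRO 3

print(f"fp16 weights, full replica per device: {replicated:.2f} GiB")
print(f"fp16 weights, sharded over {NUM_DEVICES} devices: {sharded:.2f} GiB")
```

Under these assumptions, ZeRO stages 0 to 2 need roughly 37 GiB per device for the weights alone, since those stages partition only optimizer state and gradients, which are small with LoRA. Stage 3 also partitions the parameters, which brings the weight share down to roughly 9 GiB per device before activations and temporary buffers.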

With offload enabled, training hangs at this step:
(screenshot: ddf0f4dafef1723efcd6720214db913)

How can this be resolved?

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga hiyouga added the pending This problem is yet to be addressed. label May 19, 2024