Ascend multi-card training issue #3810

1737686924 · 2024-05-19T11:07:28Z

Reminder

I have read the README and searched the existing issues.

Reproduction

The script is as follows:
deepspeed --num_gpus 4 src/train_bash.py \
--deepspeed examples/deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path /data/applications/LMD-BF/backend/BaseModels/internlm2-chat-20b-sft/internlm2-chat-20b-sft/ \
--dataset identity \
--template intern2 \
--finetuning_type lora \
--lora_target wqkv \
--output_dir saves/internlm2-chat-20b-sft/lora/sft \
--overwrite_cache true \
--overwrite_output_dir true \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 2 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 10.0 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
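With DeepSpeed stage 0, 1, 2, and 3, training fails with an error in every case, presumably because device memory is exhausted (OOM).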
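As a rough sanity check on the OOM guess, here is a back-of-the-envelope estimate of the fp16 weight footprint per device. It is only a sketch under stated assumptions: the parameter count (about 20B) is inferred from the model name, the device count comes from `--num_gpus 4`, and activations, LoRA adapters, optimizer states, and framework overhead are ignored. The helper `fp16_weight_gib` is a hypothetical name used for illustration. ZeRO stages 0 to 2 keep a full copy of the weights on every device, while stage 3 partitions them.

```python
def fp16_weight_gib(num_params: float, shards: int = 1) -> float:
    """GiB of fp16 weights held per device when split evenly across `shards`."""
    bytes_per_param = 2  # fp16 uses 2 bytes per parameter
    return num_params * bytes_per_param / shards / 2**30


NUM_PARAMS = 20e9  # assumption: ~20B parameters, inferred from the model name
NUM_DEVICES = 4    # from --num_gpus 4 in the command above

# ZeRO stages 0-2: every device holds a full copy of the weights.
replicated = fp16_weight_gib(NUM_PARAMS)
# ZeRO stage 3: weights are partitioned across the devices.
sharded = fp16_weight_gib(NUM_PARAMS, NUM_DEVICES)

print(f"stages 0-2: ~{replicated:.1f} GiB of weights per device")
print(f"stage 3:    ~{sharded:.1f} GiB of weights per device")
```

If the estimate for stages 0 to 2 already exceeds the memory of a single device, those stages cannot fit regardless of batch size. A stage 3 failure would then more likely come from activations or temporarily gathered parameters.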

With offload enabled, training hangs at this step.
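The issue does not show which offload settings were used, so the following is only a sketch of what a ZeRO stage 3 configuration with CPU offload commonly looks like, built as a Python dict and printed as JSON. The `"auto"` values assume the Hugging Face Trainer integration fills them in from the command-line arguments; the exact values used in this run are unknown.

```python
import json

# Assumed ZeRO-3 config with parameter and optimizer offload to CPU.
# This is an illustration, not the config actually used in the issue.
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        # Keep optimizer states in host memory instead of device memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Keep partitioned parameters in host memory between uses.
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

print(json.dumps(ds_config, indent=2))
```

Offloading trades device memory for host memory and transfer time, so startup and each step can be much slower. A run that looks stuck may still be initializing, and checking host RAM usage is one way to tell the difference.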

How should this be resolved?

Expected behavior

No response

System Info

No response

Others

No response

hiyouga added the "pending" label (This problem is yet to be addressed.) on May 19, 2024
