[Bug]: get_rank_by_dim_and_process_id function is not implemented #8428

jazzly commented May 13, 2024

Software Environment

- paddlepaddle: 2.6.1
- paddlepaddle-gpu: 
- paddlenlp: 2.8.0

Duplicate Check

  • I have searched the existing issues

Error Description

When training with the versions above on the sample data from https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme, the default command in CPU mode trains on a single thread. To speed up training I noticed the enable_auto_parallel argument, but when enable_auto_parallel is set to True, launching training fails with an error that the get_rank_by_dim_and_process_id function cannot be found.

Traceback (most recent call last):
  File "train.py", line 230, in <module>
    main()
  File "train.py", line 166, in main
    trainer = Trainer(
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 388, in __init__
    self.print_config()
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 3058, in print_config
    v = getattr(args, a)
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/training_args.py", line 1524, in data_parallel_rank
    return mesh.get_rank_by_dim_and_process_id("dp", dist.get_rank())
AttributeError: 'ProcessMesh' object has no attribute 'get_rank_by_dim_and_process_id'

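A minimal workaround sketch, assuming the failing call in paddlenlp/trainer/training_args.py is wrapped locally (the helper name safe_data_parallel_rank is hypothetical and not part of PaddleNLP): check whether the ProcessMesh actually provides get_rank_by_dim_and_process_id, and otherwise fall back to rank 0, which is the only data-parallel rank in single-process CPU training.

import paddle.distributed as dist

def safe_data_parallel_rank(mesh):
    # The ProcessMesh in the environment reported above lacks this helper,
    # so guard the attribute access before calling it.
    if hasattr(mesh, "get_rank_by_dim_and_process_id"):
        return mesh.get_rank_by_dim_and_process_id("dp", dist.get_rank())
    # Fallback: a single-process run has exactly one data-parallel rank.
    return 0
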
Steps to Reproduce & Code

Enable the enable_auto_parallel argument when training:
python3 train.py \
    --do_train \
    --do_eval \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device cpu \
    --num_train_epochs 100 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 3 \
    --enable_auto_parallel True
