[Question]: Error at "Loading best model from checkpoint" after multi-process training launched with paddle.distributed.launch finishes #8429

Open
jazzly opened this issue May 13, 2024 · 7 comments

@jazzly

jazzly commented May 13, 2024

Please ask your question

I am using the example at https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme
For multi-process training, I run:

python3 -m paddle.distributed.launch --nproc_per_node=24 train.py \
    --do_train \
    --do_eval \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device cpu \
    --num_train_epochs 100 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 3

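With multi-process parallelism enabled, loading the result when training completes raises the error below. Am I using it incorrectly? What is the correct command to enable multi-process or multi-threaded computation in CPU mode? I could not find this in the official documentation, and there is no obvious option among the parameters. Using the enable_auto_parallel parameter raises an error; see #8428.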

[2024-05-13 12:49:22,098] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-13 12:49:22)
[2024-05-13 12:55:42,547] [ INFO] - ***** Running Evaluation *****
[2024-05-13 12:55:42,548] [ INFO] - Num examples = 1955
[2024-05-13 12:55:42,548] [ INFO] - Total prediction steps = 3
[2024-05-13 12:55:42,548] [ INFO] - Pre device batch size = 32
[2024-05-13 12:55:42,548] [ INFO] - Total Batch size = 768
[2024-05-13 12:55:56,791] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-13 12:55:56)
[2024-05-13 12:55:56,791] [ INFO] -
Training completed.

[2024-05-13 12:55:56,805] [ INFO] - Loading best model from checkpoint/checkpoint-170 (score: 0.8204603580562659).
[2024-05-13 12:55:57,120] [ INFO] - set state-dict :([], [])
Traceback (most recent call last):
File "train.py", line 230, in <module>
main()
File "train.py", line 185, in main
shutil.rmtree(checkpoint_path)
File "/usr/lib/python3.8/shutil.py", line 715, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/usr/lib/python3.8/shutil.py", line 672, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/usr/lib/python3.8/shutil.py", line 670, in _rmtree_safe_fd
os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tokenizer_config.json'
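
The evaluation figures in the log above are consistent with 24 data-parallel processes each taking a batch of 32. A minimal sketch of that arithmetic follows; the function name `eval_plan` is hypothetical, and the formula is inferred from the logged numbers rather than taken from the PaddleNLP source.

```python
import math


def eval_plan(num_examples: int, per_device_batch_size: int, num_processes: int):
    """Return (total batch size, prediction steps) for a data-parallel evaluation.

    Assumption: each step consumes one batch per process, and the last
    partial step still counts as a step.
    """
    total_batch_size = per_device_batch_size * num_processes
    steps = math.ceil(num_examples / total_batch_size)
    return total_batch_size, steps


if __name__ == "__main__":
    # Values from the log: 1955 examples, batch size 32, --nproc_per_node=24
    print(eval_plan(1955, 32, 24))  # (768, 3)
```

The result matches "Total Batch size = 768" and "Total prediction steps = 3", which suggests that all 24 processes took part in evaluation.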

jazzly added the question label May 13, 2024
@jazzly
Author

jazzly commented May 13, 2024

The versions I am using are:

  • paddlepaddle: 2.6.1
  • paddlenlp: 2.8.0

@w5688414
Contributor

Please check your checkpoint/checkpoint-170 directory to see whether the tokenizer was not saved. A simple workaround is to remove the parameter:

load_best_model_at_end
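
The directory check suggested above could be scripted roughly as follows. This is only a sketch: the helper name `missing_files` is hypothetical, and the default list of expected files contains only `tokenizer_config.json`, the one file named in the traceback. Any other file names would be assumptions about what the trainer saves.

```python
from pathlib import Path


def missing_files(checkpoint_dir, expected=("tokenizer_config.json",)):
    """Return the expected files absent from checkpoint_dir.

    Returns None when the directory itself does not exist, which is a
    different failure from a directory that lacks the tokenizer files.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    result = missing_files("checkpoint/checkpoint-170")
    if result is None:
        print("checkpoint directory does not exist")
    elif result:
        print("missing:", ", ".join(result))
    else:
        print("all expected files are present")
```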

@jazzly
Author

jazzly commented May 14, 2024

Please check your checkpoint/checkpoint-170 directory to see whether the tokenizer was not saved. A simple workaround is to remove the parameter:

load_best_model_at_end

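The thing is, load_best_model_at_end is required in order to use early_stopping. By the time this error is raised, directories such as checkpoint-170 no longer exist. Looking at the worklog, I found that training had in fact completed. But probably because multiple processes were launched, every process tries to run load_best_model_at_end, so only one process can succeed and the others presumably all fail.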

python3 -m paddle.distributed.launch --nproc_per_node=24

Is this the correct way to enable multiple processes in CPU mode?
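
If the hypothesis above is right, the traceback fits a race in which several processes call shutil.rmtree on the same checkpoint directories at once. One possible mitigation is to run the cleanup from a single process and to tolerate files that have already disappeared. The sketch below is not the example's actual train.py: `current_rank` and `remove_checkpoints` are hypothetical helpers, and reading the rank from the `PADDLE_TRAINER_ID` environment variable is an assumption about what the launcher exports.

```python
import os
import shutil
from pathlib import Path


def current_rank() -> int:
    # Assumption: the launcher exports each process's rank as PADDLE_TRAINER_ID.
    # Falls back to 0 for a plain single-process run.
    return int(os.environ.get("PADDLE_TRAINER_ID", "0"))


def remove_checkpoints(output_dir, rank: int):
    """Delete checkpoint-* directories under output_dir from rank 0 only.

    Returns the names of the directories this process removed.
    """
    if rank != 0:
        return []  # other ranks leave the cleanup to rank 0
    removed = []
    for path in sorted(Path(output_dir).glob("checkpoint-*")):
        if path.is_dir():
            # ignore_errors=True tolerates entries that vanished concurrently
            shutil.rmtree(path, ignore_errors=True)
            removed.append(path.name)
    return removed


if __name__ == "__main__":
    print(remove_checkpoints("checkpoint", current_rank()))
```

Non-zero ranks return immediately, so they may exit while rank 0 is still deleting. Whether a synchronization barrier is needed before exit depends on the rest of the script.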

@w5688414
Contributor

w5688414 commented May 14, 2024

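Training on CPU is not recommended because training efficiency is low. For distributed training on GPU, refer to the documentation: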

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html#launch

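--nproc_per_node: the number of processes to launch on each node. In GPU training, it should be less than or equal to the number of GPUs in the system. For example, --nproc_per_node=8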

@jazzly
Author

jazzly commented May 14, 2024

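Training on CPU is not recommended because training efficiency is low. For distributed training on GPU, refer to the documentation:

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html#launch

--nproc_per_node: the number of processes to launch on each node. In GPU training, it should be less than or equal to the number of GPUs in the system. For example, --nproc_per_node=8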

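I have no GPU available at the moment, so I am testing on CPU. With 24 CPU cores the example task trains in a little under 4 hours, which is still usable. What I mean is: in CPU mode, if I do not use paddle.distributed.launch, how should I correctly enable multi-threaded or multi-process training?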

@w5688414
Contributor

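You can open an issue for this in the framework repository. The CPU scenario is not very common and is probably not supported. For distributed training, refer to the documentation: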

https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/index_cn.html

@jazzly
Author

jazzly commented May 15, 2024

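You can open an issue for this in the framework repository. The CPU scenario is not very common and is probably not supported. For distributed training, refer to the documentation: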

https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/index_cn.html

OK, understood. Thanks.
