[Question]: After multi-process training started with paddle.distributed.launch finishes, "Loading best model from checkpoint" raises an error #8429
The versions I am using are as follows:
Take a look at your checkpoint/checkpoint-170 directory: the tokenizer was probably not saved there. A simple workaround is to remove the parameter:
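A quick way to verify that hypothesis is to list the checkpoint directory and check for the tokenizer files. A minimal sketch; the directory path is taken from the error in this thread, and which files count as "tokenizer files" is an assumption based on the `tokenizer_config.json` mentioned in the traceback:

```python
import os

def missing_tokenizer_files(ckpt_dir, expected=("tokenizer_config.json",)):
    """Return the expected tokenizer files that are absent from ckpt_dir."""
    if not os.path.isdir(ckpt_dir):
        # Directory already gone (or never written): everything is missing.
        return list(expected)
    present = set(os.listdir(ckpt_dir))
    return [name for name in expected if name not in present]

# Hypothetical path, copied from the error message in this thread.
print(missing_tokenizer_files("checkpoint/checkpoint-170"))
```

An empty list means the tokenizer files are in place and the failure is more likely the multi-process race discussed below.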
Here is the thing: if you want to use early_stopping, then load_best_model_at_end is required. By the time this error is raised, the checkpoint-170 directory no longer exists. Looking at the worklog, training had in fact already completed. But presumably because multiple processes were launched, every process tried to load_best_model_at_end, so only one process could succeed and the others all failed. Is `python3 -m paddle.distributed.launch --nproc_per_node=24` the correct way to start multiple processes in CPU mode?
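One way to avoid every worker racing to reload and clean up the best checkpoint is to gate that step on the worker's rank. A minimal sketch, assuming the launcher exposes each worker's rank via the `PADDLE_TRAINER_ID` environment variable (as Paddle's launcher does), with a hypothetical `cleanup_checkpoint` helper standing in for whatever train.py actually runs at the end:

```python
import os
import shutil

def is_main_process():
    # Assumption: paddle.distributed.launch exports the per-worker rank
    # as PADDLE_TRAINER_ID; if the variable is absent (single-process
    # run), treat the process as rank 0.
    return int(os.environ.get("PADDLE_TRAINER_ID", "0")) == 0

def cleanup_checkpoint(path):
    # Hypothetical end-of-training cleanup: only rank 0 deletes the
    # directory, and a checkpoint already removed by another worker is
    # tolerated rather than treated as a fatal error.
    if not is_main_process():
        return
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        pass  # another process won the race; nothing left to do
```

With a guard like this, only one worker touches the checkpoint directory, which would sidestep the FileNotFoundError reported below regardless of how many processes are launched.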
Training on CPU is not recommended because of its low efficiency. For distributed GPU training, see the documentation: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html#launch
I have no GPU available at the moment, so I tested on CPU. The example task takes a bit under 4 hours to train on 24 CPU cores, which is still usable. My point is: in CPU mode, if paddle.distributed.launch is not the right tool, how should multi-threaded or multi-process training be started correctly?
You could open an issue under the framework repo for this; the CPU scenario is not very common and is probably unsupported. For distributed training, see the documentation: https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/index_cn.html
OK, understood. Thanks!
Please describe your question
I followed the example at https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme
For multi-process training, I used
to start multi-process parallelism, and when training completed, loading the result raised the error below. Am I using it the wrong way? In CPU mode, which command correctly starts multi-process or multi-threaded computation? I could not find it in the official documentation, there is no clear option among the parameters, and passing the enable_auto_parallel parameter raises an error. See #8428.
```
[2024-05-13 12:49:22,098] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-13 12:49:22)
[2024-05-13 12:55:42,547] [ INFO] - ***** Running Evaluation *****
[2024-05-13 12:55:42,548] [ INFO] - Num examples = 1955
[2024-05-13 12:55:42,548] [ INFO] - Total prediction steps = 3
[2024-05-13 12:55:42,548] [ INFO] - Pre device batch size = 32
[2024-05-13 12:55:42,548] [ INFO] - Total Batch size = 768
[2024-05-13 12:55:56,791] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-13 12:55:56)
[2024-05-13 12:55:56,791] [ INFO] -
Training completed.
[2024-05-13 12:55:56,805] [ INFO] - Loading best model from checkpoint/checkpoint-170 (score: 0.8204603580562659).
[2024-05-13 12:55:57,120] [ INFO] - set state-dict :([], [])
Traceback (most recent call last):
  File "train.py", line 230, in <module>
    main()
  File "train.py", line 185, in main
    shutil.rmtree(checkpoint_path)
  File "/usr/lib/python3.8/shutil.py", line 715, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.8/shutil.py", line 672, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.8/shutil.py", line 670, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tokenizer_config.json'
```
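The traceback is the classic symptom of two workers deleting the same tree: by the time one worker's `shutil.rmtree` reaches an entry, another worker has already unlinked it. A minimal single-process illustration of the same error class, assuming nothing about the PaddleNLP internals:

```python
import shutil
import tempfile

# Create a directory and remove it, then try to remove it again -- the
# second call fails just like a worker that lost the deletion race.
path = tempfile.mkdtemp()
shutil.rmtree(path)

try:
    shutil.rmtree(path)
except FileNotFoundError as exc:
    print("second delete failed:", exc)

# ignore_errors=True makes the cleanup tolerant of a lost race.
shutil.rmtree(path, ignore_errors=True)  # no exception
```

This is why only one process "wins" at cleanup time while the others crash with FileNotFoundError.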