Sharing that might be helpful... #188
Following the official documentation, you can "basically" run everything end to end, but there are a few issues.

Weights fail to load

The first training step works fine, but after training the head NeRF, the second step fails with:

Traceback (most recent call last):
  File "/home/featurize/GeneFacePlusPlus/./utils/commons/trainer.py", line 150, in fit
    self.run_single_process(self.task)
  File "/home/featurize/GeneFacePlusPlus/./utils/commons/trainer.py", line 200, in run_single_process
    model = task.build_model()
  File "/home/featurize/GeneFacePlusPlus/./tasks/radnerfs/radnerf_torso_sr.py", line 69, in build_model
    load_ckpt(head_model, hparams['head_model_dir'], strict=True)
  File "/home/featurize/GeneFacePlusPlus/./utils/commons/ckpt_utils.py", line 67, in load_ckpt
    cur_model.load_state_dict(state_dict, strict=strict)
  File "/environment/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RADNeRFwithSR:
  size mismatch for blink_encoder.1.weight: copying a param with shape torch.Size([2, 32]) from checkpoint, the shape in current model is torch.Size([4, 32]).
  size mismatch for blink_encoder.1.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([4]).

Reason: the default configurations lm3d_radnerf_sr.yaml and lm3d_radnerf_torso_sr.yaml differ in the eye_blink_dim setting.

Solution: change the 4 to 2 in the torso configuration so the head checkpoint loads correctly. Alternatively, changing the 2 to 4 when training the head also works.
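Concretely, the two configs just need to agree on eye_blink_dim. A minimal sketch of the relevant lines (the field name comes from the error above; the surrounding file contents may differ):

```yaml
# egs/datasets/<video_id>/lm3d_radnerf_sr.yaml  (head stage)
eye_blink_dim: 2

# egs/datasets/<video_id>/lm3d_radnerf_torso_sr.yaml  (torso stage)
eye_blink_dim: 2   # was 4 by default; must match the head config above
```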
Be sure to modify the configuration files

This is easy to overlook on a first run. To train on your own data, you need to copy the config directory the project provides for May, but further changes are also necessary. The official documentation only mentions changing the video_id in lm3d_radnerf.yaml to your own name, which is not sufficient. A simple and effective method: in the entire config directory you copied, search for the keywords "May" and "may", and replace both with your own video_id.
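The copy-and-replace step above can be scripted. This is a sketch assuming a bash shell and GNU sed, and that the sample configs live under egs/datasets/May as in the training commands; "my_video" is a placeholder id:

```shell
# Copy the sample config directory and replace every "May"/"may"
# occurrence inside it with your own video id.
VIDEO_ID=my_video    # placeholder: use your own video id
cp -r egs/datasets/May "egs/datasets/${VIDEO_ID}"
grep -rl -e May -e may "egs/datasets/${VIDEO_ID}" \
  | xargs sed -i -e "s/May/${VIDEO_ID}/g" -e "s/may/${VIDEO_ID}/g"
```

Double-check afterwards with `grep -rn may egs/datasets/${VIDEO_ID}` that nothing was missed.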
Modify the training commands

The training commands in the official documentation also hard-code the "may" keyword, which is another easy thing to miss: both "May" and "may_head" appear in them. Replace "May"/"may" with your own video_id here as well; this makes later experiment management and comparison much easier. In fact, the training commands can be written with a single variable:

# train head nerf
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_sr.yaml \
    --exp_name=motion2video_nerf/${VIDEO_ID}_head \
    --reset
# train torso nerf
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_torso_sr.yaml \
    --exp_name=motion2video_nerf/${VIDEO_ID}_torso \
    --hparams=head_model_dir=checkpoints/motion2video_nerf/${VIDEO_ID}_head \
    --reset
A mistake in the configuration

In https://github.com/yerfor/GeneFacePlusPlus/blob/main/egs/datasets/May/lm3d_radnerf_torso_sr.yaml#L24
After all the fixes above, I have gotten good results on my own dataset. Happy Tweaking :)
Hello, thanks for sharing this information; it helped me a lot.
Hello, thanks a lot for sharing. As a beginner, I would like to ask: how can I add TensorBoard during training to monitor progress? Which part of the code should I modify?
The learning rate is controlled by a scheduler; you can adjust it directly by changing the maximum number of training iterations.
TensorBoard logging is already built into the source code; you just need to point TensorBoard at the logs.
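In case it helps, a typical invocation would look like the following. The log directory here is an assumption derived from the --exp_name used in the training commands above; adjust it to wherever the event files actually appear in your setup:

```shell
# Hypothetical log path based on --exp_name; adjust to your experiment name.
tensorboard --logdir checkpoints/motion2video_nerf/my_video_head --port 6006
```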
Can you share your results? When I tried it, the head kept slightly growing and shrinking; the amplitude is small, but still clearly noticeable. Do you have a solution for this? Also, the recently open-sourced MuseTalk works quite well; worth a try.
I don't see the head size fluctuating; the head is quite stable and the lip sync matches, but the mouth movement amplitude is rather small, so I plan to optimize the dataset first. So far I have only trained face models of one or two internal staff members, and for privacy reasons I won't share them.
Thanks for sharing; following with interest.
On pasting the head back into the original video: I saw some ideas mentioned in earlier issues. Could you share a more concrete implementation?
Sorry, "digital human" may be a bit misleading. So far I have only run the GeneFace++ part; pasting the head back into the original video is my next step, along with things like real-time streaming. Happy to discuss.
Does anyone know how to speed up training? For example, where do I change the batch_size? I have four 80 GB GPUs that I can't make use of, which is frustrating.
Many thanks for your helpful comments. To get rid of this problem, I renamed my sample to May.
We have done over 100 hours of training per video as an experiment to see whether results improve. There is no way to choose the "best" model based on just a few numbers during 5-10 hours of training; models generally keep improving even after 20+ hours. Comparing 10 hours of training against 50 and 100 hours, the results always get better eventually: very little, but better. My argument is not based 100% on the code provided by the author, but on multiple changes to improve long-term training and various other improvements for our needs. Upscaling frames (with Topaz, for example) after pre-processing your own video can improve the end results considerably.
There is nothing to slow you down... the checkpoints are only about a hundred megabytes; saving one takes less than a second.
Hi, I'd like to ask: is there a good way to handle blinking? The trained character never blinks.
Same here, no blinking; I tried the old version and it did blink.
@louis-she Hello, the training set resolution during training is 512*512. Is there a way to train a higher-resolution model (2K, 4K)?
First of all, thanks to the author of this paper's code for providing such a code base, as well as all the RAD-NeRF-based code within it. Great work!

After several experiments, I found that training on my own data does not quite reach the quality of the official demo. Although demos are generally carefully selected, this is still strong encouragement to try to match them.

The official author does not seem very active in answering issues, and some of the core code has not been fully released yet. In this issue I will fully document the problems and challenges I ran into while using the project, and how I addressed them (some may still be unresolved), so this is an ongoing effort.
Also, you can follow my WeChat official account for continued discussion: