Sharing that might be helpful... (一些可能有用的分享) #188

Open · louis-she opened this issue Apr 27, 2024 · 30 comments

@louis-she (Author) commented Apr 27, 2024

First of all, thank you to the authors of this paper's code for providing such a code base, including all of the RAD-NeRF-based code within it. Great work!

After several experiments, I found that training on my own data does not quite reach the quality of the official demo. Although demos are generally carefully selected, this is still strong encouragement to try to match them.

The official authors do not seem very active in responding to issues, and some of the core code has not yet been fully released. In this issue I will document the problems and challenges I ran into while using the repo, and how I addressed them (some may still be unresolved), so this is an ongoing post.

Also, you can follow my WeChat official account for ongoing discussion:

[image: WeChat official account]
@louis-she (Author)

Following the official documentation, it is "basically" possible to run everything, but there are a few issues:

Weights cannot be loaded

The first training step works fine, but after training the head NeRF, the second step fails with:

Traceback (most recent call last):
  File "/home/featurize/GeneFacePlusPlus/./utils/commons/trainer.py", line 150, in fit
    self.run_single_process(self.task)
  File "/home/featurize/GeneFacePlusPlus/./utils/commons/trainer.py", line 200, in run_single_process
    model = task.build_model()
  File "/home/featurize/GeneFacePlusPlus/./tasks/radnerfs/radnerf_torso_sr.py", line 69, in build_model
    load_ckpt(head_model, hparams['head_model_dir'], strict=True)
  File "/home/featurize/GeneFacePlusPlus/./utils/commons/ckpt_utils.py", line 67, in load_ckpt
    cur_model.load_state_dict(state_dict, strict=strict)
  File "/environment/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RADNeRFwithSR:
        size mismatch for blink_encoder.1.weight: copying a param with shape torch.Size([2, 32]) from checkpoint, the shape in current model is torch.Size([4, 32]).
        size mismatch for blink_encoder.1.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([4]).

Reason: the default configurations lm3d_radnerf_sr.yaml and lm3d_radnerf_torso_sr.yaml disagree on the eye_blink_dim setting.

Solution: change the 4 to 2 in the torso configuration and the head checkpoint will load correctly. Alternatively, changing the 2 to 4 before training the head also works.
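To catch this kind of mismatch before launching the torso run, a quick check like the one below can help. This is a minimal sketch: it assumes your config directory follows the egs/datasets/May layout from the repo and that eye_blink_dim appears literally in both files rather than being inherited from a base config.

# the two reported values must match, otherwise load_ckpt(strict=True) will fail
grep -n "eye_blink_dim" \
  egs/datasets/May/lm3d_radnerf_sr.yaml \
  egs/datasets/May/lm3d_radnerf_torso_sr.yaml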

@louis-she (Author) commented Apr 27, 2024

Be sure to modify the configuration files

This is easy to overlook on a first run. When training on your own data, you copy the config directory that the repo provides for May, but further changes are needed beyond that.

The official documentation only mentions changing video_id in lm3d_radnerf.yaml to your own name, but that is not enough. A simple and effective method: in the entire config directory you copied, search for the keywords May and may, and replace every occurrence with your own video_id. A shell sketch of this is shown below.
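A minimal sketch of that search-and-replace, assuming GNU sed and that your copied configs live under egs/datasets/myvid (myvid is a hypothetical video_id; back the directory up first, since sed -i edits files in place):

# list every file that still mentions the demo identity
grep -rln "May\|may" egs/datasets/myvid/

# replace both spellings with your own video_id
grep -rl "May\|may" egs/datasets/myvid/ | xargs sed -i 's/May/myvid/g; s/may/myvid/g'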

@louis-she (Author) commented Apr 27, 2024

Modify the training commands

The training commands in the official documentation also hard-code the may keyword, which is another easy thing to overlook. For example:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config=egs/datasets/May/lm3d_radnerf_sr.yaml --exp_name=motion2video_nerf/may_head --reset

Note that May and may_head appear in the command above. Replace May / may with your video_id here as well; this makes later experiment management and comparison much easier!

In fact, the training commands can be tidied up like this:

# train head nerf
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
  --config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_sr.yaml \
  --exp_name=motion2video_nerf/${VIDEO_ID}_head \
  --reset

# train torso nerf
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
  --config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_torso_sr.yaml \
  --exp_name=motion2video_nerf/${VIDEO_ID}_torso \
  --hparams=head_model_dir=checkpoints/motion2video_nerf/${VIDEO_ID}_head \
  --reset
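A short usage note: set the variable once per shell session and both commands pick it up unchanged (myvid is a hypothetical video_id):

export VIDEO_ID=myvid
# ...then run the two training commands above as-is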

@louis-she (Author)

First, train the official demo, inspect the results in TensorBoard, and adjust parameters based on what you see

Most people ultimately want to train their own digital human, but before doing that, be sure to train the official demo once. That gives you a reference in TensorBoard to compare against. Otherwise, when results are poor, you are flying blind and cannot tell where things went wrong.

The official code logs a lot of useful information to TensorBoard. Once the official demo is trained, training runs on your own data become easy to compare against it. For example:

[image: TensorBoard loss curves]

The black line is the official May run, the purple curve is my own data. Notice anything? My data overfits very early, while the official run does not, so I probably need to adjust the number of training iterations, the learning rate, and so on.

If all of your curves match the official curves as closely as possible, the final result should come much closer to the official demo.
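Getting the side-by-side view is just a matter of pointing TensorBoard at a directory that contains both runs. A minimal sketch, assuming the logs are written next to the checkpoints under checkpoints/motion2video_nerf/ (myvid is a hypothetical video_id; adjust the path if your logs land elsewhere):

# may_head and myvid_head show up as separate runs in the same dashboard
tensorboard --logdir checkpoints/motion2video_nerf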

@louis-she (Author)

Select the correct head NeRF model when training the torso NeRF

The official code keeps three checkpoints by default:

[image: saved checkpoint files]

When training the torso NeRF, a head NeRF checkpoint must be loaded. If you use the official command line as-is, it always loads the latest checkpoint, but judging from the curves above, the latest checkpoint is clearly not the best one.

So when training the torso, do not run the default command blindly; manually select a head checkpoint that performs better. One way to do this is sketched below.
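A hedged way to do the manual selection, relying only on what is stated above (the loader always takes the latest checkpoint in head_model_dir): keep the checkpoint you chose and move the later ones aside. The directory name and step numbers here are hypothetical examples:

cd checkpoints/motion2video_nerf/myvid_head
mkdir -p unused
# park the newer checkpoints so the one with the best validation curves is "latest"
mv model_ckpt_steps_250000.ckpt model_ckpt_steps_240000.ckpt unused/
ls   # only the chosen model_ckpt_steps_*.ckpt should remain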

@louis-she (Author)

Save the best model

In the official source code, the code that saves the best model is commented out. I recommend re-enabling it so that the model with the smallest validation loss is kept.

[image: commented-out save-best code]
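The screenshot is not reproduced here, but the block should be easy to locate. A sketch under an assumption: that the save-best logic lives near the rest of the checkpointing code in utils/commons/trainer.py (the file that appears in the traceback earlier in this thread) and mentions "best" in its name or comments:

grep -n -i "best" utils/commons/trainer.py
# look for the commented-out save path and uncomment it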

@louis-she (Author)

A mistake in the configuration

In lm3d_radnerf_torso_sr.yaml there is a num_updates option, but it does not seem to be used anywhere. Comparing against the head NeRF configuration and its values, I suspect this key was meant to be max_updates:

https://github.com/yerfor/GeneFacePlusPlus/blob/main/egs/datasets/May/lm3d_radnerf_torso_sr.yaml#L24
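A quick way to check the suspicion yourself, again a sketch that assumes both keys appear literally in the yaml files and that the training code lives under tasks/ and utils/:

# the torso config defines num_updates, the head config defines max_updates
grep -n "num_updates\|max_updates" \
  egs/datasets/May/lm3d_radnerf_sr.yaml \
  egs/datasets/May/lm3d_radnerf_torso_sr.yaml

# if nothing in the code ever reads num_updates, the key is dead
grep -rn "num_updates" tasks/ utils/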

@louis-she (Author)

After all of the steps above, I am getting decent results on my own dataset. Happy tweaking :)

@richard28039

Hello, thank you for sharing this information, it has helped me a lot.
Regarding the part "First, train the official demo, inspect the results in TensorBoard, and adjust parameters": once overfitting is spotted, which value exactly do you adjust for the learning rate? Thanks.

@abinggo commented Apr 28, 2024

Hello, thanks a lot for sharing. I am a beginner, so may I ask how to add TensorBoard during training to monitor progress? Which part of the code should be modified?

@louis-she (Author)

> Once overfitting is spotted, which value exactly do you adjust for the learning rate?

The learning rate is controlled by a scheduler; you can achieve the same effect by directly adjusting the maximum number of training iterations, max_updates.
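If you prefer to override it from the command line, the repo's --hparams flag takes key=value pairs (it is used that way for head_model_dir earlier in this thread); whether max_updates can be set the same way is an assumption, and 150000 is an arbitrary example value:

# sketch: override max_updates without editing the yaml (assumes --hparams accepts it)
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
  --config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_sr.yaml \
  --exp_name=motion2video_nerf/${VIDEO_ID}_head \
  --hparams=max_updates=150000 \
  --reset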

@louis-she (Author)

> How can I add TensorBoard during training to monitor progress?

The source code already writes TensorBoard logs; you only need to open them with tensorboard and take a look.

@chenkaiC4

> After all of the steps above, I am getting decent results on my own dataset. Happy tweaking :)

Could you share your results? When I tried, the head keeps subtly growing and shrinking; the amplitude is small but still clearly noticeable. Do you have a solution for this? Also, the recently open-sourced MuseTalk looks quite good, worth a try.

@louis-she (Author) commented Apr 29, 2024

> Could you share your results? When I tried, the head keeps subtly growing and shrinking... do you have a solution for this?

No growing or shrinking here; the head is quite stable and the lip shapes match, though the mouth movements are rather small. I plan to optimize from the dataset side first. So far I have only trained face models of one or two internal staff, so for privacy reasons I will not share them.

> Also, the recently open-sourced MuseTalk looks quite good, worth a try.

Thanks for sharing, I will keep an eye on it.

@thorory commented Apr 29, 2024

(Quoting the exchange above.) About pasting the head back onto the original video: I saw some ideas mentioned in earlier issues. Could you share a more concrete implementation?

@louis-she (Author)

> About pasting the head back onto the original video: could you share a more concrete implementation?

Sorry, "digital human" may have been a bit misleading. So far I have only run the GeneFace++ part; pasting the head back onto the original video is my next step, along with things like real-time streaming. Happy to discuss.

@gushuaialan1 commented Apr 29, 2024

(Quoting the reply above.) Do you know how to speed up training? For example, where can I change batch_size? I have four 80 GB GPUs sitting idle and it is painful.

@xiao-keeplearning

Thank you very much for sharing!
About lambda_ambient: does it keep growing during the head training stage for you? According to issue #39, lambda_ambient does not seem to correlate with model quality.
In another issue (#25) the author mentions that the model with the smallest validation_loss is not necessarily the best model. What is your take on that?

@louis-she (Author)

1. What I observe is that lambda_ambient first grows (into the hundreds), then shrinks back down to the 0.x range.
2. Agreed. First, the loss is a combination of several terms; second, there is no standard metric to measure this, only visual inspection, so you can choose to keep more checkpoints (there is an option for that in the config; see the sketch below).
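The exact name of that keep-more-checkpoints option is not stated above, so searching for it is a reasonable first step. This sketch only assumes the option's name mentions ckpt:

grep -n -i "ckpt" egs/datasets/May/lm3d_radnerf_sr.yaml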

@CodeTilde

> Be sure to modify the configuration files [quoted in full above]

Many thanks for your helpful comments. To get rid of this problem I renamed my sample to May.

@CodeTilde

> Select the correct head NeRF model when training the torso NeRF [quoted in full above]

"Manually select a head checkpoint that performs better": how should this be done?

@CodeTilde

> Select the correct head NeRF model when training the torso NeRF [quoted in full above]

"Manually select a head checkpoint that performs better": how should this be done? I just keep the selected checkpoint in {vid_id}_head.

@CodeTilde

> First, train the official demo, inspect the results in TensorBoard, and adjust parameters [quoted in full above]

There are several different loss functions in this work, and their upward/downward trends do not align with each other. What does each individual loss mean, and which loss should we rely on when selecting the proper checkpoint?

@CodeTilde

> Save the best model [quoted in full above]

How is the best model selected by this part of the code, given the different losses used in the work?

@lipsynthesis commented May 2, 2024

We have done over 100 hours of training on a single video as an experiment to see whether results improve. There is no way to choose a "best" model based on just a few numbers from a 5-10 hour training run; models generally still improve even after 20+ hours. Comparing 10 hours of training against 50 and 100 hours, the results always get better eventually. VERY little, but better. My argument is not based 100% on the code as provided by the author, but on multiple changes to improve long-term training plus other improvements for our needs.

Upscaling the frames (with Topaz, for example) after pre-processing your own video can improve end results heavily.

@CodeTilde

> Select the correct head NeRF model when training the torso NeRF [quoted in full above]

Solved.

@louis-she (Author)

> One thing that seems off to me: during training the log shows
>
> | Validation results@174000: {'total_loss': 693.9787731934, 'mse_loss': 0.0061687502, 'sr_mse_loss': 0.006660203, 'lambda_ambient': 693.9659423828} 05/06 09:19:19 PM Epoch 00000@174000: saving model to checkpoints/motion2video_nerf/syz2_head/model_ckpt_steps_174000.ckpt 05/06 09:19:19 PM Delete ckpt: model_ckpt_steps_172000.ckpt
>
> It seems intermediate results are saved every 2000 steps by default? Doesn't that much disk I/O drag down training time? Exposing a setting for this would feel more reasonable.

There is nothing to drag down... a checkpoint is only a hundred-odd megabytes; saving one takes less than a second.

@abinggo commented May 7, 2024

Hi, I would like to ask: is there a good way to fix eye blinking? The trained character does not blink.

@pengpengzi

> Is there a good way to fix eye blinking? The trained character does not blink.

Same here, no blinking. I tried the old version and it does blink.

@thorory commented May 13, 2024

@louis-she Hello, during training the training-set resolution is 512x512. Is there a way to train higher-resolution models (2K, 4K)?
