[Bug]: When amp_master_grad is used together with recompute, weight has no main_grad #8365
Comments
Why is this written split up here? Could you try the form below?
@GuoxiaWang Because the point I care about in this issue is: with recompute enabled, writing it as ctx.save_for_backward(weight) runs into the problem that weight has no main_grad in backward. The form you suggest is the original one in fused_layers.py; I tested it as well, and with recompute=full enabled it hits the strange error below, which would need a separate issue.
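For context, a minimal sketch of the pattern being discussed (illustrative names only, not the actual fused_layers.py code, and assuming a 2D input for simplicity): the weight is saved with ctx.save_for_backward and its main_grad is updated in backward, which is where the missing-main_grad error shows up under recompute.

```python
import paddle
from paddle.autograd import PyLayer


class FusedLinearSketch(PyLayer):
    @staticmethod
    def forward(ctx, x, weight):
        # Saving the weight through save_for_backward is the pattern that
        # breaks under recompute (reentrant=False): backward then receives
        # a copied tensor without the main_grad attribute.
        ctx.save_for_backward(x, weight)
        return paddle.matmul(x, weight)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensor()
        # With amp_master_grad, the weight gradient is accumulated directly
        # into weight.main_grad; this is the access that fails once weight
        # has been replaced by the ".cpy" copy.
        weight.main_grad.add_(paddle.matmul(x, grad_out, transpose_x=True))
        # Only the input gradient is returned; None stands in for the
        # weight gradient because it was fused into main_grad above.
        return paddle.matmul(grad_out, weight, transpose_y=True), None
```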
An update: setting reentrant=True for recompute avoids this bug; it only occurs with reentrant=False.
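As a rough illustration of that workaround (a minimal sketch, assuming the recompute entry point exposed by paddle.distributed.fleet.utils; the corresponding training flag name in a given PaddleNLP version may differ):

```python
import paddle
from paddle.distributed.fleet.utils import recompute

linear = paddle.nn.Linear(16, 16)
x = paddle.randn([4, 16])
x.stop_gradient = False

# use_reentrant=True selects the PyLayer-based recompute path, which per the
# comment above avoids the missing-main_grad problem; the use_reentrant=False
# path is the one that hits it.
out = recompute(linear, x, use_reentrant=True)
out.sum().backward()
```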
@Xreki Could you please ask someone on the Paddle side who is familiar with recompute to take a look?
![image](https://github.com/PaddlePaddle/PaddleNLP/assets/12538138/84258d77-048e-41a2-9641-6d7a303ba6bf)
@Wong4j That is indeed the case.
Software environment
Duplicate check
Error description
Steps to reproduce & code
Take llama training as an example:
- `--amp_master_grad true` to enable main_grad;
- `--recompute true --recompute_granularity full` to enable recompute;
- `--enable_linear_fused_grad_add true` to use llm/llama/fused_layers.py (I found this problem while developing a feature similar to linear_fused_grad_add).

Change the code at fused_layers.py #L32-L41 to:

Then run llama training, and backward reports that weight has no main_grad.
However, if ctx.save_for_backward and ctx.saved_tensor() are not used, and ctx.weight = weight together with weight = ctx.weight is used instead, then weight does have main_grad. From debugging, this appears to be because, when recompute is enabled, save_for_backward triggers the copy at recompute.py#L340, which copies weight into a tensor named weight.name + "cpy" but does not copy main_grad.
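To make the contrast concrete, here is a minimal hypothetical sketch (illustrative names, not the actual fused_layers.py code) of the attribute-based variant that keeps main_grad reachable; the comment in backward marks where the save_for_backward variant would instead hand back the copied tensor.

```python
import paddle
from paddle.autograd import PyLayer


class LinearKeepMainGrad(PyLayer):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x)
        # Stashing the weight as a plain attribute keeps a reference to the
        # original parameter, so weight.main_grad is still visible in backward.
        ctx.weight = weight
        return paddle.matmul(x, weight)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensor()
        weight = ctx.weight  # original parameter, main_grad intact
        # With ctx.save_for_backward(weight) instead, recompute would hand
        # back the weight.name + "cpy" copy here, which has no main_grad.
        weight.main_grad.add_(paddle.matmul(x, grad_out, transpose_x=True))
        return paddle.matmul(grad_out, weight, transpose_y=True), None
```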