Fix policy loss gradient in TD3 #249

Open · wants to merge 1 commit into master
Conversation

ethanluoyc (Contributor)

The gradient `dq_da` is currently used incorrectly: the gradients for each action dimension should be summed rather than averaged, as in

https://github.com/deepmind/rlax/blob/master/rlax/_src/policy_gradients_test.py#L55

Note that the D4PG agent also does not sum over the action dimension, so anyone who wants to write a D4PG-BC agent may hit a similar issue.
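
For concreteness, here is a minimal JAX sketch of the difference, written directly in terms of `dq_da`; the `policy_fn`/`critic_fn` signatures and shapes are illustrative stand-ins, not Acme's actual learner code.

```python
import jax
import jax.numpy as jnp


def dpg_policy_loss(policy_params, critic_params, policy_fn, critic_fn,
                    observations, sum_over_action_dims=True):
  """DPG-style policy loss built from dq/da (illustrative sketch only).

  Assumes policy_fn(params, obs) -> [batch, action_dim] and
  critic_fn(params, obs, action) -> scalar Q for a single example.
  """
  actions = policy_fn(policy_params, observations)
  # Per-example gradient of Q with respect to each action dimension.
  dq_da = jax.vmap(jax.grad(critic_fn, argnums=2),
                   in_axes=(None, 0, 0))(critic_params, observations, actions)
  # Regress the action onto the stop-gradient target a + dq/da (the same
  # construction rlax.dpg_loss uses), giving one loss term per action dim.
  target = jax.lax.stop_gradient(actions + dq_da)
  per_dim_loss = 0.5 * jnp.square(target - actions)   # [batch, action_dim]
  if sum_over_action_dims:
    # Gradient w.r.t. each action coordinate is -dq_da for that coordinate.
    per_example_loss = jnp.sum(per_dim_loss, axis=-1)
  else:
    # Gradient is scaled down to -dq_da / action_dim.
    per_example_loss = jnp.mean(per_dim_loss, axis=-1)
  return jnp.mean(per_example_loss)
```

For a pure DPG loss the two only differ by a constant factor of `action_dim`, which is why the problem is easy to miss in online training.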

The current policy loss is fine for running online TD3 with this version: with an optimizer such as Adam, which normalizes gradients by their magnitude, the constant factor should not affect the computed policy updates.
However, it becomes a problem if the user wants to use Acme's version of TD3 to reproduce results from the TD3-BC paper with the paper's default `bc_alpha` hyperparameter of 2.5. Without the sum, the relative magnitude of the critic gradient and the BC loss differs from the original implementation. I have noticed that this version of TD3 then performs badly on some of the D4RL locomotion datasets (e.g., hopper-medium-replay-v2): without summing over the action dimension, the evaluation return is very unstable.
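
To make the TD3-BC interaction concrete, here is a JAX paraphrase of the actor loss used by the official TD3-BC implementation (not Acme code; `policy_fn`/`critic_fn` are the same illustrative stand-ins as above):

```python
def td3_bc_actor_loss(policy_params, critic_params, policy_fn, critic_fn,
                      observations, dataset_actions, bc_alpha=2.5):
  """JAX paraphrase of the TD3-BC actor loss (illustrative, not Acme code)."""
  pi = policy_fn(policy_params, observations)
  q = jax.vmap(critic_fn, in_axes=(None, 0, 0))(
      critic_params, observations, pi)                 # [batch]
  # The critic term is weighted by lambda = bc_alpha / mean(|Q|).
  lmbda = bc_alpha / jax.lax.stop_gradient(jnp.abs(q).mean())
  bc_loss = jnp.mean(jnp.square(pi - dataset_actions))
  # The gradient of -q.mean() w.r.t. each action coordinate is the full dq_da
  # (no 1/action_dim factor). Building the critic term from an action-dim
  # average instead shrinks it by a factor of action_dim relative to bc_loss,
  # which silently changes the effective bc_alpha.
  return -lmbda * q.mean() + bc_loss
```

Under the averaged convention, matching the paper's balance between the critic and BC terms would require rescaling `bc_alpha` by roughly the action dimensionality, which is why the published default of 2.5 no longer behaves as expected.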

ethanluoyc (Contributor, Author)

@alexis-jacq FYI since we had a discussion last time about some implementation differences between this TD3 and the official PyTorch implementation.
