Question: Gradient Accumulation #607
Hello, does it support gradient accumulation or microbatches like those in the T5X repository? I didn't find a parameter for this in base.yml; maybe I just missed it? Thank you!

Comments
We don't support that out of the box. We've found that tuning the LR to be smaller is a better approach. What is your use case?
I'm training bigger models than before, so I can't use the same batch size on the same TPU. Do you have any recommended ablation studies comparing gradient accumulation with lowering the LR? Also, if I skip gradient accumulation, should I just reduce the LR linearly with the batch size? Thanks!
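For context, the linear scaling heuristic being asked about is usually stated as scaling the learning rate by the ratio of batch sizes. The sketch below is only a rule of thumb from the large-batch training literature, not a recommendation from this project's maintainers, and all values are illustrative:

```python
# Linear-scaling rule of thumb: scale the learning rate in proportion to the
# batch size. All values below are illustrative, not taken from this repository.
base_batch_size = 256          # batch size the original LR was tuned for
base_lr = 3e-4

new_batch_size = 64            # smaller batch that still fits on the TPU
scaled_lr = base_lr * new_batch_size / base_batch_size
print(scaled_lr)               # 7.5e-05
```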
+1
Simply add the following code after the allocation of the optimizer in
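The snippet referenced in the comment above was not captured in this excerpt. As a rough sketch of one common way to add gradient accumulation around a JAX/Optax optimizer (assuming the project uses Optax; this is not necessarily what the commenter posted), `optax.MultiSteps` accumulates gradients over k micro-batches and applies an update only every k-th step:

```python
import jax
import jax.numpy as jnp
import optax

accum_steps = 4                             # hypothetical number of micro-batches to accumulate
base_opt = optax.adamw(learning_rate=1e-3)  # illustrative base optimizer
optimizer = optax.MultiSteps(base_opt, every_k_schedule=accum_steps)

params = {"w": jnp.zeros((8,))}             # toy parameters
opt_state = optimizer.init(params)

def loss_fn(params, batch):
    return jnp.mean((params["w"] * batch["x"] - batch["y"]) ** 2)

@jax.jit
def train_step(params, opt_state, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # MultiSteps returns zero updates until the k-th call, at which point it
    # emits an update based on the averaged accumulated gradient.
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state
```

With this wrapper the effective batch size is accum_steps times the per-step batch, which is the usual motivation when the full batch no longer fits in memory.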