
Fixed OOM list out of range bug #2550

Open · wants to merge 4 commits into base: master
Conversation

alexis-allemann

Issue #2549

@vince62s
Member

Thanks! I get your point, but the risk is that training silently runs through many OOMs without anyone realizing it. Under what circumstances did you run into this?

@alexis-allemann
Author

I encountered this error during a standard experiment while training a translation model. I had set a batch size that almost completely filled the memory of my GPUs. After a few steps (about a thousand), a batch triggered an OOM error on one of the GPUs, aborting the training process. It's not very practical for training to be interrupted by an infrequent OOM error.

I've updated my pull request to recompute the gradients with a new batch, which seems like a better approach than just filling the tensors with zeros. Perhaps we could add an option specifying the number of attempts allowed before terminating training? For example, opt.max_oom_batch_retries. What do you think?
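The retry idea could be sketched roughly like this (plain Python with a stand-in exception; `max_oom_batch_retries` is the proposed option name, and `train_step`/`next_batch` are hypothetical helpers for illustration, not OpenNMT-py API):

```python
class OutOfMemoryError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""


def step_with_oom_retries(train_step, next_batch, max_oom_batch_retries=3):
    """Run one training step, drawing a fresh batch after each OOM.

    Re-raises once the retry budget is exhausted, so a persistent OOM
    still aborts training instead of looping forever.
    """
    for attempt in range(max_oom_batch_retries + 1):
        batch = next_batch()
        try:
            return train_step(batch)
        except OutOfMemoryError:
            if attempt == max_oom_batch_retries:
                raise
            # In the real trainer we would also clear partial state here,
            # e.g. optimizer.zero_grad() and torch.cuda.empty_cache().
```

This only covers the single-GPU case; as discussed below, multi-GPU training needs the ranks to coordinate before any of them retries.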

@vince62s
Member

vince62s commented Jan 12, 2024

You can try this approach, but then run it with a batch size that you know is too big, so that it actually triggers OOM. I don't think it's bulletproof and it will raise exceptions. Just saying, but interested to know.
NB: are you using sentence or token batch sizes?

@alexis-allemann
Author

You're right. I've just tried it and it doesn't seem like such a good idea: a timeout now occurs in torch.distributed.all_gather because the processes deadlock (the rank retrying a batch skips a collective call that the other ranks are still waiting on).
To answer your other question, I use token batch sizes.
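The usual way around that kind of deadlock is to reach consensus before anyone retries: every rank reports whether it OOMed, the flags are combined with an all_reduce MAX, and then all ranks retry or proceed together, so each rank enters every collective call the same number of times. A minimal single-process simulation of the pattern (plain Python, not actual torch.distributed code; `all_reduce_max` mimics `all_reduce` with `ReduceOp.MAX`):

```python
def all_reduce_max(local_flags):
    """Simulates torch.distributed.all_reduce with ReduceOp.MAX:
    after the call, every rank holds the maximum of all ranks' values."""
    m = max(local_flags)
    return [m] * len(local_flags)


def decide_skip(oom_per_rank):
    """Each rank contributes 1 if it hit OOM on this batch, else 0.
    After the reduction, all ranks agree on a single retry/skip
    decision, so no rank is left waiting inside all_gather."""
    flags = all_reduce_max([1 if oom else 0 for oom in oom_per_rank])
    return [flag == 1 for flag in flags]
```

For example, if only rank 1 of four OOMs, `decide_skip([False, True, False, False])` makes every rank decide to skip, keeping the collective call counts aligned.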
