
Fixed OOM list out of range bug #2550

Open · wants to merge 4 commits into base: master
Conversation

alexis-allemann

Issue #2549

@vince62s
Member

Thanks! I get your point, but the risk is that training silently runs through many OOMs without anyone realizing it. Under what circumstances did you run into this?

@alexis-allemann
Author

I encountered this error during a standard experiment while training a translation model. I had set a batch size that almost completely filled the memory of my GPUs. After a few steps (about a thousand), a batch triggered an OOM error on one of the GPUs, aborting the training process. It's not very practical for training to be interrupted by an infrequent OOM error.

I've updated my pull request to recompute the gradients with a new batch, which seems like a better approach than just filling the tensors with zeros. Perhaps we could add an option specifying the number of attempts allowed before terminating training? For example, opt.max_oom_batch_retries. What do you think?
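The retry idea could be sketched roughly like this (plain Python with a stand-in exception; `max_oom_batch_retries` is the proposed option name, and `train_step`/`next_batch` are hypothetical helpers for illustration, not OpenNMT-py API):

```python
class OutOfMemoryError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""


def step_with_oom_retries(train_step, next_batch, max_oom_batch_retries=3):
    """Run one training step, drawing a fresh batch after each OOM.

    Re-raises once the retry budget is exhausted, so a persistent OOM
    still aborts training instead of looping forever.
    """
    for attempt in range(max_oom_batch_retries + 1):
        batch = next_batch()
        try:
            return train_step(batch)
        except OutOfMemoryError:
            if attempt == max_oom_batch_retries:
                raise
            # In the real trainer we would also clear partial state here,
            # e.g. optimizer.zero_grad() and torch.cuda.empty_cache().
```

This only covers the single-GPU case; as discussed below, multi-GPU training needs the ranks to coordinate before any of them retries.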

@vince62s
Member

vince62s commented Jan 12, 2024

You can try this approach, but then run it with a batch size that you know is too big, so that it actually triggers OOM. I don't think it's bulletproof and it will raise exceptions. Just saying, but interested to know.
NB: are you using sentence or token batch sizes?

@alexis-allemann
Author

You're right. I've just tried it and it doesn't seem like such a good idea: a timeout now occurs in torch.distributed.all_gather because the processes deadlock (the rank retrying a batch skips a collective call that the other ranks are still waiting on).
To answer your other question, I use token batch sizes.
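The usual way around that kind of deadlock is to reach consensus before anyone retries: every rank reports whether it OOMed, the flags are combined with an all_reduce MAX, and then all ranks retry or proceed together, so each rank enters every collective call the same number of times. A minimal single-process simulation of the pattern (plain Python, not actual torch.distributed code; `all_reduce_max` mimics `all_reduce` with `ReduceOp.MAX`):

```python
def all_reduce_max(local_flags):
    """Simulates torch.distributed.all_reduce with ReduceOp.MAX:
    after the call, every rank holds the maximum of all ranks' values."""
    m = max(local_flags)
    return [m] * len(local_flags)


def decide_skip(oom_per_rank):
    """Each rank contributes 1 if it hit OOM on this batch, else 0.
    After the reduction, all ranks agree on a single retry/skip
    decision, so no rank is left waiting inside all_gather."""
    flags = all_reduce_max([1 if oom else 0 for oom in oom_per_rank])
    return [flag == 1 for flag in flags]
```

For example, if only rank 1 of four OOMs, `decide_skip([False, True, False, False])` makes every rank decide to skip, keeping the collective call counts aligned.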
