
Loss calculation across GPUs using all_gather_with_grad function #694

Open
AlephZr opened this issue Apr 25, 2024 · 0 comments
Comments

AlephZr commented Apr 25, 2024

The code uses the all_gather_with_grad function to gather tensors, together with their gradients, from all GPUs in order to compute the contrastive loss across GPUs.
I can train the BLIP-2 model successfully with this function. But when I use it in my own model, training hangs after a certain number of iterations: no error is raised and training does not continue. GPU memory and RAM usage both look normal.
Is there a detail I'm missing?
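
For reference, a differentiable all-gather of this kind is typically built from a custom autograd Function that calls dist.all_gather in the forward pass and all-reduces the incoming gradients in the backward pass. Below is a minimal sketch of that pattern; the GatherLayer name and the exact structure are assumptions modeled on the common LAVIS-style implementation, not necessarily the code in question:

```python
import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """All-gather that keeps the autograd graph (dist.all_gather alone does not)."""

    @staticmethod
    def forward(ctx, x):
        # Collect a copy of x from every rank; no gradient flows through this op.
        output = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(output, x)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        # Sum the per-rank output gradients, then return this rank's slice.
        all_gradients = torch.stack(grads)
        dist.all_reduce(all_gradients)
        return all_gradients[dist.get_rank()]


def all_gather_with_grad(tensor):
    # Single-GPU runs need no communication.
    world_size = dist.get_world_size()
    if world_size == 1:
        return tensor
    gathered = GatherLayer.apply(tensor)
    return torch.cat(gathered, dim=0)
```

Note that both dist.all_gather and dist.all_reduce are blocking collectives: every rank must reach them the same number of times, or the ranks that did call them will wait indefinitely, which matches the silent hang described above. Data-dependent branches, an uneven number of batches per rank, or one GPU exiting its loop early are common ways a rank can skip the call.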
