Low GPU Utilization during training #217
Comments
Yes @lucasgris, I am using accelerate and have played around with num_workers. Even in the graph you shared, the util consistently hits very low points (<25% GPU util) — any luck with improving that?
Not yet, but I think it is worth trying to identify where the code is slow. If I have any updates I will share them here.
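One way to check whether data loading is the slow part is to time an epoch of the DataLoader alone at different num_workers settings, with no model in the loop. A minimal sketch, assuming a synthetic stand-in dataset (`SyntheticItems` and the 1 ms sleep are placeholders for the real audio/text preprocessing, not StyleTTS2's actual pipeline):

```python
import time
from torch.utils.data import DataLoader, Dataset

class SyntheticItems(Dataset):
    """Stand-in dataset: each item costs ~1 ms, mimicking preprocessing."""
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        time.sleep(0.001)  # simulate per-item CPU work (decode, resample, ...)
        return idx

def epoch_time(num_workers: int) -> float:
    """Time one full pass over the loader with the given worker count."""
    loader = DataLoader(SyntheticItems(), batch_size=8,
                        num_workers=num_workers)
    t0 = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - t0
```

If `epoch_time(4)` is much smaller than `epoch_time(0)` on the real dataset, the GPU is likely starving on input; if the numbers barely change, the bottleneck is elsewhere (e.g. in the step itself).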
Also having this problem with train_finetune_accelerate.py (StyleTTS2/train_finetune_accelerate.py, lines 449 to 464 at commit 5cedc71).
I tried the following options one by one:
@borrero-c thanks for looking into this. I didn't observe anything changing after 1 epoch — it stays low for me.
Looked into it some more: early on, my steps take 20-40 seconds each. When the training starts to pick up after that first epoch (and the GPU is utilized more consistently), the steps are ~4 seconds each and the backward call takes ~2 seconds. Also interesting to see that this code block takes a good amount of time to complete too (StyleTTS2/train_finetune_accelerate.py, lines 306 to 312 at commit 5cedc71).
It seems that for each step ~25% of the time is spent in the loop above and ~50% is spent in the backward call.
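A lightweight way to get this kind of per-step breakdown is a timing context manager around each phase of the step. A sketch in plain Python — the section names and sleeps below are hypothetical stand-ins for real work, and on GPU you would call torch.cuda.synchronize() before reading the clock, since CUDA kernel launches are asynchronous and wall-clock timings are misleading otherwise:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # accumulated seconds per named section

@contextmanager
def timed(name):
    # NOTE: for CUDA code, call torch.cuda.synchronize() here and again
    # before the final read, so pending kernels are counted correctly.
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - t0

# Usage sketch inside one training step (section names are illustrative):
with timed("data_prep"):
    time.sleep(0.01)   # stand-in for batch preparation
with timed("backward"):
    time.sleep(0.02)   # stand-in for loss.backward()
```

Printing `timings` every N steps makes it easy to see which phase dominates and whether the split changes after the first epoch.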
Hi, I have been trying to train a StyleTTS2 model from scratch on the LibriTTS 460 dataset, currently going through the first stage via train_first.py.
The GPU utilisation during training is very low (~30%). I am using a single H100 with batch_size = 8 and max_len = 300 to fit it on a single GPU. Such low util means the script is not using the GPU efficiently, and there are potential bottlenecks which, if addressed, could make training faster.
Has anyone observed similar issues while training the model from scratch, or has any ideas for improving the GPU util?
cc @yl4579
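If anyone wants to pin down where the time goes, torch.profiler can break a few steps into per-op timings. A minimal CPU-only sketch — `run_step` is a hypothetical stand-in for one real training step, and on GPU you would add ProfilerActivity.CUDA to the activities list:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_step():
    # Tiny stand-in for one training step (the real loop does a full
    # forward/backward pass over the StyleTTS2 model).
    x = torch.randn(64, 64)
    return x @ x

# Profile a few steps and report the ops sorted by self CPU time.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        run_step()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The resulting table usually makes it obvious whether the time sits in data movement, a particular layer, or host-side Python overhead between kernel launches.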