Add a Jupyter file for trying the largest batch #23

Open
wants to merge 1 commit into base: main

Conversation

HLSS-Hen

Regarding #2, the VRAM leak issue:
This is not a memory leak, but rather the result of a large disparity in input sample sizes. Take the processed training data file sizes as an example: only 110 samples are larger than 500 KB, and fewer than 20,000 samples are larger than 300 KB, out of nearly 200,000 samples in total.

During training, batches are composed randomly (batch_size << number of samples). In extreme cases, a batch made up of the largest samples can easily trigger an OOM error. What looks like leaking memory is actually caused by the occasional generation of larger batches. If you call torch.cuda.empty_cache(), you will see that the actual memory usage fluctuates rather than growing steadily; see the sketch below.
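A minimal sketch of that check, assuming a CUDA device is available (memory_allocated counts live tensors, while memory_reserved is what the caching allocator holds on to and roughly what nvidia-smi reports):

```python
import torch

# After a step on a large batch, the caching allocator keeps the blocks around,
# which is what looks like a leak in nvidia-smi. empty_cache() returns unused
# cached blocks to the driver, so the reserved number drops back down.
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved before empty_cache: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
torch.cuda.empty_cache()
print(f"reserved after empty_cache:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```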

I've written this Jupyter script to help you confirm whether your GPU can handle an extremely large batch. By manually adjusting BATCH_SIZE, you can determine an appropriate size for training. However, I cannot guarantee how much headroom should be reserved to be safe: someone mentioned experiencing an OOM error with a batch size of 2 on a 24 GB GPU, even though this script ran successfully.

@ChengkaiYang

I wonder whether this problem occurs when training nearly any model on Argoverse 2? Does it indicate that training a model on Argoverse 2 requires a lot of computational power? Has anyone tried reducing the batch size and the learning rate? Can we still obtain the same experimental results as reported in the paper?

@ZikangZhou
Owner

Hi @HLSS-Hen,

Thanks for contributing to this repo! I'm sorry for not getting back to you sooner. Could you please add a section to README.md that briefly illustrates the usage of this script, so that people who run into the OOM issue know how to use it? Thanks!

@HLSS-Hen
Author

@ZikangZhou, I'm not good at writing, so feel free to revise it.

Treating the license cell as the 0-th Jupyter cell, users need to fill in the CUDA device to use (torch.cuda.set_device(DEVICE_ID)), the dataset root, and the batch size in the first cell; see the illustrative sketch below. At the same time, configure the model in the third cell so that it matches the model actually used.
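For illustration only, the first cell boils down to something like this (the variable names follow the description above and may not match the notebook exactly):

```python
import torch

# Illustrative first-cell configuration; adjust the values to your setup.
DEVICE_ID = 0                          # index of the CUDA device to test on
DATASET_ROOT = "/path/to/argoverse2"   # root directory of the processed dataset
BATCH_SIZE = 4                         # candidate batch size to stress-test

torch.cuda.set_device(DEVICE_ID)
```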

Execute all cells in sequence; the code will automatically form a training-step input from the BATCH_SIZE largest samples (ordered by sample file size) and run a complete single forward and backward pass of the training step, roughly as in the sketch below.
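Roughly, that amounts to the following sketch; load_sample, collate_fn, and model stand in for the actual dataset and QCNet objects and are assumptions here, not the notebook's exact code:

```python
import os
import torch

# Pick the BATCH_SIZE largest processed sample files (largest first)...
processed_dir = os.path.join(DATASET_ROOT, "train", "processed")  # adjust to your layout
files = sorted(
    (os.path.join(processed_dir, name) for name in os.listdir(processed_dir)),
    key=os.path.getsize,
    reverse=True,
)[:BATCH_SIZE]

# ...batch them and run one full forward + backward pass of a training step.
samples = [load_sample(path) for path in files]      # load_sample: hypothetical loader
batch = collate_fn(samples).to(f"cuda:{DEVICE_ID}")  # collate_fn: the dataset's collate function
loss = model.training_step(batch, 0)                 # forward pass (LightningModule-style API)
loss.backward()                                      # backward pass
```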

If a dataset download starts during data loading, the given dataset root is incorrect; the newly downloaded files should be deleted to keep the disk clean.

When executing the last cell, an OOM error means the configured batch size is too large; consider buying a GPU with more VRAM, using gradient accumulation (see the sketch below), using a smaller batch size, and so on.
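For reference, a minimal, self-contained gradient-accumulation sketch with a toy model and a CUDA device is shown here; in this repo's PyTorch Lightning setup, the equivalent is the Trainer's accumulate_grad_batches argument:

```python
import torch

model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ACCUM_STEPS = 4  # split one large batch into 4 micro-batches

optimizer.zero_grad()
for step in range(ACCUM_STEPS):
    x = torch.randn(2, 8, device="cuda")          # a small micro-batch instead of one big batch
    loss = model(x).pow(2).mean() / ACCUM_STEPS   # scale so the accumulated gradient matches the big batch
    loss.backward()                               # gradients add up in .grad across micro-batches
optimizer.step()                                  # one optimizer step per accumulated batch
optimizer.zero_grad()
```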

If everything finishes without errors, the current batch size can be considered usable. You can use the nvidia-smi command to check VRAM usage. Note, however, that your desktop environment, other programs running on the GPU, some parallel training strategies, and so on all require a certain amount of VRAM as well; the snippet below shows how to read the peak allocation from inside the script.
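If you prefer to measure from inside the script instead of via nvidia-smi, PyTorch's own counter reports the peak allocation of the run:

```python
import torch

# Peak tensor allocation since the start of the program (or the last reset);
# leave headroom on top of this for the desktop, other programs, etc.
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak allocated: {peak_gib:.2f} GiB")
```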

If you are not comfortable with Jupyter, you can copy the code of each cell into a new .py file and run it directly. If you need to check VRAM usage, you can add input() at the end of the file so the process stays alive while you inspect it.

@SunHaoOne

@HLSS-Hen
Hi, I've come up with a great idea. I experimented with PyTorch 2's features and found that invoking model = torch.compile(model) in train_qcnet.py significantly reduces memory usage. This approach leverages the new capabilities introduced in PyTorch 2, optimizing the model's memory footprint efficiently.
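A minimal illustration of the idea (requires PyTorch >= 2.0); in train_qcnet.py it would be applied to the QCNet instance right after it is constructed, but a toy module is used here:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
model = torch.compile(model)   # graph capture + kernel fusion; can lower peak VRAM

x = torch.randn(4, 8)
y = model(x)                   # the first call triggers compilation; later calls reuse it
```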

@HLSS-Hen
Author

HLSS-Hen commented Dec 5, 2023

@SunHaoOne,
Oh yes, this is likely because torch.compile fuses some operators, reducing VRAM usage. In fact, I am not familiar with PyTorch Lightning, and I had always assumed that PL would call this function automatically.

For those unfamiliar with torch.compile, please read the PyTorch documentation on torch.compile, as this function increases training preparation time and host memory (not VRAM) usage.
