Problem Overview
I'd like to train a TTA model (following your examples) in a multi-GPU environment (i.e., 4 A100s), but I have been unsuccessful so far.
Steps Taken
1. Prepared the AudioCaps dataset.
2. Fixed typos in the base config files for both the `autoencoderkl` and `audioldm` folders.
3. Updated the JSON and sh files according to my dataset.
4. Launched the train script with `sh egs/tta/autoencoderkl/run_train.sh`, with no further modification -> it works on the first GPU, as expected.
5. Modified run_train.sh#L19 to `export CUDA_VISIBLE_DEVICES="0,1,2,3"` -> it works on the first GPU only.
6. Keeping step 5, also changed exp_config.json#L38 to `"ddp": true` -> fails, asking for all the distribution parameters (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).
7. Reverted steps 5 and 6 and tried to leverage accelerate: ran `accelerate config` to set up single-node multi-GPU training. `accelerate test` works fine on the 4 GPUs.
8. Removed run_train.sh#L19 and modified run_train.sh#L22 to `accelerate launch "${work_dir}"/bins/tta/train_tta.py` -> I see 4 processes on the first GPU, then it goes OOM.
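For context on the `"ddp": true` failure above: the parameters it asks for (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are the standard torch.distributed environment variables, which a launcher such as `torchrun` or `accelerate launch` normally injects into each worker process. A minimal sketch of how an entry point might resolve them (the helper name `get_dist_env` is hypothetical, not part of this repo):

```python
import os

def get_dist_env():
    """Read the torch.distributed env vars a launcher (torchrun /
    accelerate launch) is expected to set, falling back to
    single-process defaults. Hypothetical helper for illustration."""
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": os.environ.get("MASTER_PORT", "29500"),
    }

# When no launcher has set the variables, this resolves to a
# single-process configuration (rank 0, world size 1).
print(get_dist_env())
```

Running the script directly (as `run_train.sh` does) leaves these variables unset, which is why the `"ddp": true` path fails without a launcher to populate them.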
Expected Outcome
A single train job on 4 GPUs.
Environment Information
Operating System: Ubuntu 22.04 LTS
Python Version: Python 3.9.15 (conda env created following your instructions)
Driver & CUDA Version: CUDA 12.2, Driver 535.86.10
Error Messages and Logs: See Steps Taken above
Hi, TTA currently only supports single-GPU training; you can refer to the other tasks to implement multi-GPU training based on accelerate. PRs are welcome.
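For anyone attempting this before official support lands, a single-node 4-GPU accelerate setup (matching the 4×A100 environment described above) might look like the following. This is a sketch using standard `accelerate config` keys; the values are assumptions for that setup:

```yaml
# Sketch of an accelerate config for one node with 4 GPUs
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4        # one process per GPU
gpu_ids: all
mixed_precision: "no"
```

Note that a config like this only controls how processes are launched; the training code itself must still wrap the model, optimizer, and dataloaders with accelerate (and place each process on its own device) for the processes to spread across GPUs, which is consistent with the 4-processes-on-GPU-0 OOM observed above.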