Training questions #44

nom · 2024-05-02T00:03:22Z

Hey, great work! Quick question on training.

I was wondering how you're fitting two SDXL UNets (garment UNet and tryon UNet) on a single A800 with a per-GPU batch size of 24/4 = 6 (assuming 4x A800 in total). I see you're using FP16 models, but are you doing any optimizations to bring memory down, like precomputing embeddings/features, 8-bit Adam, or gradient accumulation? I'm trying to reproduce training, but I can only fit 3 samples at 1024x768 resolution in 80 GB of VRAM, and a single step takes ~1.3 seconds on an H100. I'm already using the tricks above (8-bit Adam, precomputed VAE embeddings) plus a frozen garment UNet.

Also curious about training speed if you can share. Thanks!
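For reference, the batch-size arithmetic in the question can be sketched as below. The global batch of 24 and the 4 GPUs are the question's own assumptions, and the function names are hypothetical, not taken from any training script. The second function shows how many micro-batches of 3 would need to be accumulated per GPU to match a per-GPU batch of 6.

```python
def per_gpu_batch(global_batch: int, num_gpus: int) -> int:
    """Samples each GPU processes per optimizer step, without accumulation."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // num_gpus


def accumulation_steps(global_batch: int, num_gpus: int, micro_batch: int) -> int:
    """Micro-batches to accumulate on each GPU to reach the global batch."""
    target = per_gpu_batch(global_batch, num_gpus)
    if target % micro_batch != 0:
        raise ValueError("per-GPU batch must be a multiple of the micro-batch")
    return target // micro_batch


# ASSUMPTION (from the question): global batch 24 spread over 4 GPUs.
print(per_gpu_batch(24, 4))          # 6
# With only 3 samples fitting in memory, 2 accumulation steps match batch 6.
print(accumulation_steps(24, 4, 3))  # 2
```

Gradient accumulation keeps the effective batch size the same but does not lower peak memory for a given micro-batch; it only trades memory for wall-clock time.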

yisol · 2024-05-02T16:04:07Z

Hello, we used gradient checkpointing and 8-bit Adam for training, which let us fit a batch size of 6 on a single A100 GPU.
We didn't precompute latents or embeddings, and we didn't use gradient accumulation, but you can use them to reduce memory cost.
Training took around 1 to 2 days on 4x A100 GPUs for 63k iterations.
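As a rough sanity check, the figures in this reply can be converted to a per-iteration time and compared with the 1.3 s per step reported in the question. This is only a sketch: it assumes the 63k iterations are optimizer steps at a per-GPU batch of 6 and that the 1 to 2 days is pure training time. The helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_iteration(days: float, iterations: int) -> float:
    """Average wall-clock seconds spent on one training iteration."""
    return days * SECONDS_PER_DAY / iterations


def samples_per_second(batch: int, step_seconds: float) -> float:
    """Per-GPU throughput for a given batch size and step time."""
    return batch / step_seconds


# Figures from the reply: 63k iterations in 1 to 2 days, batch 6 per GPU.
fast = seconds_per_iteration(1, 63_000)
slow = seconds_per_iteration(2, 63_000)
print(f"{fast:.2f} to {slow:.2f} s/iteration")  # 1.37 to 2.74 s/iteration
print(f"{samples_per_second(6, slow):.1f} to {samples_per_second(6, fast):.1f} samples/s per GPU")

# Figures from the question: 3 samples per step at 1.3 s/step on an H100.
print(f"{samples_per_second(3, 1.3):.1f} samples/s per GPU")
```

Under these assumptions the two setups land in a similar per-GPU throughput range, although the hardware differs (A100 vs H100), so the comparison is indicative only.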

nom · 2024-05-02T16:44:47Z

Thanks @yisol. Are you perhaps not doing EMA?

Also if you could share a work-in-progress rough train script here, that'd be really helpful - just to get a better understanding of the differences with mine, doesn't have to be a working script.
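To make the memory side of the EMA question concrete, here is a back-of-the-envelope sketch. It assumes roughly 2.6 billion parameters for one SDXL UNet (an approximate figure, not measured on this model) and that only one UNet is trained. The thread does not confirm whether the authors used EMA, and all names below are hypothetical. Quantization metadata for 8-bit optimizers and activation memory are ignored.

```python
UNET_PARAMS = 2.6e9  # ASSUMPTION: approximate parameter count of one SDXL UNet


def state_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory in GB (10^9 bytes) for a per-parameter training state."""
    return num_params * bytes_per_param / 1e9


adam_32bit = state_gb(UNET_PARAMS, 8)  # two fp32 moments: 4 + 4 bytes
adam_8bit = state_gb(UNET_PARAMS, 2)   # two 8-bit moments: 1 + 1 bytes
ema_fp32 = state_gb(UNET_PARAMS, 4)    # one extra fp32 copy of the weights

print(f"Adam, 32-bit states: {adam_32bit:.1f} GB")  # 20.8 GB
print(f"Adam, 8-bit states:  {adam_8bit:.1f} GB")   # 5.2 GB
print(f"EMA copy, fp32:      {ema_fp32:.1f} GB")    # 10.4 GB


def ema_update(ema, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


# Toy usage on plain floats; a real run would apply this per weight tensor.
print(ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9))
```

Under these assumptions, skipping EMA (or keeping the EMA copy off the GPU) would free an amount of memory comparable to the savings from switching to 8-bit optimizer states.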

ifeherva · 2024-05-05T18:04:23Z

Did you use noise_offset or snr_gamma (=5) during training?
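For context, snr_gamma usually refers to Min-SNR loss weighting, where each timestep's loss is scaled by min(SNR, gamma) / SNR for noise-prediction models. The sketch below computes those weights in plain Python. It assumes the common Stable Diffusion "scaled linear" beta schedule (1000 steps, 0.00085 to 0.012) and epsilon prediction; the thread does not say whether the authors used either setting, and the function names are hypothetical. noise_offset is not covered here.

```python
def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Betas spaced linearly in sqrt space, then squared.

    ASSUMPTION: these are the usual Stable Diffusion defaults.
    """
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]


def snr_per_timestep(betas):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar) at every timestep."""
    snrs, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


def min_snr_weights(snrs, gamma=5.0):
    """Min-SNR loss weights for epsilon prediction: min(SNR, gamma) / SNR."""
    return [min(s, gamma) / s for s in snrs]


weights = min_snr_weights(snr_per_timestep(scaled_linear_betas()), gamma=5.0)
# Low-noise timesteps (high SNR) are down-weighted; noisy ones keep weight 1.
print(weights[0], weights[500], weights[-1])
```

In a training loop, each sample's per-pixel MSE would be multiplied by the weight at its sampled timestep before averaging.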

cardosofelipe · 2024-05-16T14:03:15Z

> Thanks @yisol. Are you perhaps not doing EMA?
>
> Also if you could share a work-in-progress rough train script here, that'd be really helpful - just to get a better understanding of the differences with mine, doesn't have to be a working script.

@yisol It would be really helpful indeed.

Anustup900 · 2024-05-17T05:55:45Z

Hey @nom, I am trying to replicate the training. It would be great if you could share a glimpse of your script; even a rough idea would help.

awzhgw · 2024-05-17T10:26:33Z

@nom, can you share the fine-tuning code with me?
