Training questions #44

Open · nom opened this issue May 2, 2024 · 6 comments
nom commented May 2, 2024

Hey, great work! Quick question on training.

I was wondering how you're fitting two SDXL UNets (garment UNet and try-on UNet) on a single A800 with batch size 24/4 = 6 (assuming 4x A800 in total). I see you're using FP16 models, but are you doing any optimizations to bring memory down, like precomputing embeddings/features, 8-bit Adam, or gradient accumulation? I'm trying to reproduce training, but can only fit 3 samples at 1024x768 resolution in 80 GB of VRAM, and a single step takes ~1.3 seconds on an H100. I'm already using the tricks above (8-bit Adam, precomputed VAE embeddings, frozen garment UNet).

Also curious about training speed if you can share. Thanks!
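
For reference, by "precomputing VAE embeddings" I mean encoding the person/garment images to latents once offline, so the VAE encoder never has to sit on the GPU during training. A rough sketch of what I'm doing (diffusers-style; the model path and names are illustrative, not from this repo):

```python
import torch
from diffusers import AutoencoderKL

# Load the SDXL VAE once, encode the whole dataset offline, then drop the VAE.
# (The stock SDXL VAE can overflow in fp16; an fp16-safe VAE or fp32 also works here.)
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae", torch_dtype=torch.float16
).to("cuda")
vae.requires_grad_(False)

@torch.no_grad()
def encode_to_latents(pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: (B, 3, 1024, 768) in [-1, 1]
    latents = vae.encode(pixel_values.to("cuda", torch.float16)).latent_dist.sample()
    return latents * vae.config.scaling_factor  # 0.13025 for the SDXL VAE

# The cached latents are then loaded from disk during training instead of
# running the VAE encoder on every step.
```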

yisol (Owner) commented May 2, 2024

Hello, we used gradient checkpointing and 8-bit Adam for training, and fit a batch size of 6 on a single A100 GPU.
We didn't precompute latents/embeddings or use gradient accumulation, but you can use them to reduce memory cost.
Training time was around 1~2 days on 4x A100 GPUs for 63k iterations.
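
(For concreteness, those two switches usually come down to the following two calls in a diffusers-style training loop; this is a generic sketch, not the repo's actual training script, and the learning rate is illustrative.)

```python
import bitsandbytes as bnb
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)

# Gradient checkpointing: recompute activations in the backward pass,
# trading extra compute for a large drop in activation memory.
unet.enable_gradient_checkpointing()

# 8-bit Adam: keep optimizer state quantized to 8 bits instead of fp32.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-5, weight_decay=1e-2)
```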

nom (Author) commented May 2, 2024

Thanks @yisol. Are you perhaps not doing EMA?

Also, if you could share a rough work-in-progress training script here, that'd be really helpful, just to get a better understanding of the differences from mine; it doesn't have to be a working script.
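
(On the EMA point, what I have in mind is a shadow copy of the try-on UNet weights, e.g. via diffusers' EMAModel; the decay value below is an assumption on my side, just to be explicit about what I'm asking.)

```python
from diffusers.training_utils import EMAModel

# Shadow copy of the try-on UNet parameters (decay value is illustrative).
ema_unet = EMAModel(unet.parameters(), decay=0.9999)

# Inside the training loop, after optimizer.step():
ema_unet.step(unet.parameters())

# At checkpoint/export time, copy the averaged weights back into the UNet:
ema_unet.copy_to(unet.parameters())
```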

ifeherva commented May 5, 2024

Did you use noise_offset or snr_gamma (=5) during training?
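
(For context, those are the two knobs exposed by the standard diffusers SDXL training scripts. A rough sketch of where they enter the loss, assuming epsilon prediction and with illustrative values, is below; nothing here is taken from this repo.)

```python
import torch
import torch.nn.functional as F
from diffusers.training_utils import compute_snr

def denoising_loss(unet, noise_scheduler, latents, encoder_hidden_states, added_cond_kwargs,
                   noise_offset=0.1, snr_gamma=5.0):
    """One epsilon-prediction loss with noise_offset and min-SNR-gamma weighting.
    noise_offset=0.1 is an illustrative value; snr_gamma=5.0 matches the question."""
    bsz = latents.shape[0]
    noise = torch.randn_like(latents)
    if noise_offset:
        # Per-sample, per-channel shift of the noise (helps with very dark/bright images).
        noise = noise + noise_offset * torch.randn(
            (bsz, latents.shape[1], 1, 1), device=latents.device, dtype=latents.dtype
        )
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    model_pred = unet(
        noisy_latents, timesteps, encoder_hidden_states, added_cond_kwargs=added_cond_kwargs
    ).sample

    loss = F.mse_loss(model_pred.float(), noise.float(), reduction="none")
    # Min-SNR-gamma: clamp the per-timestep weight so easy (high-SNR) steps don't dominate.
    snr = compute_snr(noise_scheduler, timesteps)
    weights = torch.stack([snr, snr_gamma * torch.ones_like(snr)], dim=1).min(dim=1)[0] / snr
    return (loss.mean(dim=(1, 2, 3)) * weights).mean()
```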

@cardosofelipe

> Also, if you could share a rough work-in-progress training script here, that'd be really helpful, just to get a better understanding of the differences from mine; it doesn't have to be a working script.

@yisol It would be really helpful indeed.

@Anustup900

Hey @nom, I am trying to replicate the training. It would be great if you could share a glimpse of your script; even a rough idea of the approach would help.


awzhgw commented May 17, 2024

@nom can you share the finetuning code with me?
