Conditioning on image + text embedding #7

ChintanTrivedi opened this issue May 12, 2022 · 4 comments

ChintanTrivedi commented May 12, 2022

Looking for pointers to get started on modifying the conditioning code below to include conditioning on an image along with text.

videos = torch.randn(2, 3, 5, 32, 32) # video (batch, channels, frames, height, width)
text = torch.randn(2, 64)             # assume output of BERT-large has dimension of 64
loss = diffusion(videos, cond = text)

So far, I am trying to condition on CLIP embeddings:

videos = torch.randn(2, 3, 5, 32, 32) # video (batch, channels, frames, height, width)
image_emb = torch.randn(2, 512) # image (batch, CLIP ViT-B/32 embedding)
text_emb = torch.randn(2, 64) # assume output of BERT-large has dimension of 64

cond_emb = torch.cat((image_emb, text_emb), dim=1) # combining both image and text inputs into the video diffusion condition

loss = diffusion(videos, cond = cond_emb)
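
Putting this together, a minimal sketch of what I'm trying (assuming cond_dim on Unet3D needs to match the size of the concatenated embedding, 512 + 64 = 576; all dimensions here are illustrative):

import torch
from video_diffusion_pytorch import Unet3D, GaussianDiffusion

# cond_dim is assumed to match the concatenated conditioning size (512 + 64 = 576)
model = Unet3D(
    dim = 64,
    cond_dim = 576,
    dim_mults = (1, 2, 4, 8)
)

diffusion = GaussianDiffusion(
    model,
    image_size = 32,
    num_frames = 5,
    timesteps = 1000
)

videos = torch.randn(2, 3, 5, 32, 32)  # video (batch, channels, frames, height, width)
image_emb = torch.randn(2, 512)        # CLIP image embedding
text_emb = torch.randn(2, 64)          # text embedding

cond_emb = torch.cat((image_emb, text_emb), dim = 1)  # (2, 576)
loss = diffusion(videos, cond = cond_emb)
loss.backward()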

However, is there a better way to condition on images in pixel space rather than through latent representations? That might also make it possible to use the model autoregressively, with the last frame of one diffusion sample serving as the input condition for the next sample.

PS: Thanks Phil for the quick implementation of an interesting paper that doesn't have the official code out yet!

@zkx06111

I think you can try concatenating the image directly to the video frames in the channel dim.
That was what SR3 (a paper using image diffusion for image super-resolution) did.
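
Roughly, the tensor manipulation would look like this (just a sketch; the UNet would also need to be configured to take 6 input channels, which is an assumption about how you would wire it in):

import torch

videos = torch.randn(2, 3, 5, 32, 32)   # noisy video (batch, channels, frames, height, width)
cond_image = torch.randn(2, 3, 32, 32)  # conditioning image in pixel space

# repeat the conditioning image across the frame dimension, then concatenate on the channel dim,
# so every frame sees the condition (SR3-style); the model input then has 3 + 3 = 6 channels
cond = cond_image.unsqueeze(2).expand(-1, -1, videos.shape[2], -1, -1)
model_input = torch.cat((videos, cond), dim = 1)  # (2, 6, 5, 32, 32)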


ChintanTrivedi commented May 18, 2022

Thanks @zkx06111, I checked it out, and that makes a lot of sense. Shouldn't it be along the frames dim instead of the channel dim, though, since this is video conditioned on an image rather than image conditioned on image?

If the noise is (32, 3, 10, 128, 128) and the image condition is (32, 3, 128, 128), then the concatenated input would be (32, 3, 11, 128, 128), with the image prepended as an extra frame ahead of the noise.
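
In tensor terms (just a sketch of the shapes above; note the image needs a singleton frame dimension added before the concat):

import torch

noise = torch.randn(32, 3, 10, 128, 128)   # (batch, channels, frames, height, width)
cond_image = torch.randn(32, 3, 128, 128)  # single conditioning image

# add a singleton frame dimension, then prepend along the frame axis (dim = 2)
cond_frame = cond_image.unsqueeze(2)                   # (32, 3, 1, 128, 128)
model_input = torch.cat((cond_frame, noise), dim = 2)  # (32, 3, 11, 128, 128)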

@oxjohanndiep

@ChintanTrivedi Did you have any success with that?


chpk commented Oct 18, 2023

How do you condition on a custom input (image/GIF + text) when the model is loaded from already-saved milestones/checkpoints in the "./results/" folder?

Thank you.
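
For what it's worth, loading the weights back in would look roughly like this (a sketch only; the milestone filename and the 'model' key are assumptions, so inspect the checkpoint's keys to see how the Trainer actually saves it):

import torch
from video_diffusion_pytorch import Unet3D, GaussianDiffusion

# rebuild the model with the same configuration used at training time (values here are illustrative)
model = Unet3D(dim = 64, cond_dim = 64, dim_mults = (1, 2, 4, 8))
diffusion = GaussianDiffusion(model, image_size = 64, num_frames = 10, timesteps = 1000)

# checkpoints are assumed to live under ./results/
state = torch.load('./results/model-10.pt', map_location = 'cpu')
print(state.keys())  # inspect the actual checkpoint layout before loading

# assuming the diffusion model weights are stored under a 'model' key
diffusion.load_state_dict(state['model'])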
