Conditioning on image + text embedding #7

ChintanTrivedi opened this issue May 12, 2022 · 4 comments

ChintanTrivedi commented May 12, 2022

Looking for pointers to get started on modifying the conditioning code below to include conditioning on an image along with text.

videos = torch.randn(2, 3, 5, 32, 32) # video (batch, channels, frames, height, width)
text = torch.randn(2, 64)             # assume output of BERT-large has dimension of 64
loss = diffusion(videos, cond = text)

So far, I am trying to condition on CLIP embeddings:

videos = torch.randn(2, 3, 5, 32, 32) # video (batch, channels, frames, height, width)
image_emb = torch.randn(2, 512) # image (batch, CLIP ViT-B/32 embedding)
text_emb = torch.randn(2, 64) # assume output of BERT-large has dimension of 64

cond_emb = torch.cat((image_emb, text_emb), dim=1) # combining both image and text inputs into the video diffusion condition

loss = diffusion(videos, cond = cond_emb)
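
Putting this together, a minimal sketch of what I'm trying (assuming cond_dim on Unet3D needs to match the size of the concatenated embedding, 512 + 64 = 576; all dimensions here are illustrative):

import torch
from video_diffusion_pytorch import Unet3D, GaussianDiffusion

# cond_dim is assumed to match the concatenated conditioning size (512 + 64 = 576)
model = Unet3D(
    dim = 64,
    cond_dim = 576,
    dim_mults = (1, 2, 4, 8)
)

diffusion = GaussianDiffusion(
    model,
    image_size = 32,
    num_frames = 5,
    timesteps = 1000
)

videos = torch.randn(2, 3, 5, 32, 32)  # video (batch, channels, frames, height, width)
image_emb = torch.randn(2, 512)        # CLIP image embedding
text_emb = torch.randn(2, 64)          # text embedding

cond_emb = torch.cat((image_emb, text_emb), dim = 1)  # (2, 576)
loss = diffusion(videos, cond = cond_emb)
loss.backward()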

However, is there a better way to condition on images in pixel space rather than through latent representations? That might also make it possible to use the model autoregressively, with the last frame of one diffusion sample serving as the input condition for the next sample.

PS: Thanks Phil for the quick implementation of an interesting paper that doesn't have the official code out yet!

@zkx06111

I think you can try concatenating the image directly to the video frames in the channel dim.
That was what SR3 (a paper using image diffusion for image super-resolution) did.
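
Roughly, the tensor manipulation would look like this (just a sketch; the UNet would also need to be configured to take 6 input channels, which is an assumption about how you would wire it in):

import torch

videos = torch.randn(2, 3, 5, 32, 32)   # noisy video (batch, channels, frames, height, width)
cond_image = torch.randn(2, 3, 32, 32)  # conditioning image in pixel space

# repeat the conditioning image across the frame dimension, then concatenate on the channel dim,
# so every frame sees the condition (SR3-style); the model input then has 3 + 3 = 6 channels
cond = cond_image.unsqueeze(2).expand(-1, -1, videos.shape[2], -1, -1)
model_input = torch.cat((videos, cond), dim = 1)  # (2, 6, 5, 32, 32)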


ChintanTrivedi commented May 18, 2022

Thanks @zkx06111, I checked it out, and that makes a lot of sense. Shouldn't it be along the frames dim instead of the channel dim, though, since this is video conditioned on an image rather than image conditioned on image?

If the noise is (32, 3, 10, 128, 128) and the image condition is (32, 3, 128, 128), then the concatenated input would be (32, 3, 11, 128, 128), with the image prepended as an extra frame ahead of the noise.
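
In tensor terms (just a sketch of the shapes above; note the image needs a singleton frame dimension added before the concat):

import torch

noise = torch.randn(32, 3, 10, 128, 128)   # (batch, channels, frames, height, width)
cond_image = torch.randn(32, 3, 128, 128)  # single conditioning image

# add a singleton frame dimension, then prepend along the frame axis (dim = 2)
cond_frame = cond_image.unsqueeze(2)                   # (32, 3, 1, 128, 128)
model_input = torch.cat((cond_frame, noise), dim = 2)  # (32, 3, 11, 128, 128)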

@oxjohanndiep

@ChintanTrivedi Did you have any success with that?


chpk commented Oct 18, 2023

How do you condition on a custom input (image/GIF + text) when the model is loaded from already-saved milestones/checkpoints in the "./results/" folder?

Thank you.
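
For what it's worth, loading the weights back in would look roughly like this (a sketch only; the milestone filename and the 'model' key are assumptions, so inspect the checkpoint's keys to see how the Trainer actually saves it):

import torch
from video_diffusion_pytorch import Unet3D, GaussianDiffusion

# rebuild the model with the same configuration used at training time (values here are illustrative)
model = Unet3D(dim = 64, cond_dim = 64, dim_mults = (1, 2, 4, 8))
diffusion = GaussianDiffusion(model, image_size = 64, num_frames = 10, timesteps = 1000)

# checkpoints are assumed to live under ./results/
state = torch.load('./results/model-10.pt', map_location = 'cpu')
print(state.keys())  # inspect the actual checkpoint layout before loading

# assuming the diffusion model weights are stored under a 'model' key
diffusion.load_state_dict(state['model'])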
