deepfloyd stage 2 crashes with tensor size mismatch when input image size is not divisible by 8 #7842

Open
bghira opened this issue May 2, 2024 · 2 comments · May be fixed by #7844
Labels
bug Something isn't working

bghira commented May 2, 2024

Describe the bug

DeepFloyd's upstream code supports 8px-aligned inputs for stage II, and I believe the Diffusers implementation is based on it. However, for certain sizes the hidden states and the residual hidden states end up with different spatial shapes inside the UNet (a width of 42 versus 43 in the logs below), and the pipeline crashes with a tensor size mismatch.

I'm not sure if this is something fundamental to the model. If it is, we probably want to understand the conditions under which the problem occurs and raise an error that tells the user the resolution is incompatible.
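A minimal sketch of what such an error could look like, assuming the 8px alignment mentioned above is the right condition. The helper name `check_stage2_input_size` and its `multiple` parameter are hypothetical; nothing here is part of the Diffusers API.

```python
def check_stage2_input_size(width: int, height: int, multiple: int = 8) -> None:
    """Raise ValueError if either dimension is not a multiple of `multiple`.

    Hypothetical helper: the default of 8 follows the alignment assumed
    in this issue, not a documented requirement of the pipeline.
    """
    # Collect every offending dimension so the message names all of them.
    offending = [
        f"{name}={value}"
        for name, value in (("width", width), ("height", height))
        if value % multiple != 0
    ]
    if offending:
        raise ValueError(
            f"Stage II input size must be divisible by {multiple}, "
            f"got {', '.join(offending)}."
        )
```

Calling `check_stage2_input_size(86, 64)` before running the pipeline would raise a `ValueError` that names the width, instead of failing later with a tensor size mismatch.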

Reproduction

from diffusers import IFSuperResolutionPipeline
import torch
from PIL import Image
import numpy as np

torch.manual_seed(42)

# Configuration for initial image and desired output
initial_width = 86  # One-fourth of the 344px output width; note that 86 is not divisible by 8
initial_height = 64  # Adjusted height to be one-fourth of 256

# Initialize your device setting based on availability
torch_device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "xpu" if torch.xpu.is_available() else "cpu"

# Create a dummy image (86x64)
dummy_image = torch.rand((3, initial_height, initial_width), dtype=torch.float32)  # Random noise image
dummy_image = (dummy_image * 255).to(torch.uint8)  # Convert to 8-bit format
dummy_pil_image = Image.fromarray(dummy_image.numpy().transpose(1, 2, 0))  # Convert to PIL image for compatibility
dummy_pil_image.save("dummy_input.png")  # Save the initial dummy image

# Load your stage 2 pipeline
stage2_pipe = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", watermarker=None, safety_checker=None, local_files_only=False).to(device=torch_device, dtype=torch.bfloat16)

# Upscale the dummy image using stage 2 of the pipeline
upscaled_image = stage2_pipe(
    prompt="A simple upscaled image", 
    image=dummy_pil_image, 
    guidance_scale=5.5, 
    num_inference_steps=20, 
    width=344, 
    height=256
).images[0]

upscaled_image.save("upscaled_dummy_output.png")

Logs

  0%|          | 0/20 [00:00<?, ?it/s]

hidden_states.shape: torch.Size([2, 768, 16, 21])
res_hidden_states.shape: torch.Size([2, 768, 16, 21])
hidden_states.shape: torch.Size([2, 768, 16, 21])
res_hidden_states.shape: torch.Size([2, 768, 16, 21])
hidden_states.shape: torch.Size([2, 768, 16, 21])
res_hidden_states.shape: torch.Size([2, 768, 16, 21])
hidden_states.shape: torch.Size([2, 768, 32, 42])
res_hidden_states.shape: torch.Size([2, 768, 32, 43])
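One possible reading of these shapes, offered as an assumption rather than a confirmed diagnosis: if each downsampling stage halves a dimension with floor rounding and each upsampling stage doubles it, an odd intermediate size can no longer be recovered on the way back up. The sketch below traces one dimension through four such stages. The stage count is inferred from the 256 -> 16 height in the log, and `first_skip_mismatch` is a hypothetical helper, not Diffusers code.

```python
from typing import Optional, Tuple


def first_skip_mismatch(size: int, num_downsamples: int = 4) -> Optional[Tuple[int, int]]:
    """Return (upsampled, skip) for the first mismatching level, or None.

    Assumes floor-halving on the way down and exact doubling on the way up.
    """
    # Sizes stored for the skip connections, from input down to the bottleneck.
    skips = [size]
    for _ in range(num_downsamples):
        skips.append(skips[-1] // 2)

    # Walk back up, comparing each doubled size with the matching skip size.
    current = skips[-1]
    for skip in reversed(skips[:-1]):
        current *= 2
        if current != skip:
            return current, skip
    return None


print(first_skip_mismatch(344))  # width requested in the failing script
print(first_skip_mismatch(256))  # height requested in the failing script
```

Under these assumptions the width goes 344, 172, 86, 43, 21 on the way down, and the first doubling on the way up gives 42 against a stored 43, which matches the last two log lines. The height of 256 halves cleanly to 16 and shows no mismatch.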

System Info

  • diffusers version: 0.27.2
  • Platform: macOS-14.4.1-arm64-arm-64bit
  • Python version: 3.10.14
  • PyTorch version (GPU?): 2.4.0.dev20240421 (False)
  • Huggingface_hub version: 0.22.2
  • Transformers version: 4.40.0.dev0
  • Accelerate version: 0.26.1

Who can help?

@DN6 @yiyixuxu

bghira added the bug label May 2, 2024
bghira changed the title from "deepfloyd stage 2 crashes with tensor size mismatch when image is divisible by 8 (sometimes)" to "deepfloyd stage 2 crashes with tensor size mismatch when image size is divisible by 8 (sometimes)" May 2, 2024

bghira commented May 2, 2024

hmm so 86 isn't divisible by 8.

if i adjust the script like so:

from diffusers import IFSuperResolutionPipeline
import torch
from PIL import Image
import numpy as np

torch.manual_seed(42)

# Configuration for initial image and desired output
initial_width = 86  # One-fourth of 344; not divisible by 8, so it is rounded up below
initial_height = 64  # Adjusted height to be one-fourth of 256

# Adjust initial_width to be divisible by 8
initial_width = int(np.ceil(initial_width / 8) * 8)
print(f"Resolution: {initial_width}x{initial_height}")
# Initialize your device setting based on availability
torch_device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "xpu" if torch.xpu.is_available() else "cpu"

# Create a dummy image (88x64 after rounding the width up to a multiple of 8)
dummy_image = torch.rand((3, initial_height, initial_width), dtype=torch.float32)  # Random noise image
dummy_image = (dummy_image * 255).to(torch.uint8)  # Convert to 8-bit format
dummy_pil_image = Image.fromarray(dummy_image.numpy().transpose(1, 2, 0))  # Convert to PIL image for compatibility
dummy_pil_image.save("dummy_input.png")  # Save the initial dummy image

print(f"Image resolution: {dummy_pil_image.size}")
# Load your stage 2 pipeline
stage2_pipe = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", watermarker=None, safety_checker=None, local_files_only=False).to(device=torch_device, dtype=torch.bfloat16)

# Upscale the dummy image using stage 2 of the pipeline
upscaled_image = stage2_pipe(
    prompt="A simple upscaled image", 
    image=dummy_pil_image, 
    guidance_scale=5.5, 
    num_inference_steps=20, 
    width=initial_width * 4, 
    height=initial_height * 4
).images[0]

upscaled_image.save("upscaled_dummy_output.png")

there is no crash
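The rounding step in the adjusted script can also be written with integer arithmetic only, which avoids the float round trip through `np.ceil`. This is a small sketch; `align_up` is a hypothetical helper name, and the multiple of 8 is the value used in the script above.

```python
def align_up(value: int, multiple: int = 8) -> int:
    """Round `value` up to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    # Negated floor division rounds toward +infinity for positive inputs.
    return -(-value // multiple) * multiple


initial_width = align_up(86)   # 88, same result as the np.ceil version
initial_height = align_up(64)  # already aligned, stays 64
print(f"Resolution: {initial_width}x{initial_height}")
```

Values that are already aligned pass through unchanged, so the helper is safe to apply to both dimensions.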

bghira pushed a commit to bghira/diffusers that referenced this issue May 2, 2024

bghira commented May 2, 2024

note: i understand deepfloyd is not often used by commercial outfits due to its restrictive license, but it apparently has research value and i've run into this during research into deepfloyd's characteristics with the T5 text encoder (which is worthwhile to explore, now that there are more models available to compare against). this PR is an effort to improve the experience for research use of these weights.

bghira changed the title from "deepfloyd stage 2 crashes with tensor size mismatch when image size is divisible by 8 (sometimes)" to "deepfloyd stage 2 crashes with tensor size mismatch when input image size is divisible by 8" May 2, 2024
bghira changed the title from "deepfloyd stage 2 crashes with tensor size mismatch when input image size is divisible by 8" to "deepfloyd stage 2 crashes with tensor size mismatch when input image size is not divisible by 8" May 2, 2024