
Unexpected keyword argument 'encoder_hidden_states' in VisionEncoderDecoder models #30825

Open

4th05 opened this issue May 15, 2024 · 3 comments

4th05 commented May 15, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): 2.15.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.8.3 (gpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@amyeroberts, @ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from datasets import load_dataset
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "microsoft/biogpt"
)

# Set padding and cls tokens
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.decoder.resize_token_embeddings(len(tokenizer))

if tokenizer.cls_token_id is None:
    tokenizer.add_special_tokens({'cls_token': '[CLS]'})
    model.decoder.resize_token_embeddings(len(tokenizer))

model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values

labels = tokenizer(
    "an image of two cats chilling on a couch",
    return_tensors="pt",
).input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(pixel_values=pixel_values, labels=labels).loss

Expected behavior

From the given code (modified from the example in the Training section here), we should receive a torch.Tensor with the loss value, something like tensor(10.9159, grad_fn=<NllLossBackward0>). Instead, it raises the following error:

TypeError: BioGptForCausalLM.forward() got an unexpected keyword argument 'encoder_hidden_states'

This error occurs because, inside the forward method of VisionEncoderDecoderModel, the decoder is called as follows (see line 603 here):

        # Decode
        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            inputs_embeds=decoder_inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            use_cache=use_cache,
            past_key_values=past_key_values,
            return_dict=return_dict,
            **kwargs_decoder,
        )

Here, encoder_hidden_states comes from the vision encoder's output (see line 585 here). But, as the error points out, this decoder's forward method does not accept encoder_hidden_states.

In the documentation, we found that "The VisionEncoderDecoderModel can be used to initialize an image-to-text model with any pretrained Transformer-based vision model as the encoder (e.g. ViT, BEiT, DeiT, Swin) and any pretrained language model as the decoder (e.g. RoBERTa, GPT2, BERT, DistilBERT)."

As a result, I assumed that any decoder, including BioGPT, would work. Therefore, I believe this is a bug.
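
For comparison, a minimal sketch (not part of the original report, and assuming the public gpt2 checkpoint and its tokenizer): swapping the decoder for GPT-2, whose forward() does accept encoder_hidden_states, should avoid the TypeError with the same setup.

from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Same ViT encoder, but GPT-2 as the decoder; GPT2LMHeadModel.forward() accepts
# encoder_hidden_states, so the cross-attention call inside VisionEncoderDecoderModel succeeds.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
model.config.decoder_start_token_id = tokenizer.bos_token_id  # gpt2 has no CLS token
model.config.pad_token_id = tokenizer.eos_token_id            # gpt2 has no pad token

labels = tokenizer(
    "an image of two cats chilling on a couch", return_tensors="pt"
).input_ids

# pixel_values as computed in the snippet above
loss = model(pixel_values=pixel_values, labels=labels).loss   # no TypeError here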

@amyeroberts
Collaborator

Hi @4th05, thanks for raising this issue!

The reason this is failing for the biogpt model is that it hasn't been implemented to be used as a decoder with cross-attention. Other decoder-only models, like GPT2, accept encoder_hidden_states, an additional feature that enables using them in composite models like these.

The documentation is indeed misleading and should really be updated: we can't guarantee that all pretrained language models can be used as a decoder (for the same reasons biogpt doesn't work). In fact, as this is a configurable composite model, we can't guarantee that all encoder-decoder pairings are compatible.
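
One quick way to see which decoders expose the hook (just a sketch that inspects the forward signature, not an official compatibility check):

import inspect
from transformers import BioGptForCausalLM, GPT2LMHeadModel

def accepts_encoder_hidden_states(model_cls):
    # VisionEncoderDecoderModel forwards the image features via this kwarg,
    # so the decoder's forward() has to expose it.
    return "encoder_hidden_states" in inspect.signature(model_cls.forward).parameters

print(accepts_encoder_hidden_states(GPT2LMHeadModel))    # True
print(accepts_encoder_hidden_states(BioGptForCausalLM))  # False (as of v4.40)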

Would you like to open a PR to add this feature to BioGPT?

@4th05
Author

4th05 commented May 15, 2024

Hey @amyeroberts, thank you so much for your reply!

Regarding the PR, BioGPT was just an example. My main goal is to connect my own medical ViT to BioMistral-7b, which encounters the same error as BioGPT (likely due to the lack of cross-attention, as you mentioned). I have embeddings from my custom ViT, making it straightforward to pass them as encoder_outputs to VisionEncoderDecoderModel with models like GPT2 and BERT, which accept encoder_hidden_states. However, this approach fails with domain-specific decoders like the ones mentioned.

Since I am not currently able to open that PR (I'm a beginner), do you think the following would work as a temporary workaround for my local project: fusing the image embeddings from my ViT with their associated text embeddings and passing this fusion to the decoder_inputs_embeds argument of the VisionEncoderDecoderModel, which essentially serves as the decoder's inputs_embeds? I know it "works" in the sense that it doesn't break the code, but is this approach conceptually valid?
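
What I have in mind is roughly the following, feeding the fused sequence to the decoder as its inputs_embeds (a rough sketch only; the projection layer and the random image features are placeholders, not code from my project):

import torch
from transformers import AutoTokenizer, BioGptForCausalLM

decoder = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

# Placeholder image features from a custom ViT: (batch, num_patches, vit_hidden)
vit_embeds = torch.randn(1, 197, 768)

text_ids = tokenizer(
    "an image of two cats chilling on a couch", return_tensors="pt"
).input_ids
text_embeds = decoder.get_input_embeddings()(text_ids)        # (1, T, decoder_hidden)

# Project the image features into the decoder's hidden size, then concatenate
# them in front of the text embeddings ("fusion").
proj = torch.nn.Linear(vit_embeds.size(-1), text_embeds.size(-1))
fused = torch.cat([proj(vit_embeds), text_embeds], dim=1)      # (1, 197 + T, decoder_hidden)

outputs = decoder(inputs_embeds=fused)   # plain self-attention over the fused sequence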

@amyeroberts
Collaborator

Since I am not currently able to open that PR (I'm a beginner), do you think the following would work as a temporary workaround for my local project: fusing the image embeddings from my ViT with their associated text embeddings and passing this fusion to the decoder_inputs_embeds argument of the VisionEncoderDecoderModel, which essentially serves as the decoder's inputs_embeds? I know it "works" in the sense that it doesn't break the code, but is this approach conceptually valid?

This is more of a question for the forums, as we try to reserve the github issues for feature requests and bug reports.

"Valid" is a bit undefined. You can certainly combine the embeddings and pass them to a decoder model, and you'd be surprised by what models can learn from their inputs. However, when you pass a tensors of tokens through to a model, the assumption is that they are all part of the same sequence and are in order.

The attention calculation is also different from what happens in the vision encoder-decoder. In that model, the decoder's text states provide the queries for cross-attention, while the image encoder's outputs provide the keys and values. In the fusion case, all of the image and text embeddings together create q, k and v for self-attention. With cross-attention it's not necessary for the two inputs to come from the same space (e.g. the same vocabulary), but that is assumed for self-attention.
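
In very rough tensor terms, the two setups look something like this sketch (shapes only; heads, masks and learned projections are ignored, and all sizes are made up for illustration):

import torch
import torch.nn.functional as F

hidden = 64
img = torch.randn(1, 197, hidden)   # vision encoder output (e.g. ViT patch features)
txt = torch.randn(1, 12, hidden)    # decoder text hidden states

# Cross-attention (VisionEncoderDecoderModel): text states are the queries,
# image states supply the keys and values.
cross = F.scaled_dot_product_attention(txt, img, img)            # (1, 12, hidden)

# Fusion + self-attention (the proposed workaround): the concatenated sequence
# supplies queries, keys and values alike, so image and text are treated as
# one sequence in the same input space.
fused = torch.cat([img, txt], dim=1)
self_out = F.scaled_dot_product_attention(fused, fused, fused)   # (1, 209, hidden)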
