
Unexpected keyword argument 'encoder_hidden_states' in VisionEncoderDecoder models #30825

Open

4th05 opened this issue May 15, 2024 · 3 comments

4th05 commented May 15, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): 2.15.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.8.3 (gpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@amyeroberts, @ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from datasets import load_dataset
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "microsoft/biogpt"
)

# Set padding and cls tokens
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.decoder.resize_token_embeddings(len(tokenizer))

if tokenizer.cls_token_id is None:
    tokenizer.add_special_tokens({'cls_token': '[CLS]'})
    model.decoder.resize_token_embeddings(len(tokenizer))

model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values

labels = tokenizer(
    "an image of two cats chilling on a couch",
    return_tensors="pt",
).input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(pixel_values=pixel_values, labels=labels).loss

Expected behavior

From the given code (modified from the example in the Training section here), we should receive a torch.Tensor with the loss value, something like tensor(10.9159, grad_fn=<NllLossBackward0>). Instead, it raises the following error:

TypeError: BioGptForCausalLM.forward() got an unexpected keyword argument 'encoder_hidden_states'

This error occurs because, inside the forward method of VisionEncoderDecoderModel, the decoder is called as follows (see line 603 here):

        # Decode
        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            inputs_embeds=decoder_inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            use_cache=use_cache,
            past_key_values=past_key_values,
            return_dict=return_dict,
            **kwargs_decoder,
        )

Here, encoder_hidden_states comes from the vision encoder's output (see line 585 here). But, as the error points out, this decoder's forward method does not accept encoder_hidden_states.

In the documentation, we found that "The VisionEncoderDecoderModel can be used to initialize an image-to-text model with any pretrained Transformer-based vision model as the encoder (e.g. ViT, BEiT, DeiT, Swin) and any pretrained language model as the decoder (e.g. RoBERTa, GPT2, BERT, DistilBERT)."

As a result, I assumed that any decoder, including BioGPT, would work. Therefore, I believe this is a bug.
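
For comparison, a minimal sketch (not part of the original report, and assuming the public gpt2 checkpoint and its tokenizer): swapping the decoder for GPT-2, whose forward() does accept encoder_hidden_states, should avoid the TypeError with the same setup.

from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Same ViT encoder, but GPT-2 as the decoder; GPT2LMHeadModel.forward() accepts
# encoder_hidden_states, so the cross-attention call inside VisionEncoderDecoderModel succeeds.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
model.config.decoder_start_token_id = tokenizer.bos_token_id  # gpt2 has no CLS token
model.config.pad_token_id = tokenizer.eos_token_id            # gpt2 has no pad token

labels = tokenizer(
    "an image of two cats chilling on a couch", return_tensors="pt"
).input_ids

# pixel_values as computed in the snippet above
loss = model(pixel_values=pixel_values, labels=labels).loss   # no TypeError here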

@amyeroberts
Collaborator

Hi @4th05, thanks for raising this issue!

The reason this is failing for the biogpt model is that it hasn't been implemented to be used as a decoder with cross-attention. Other decoder-only models, like GPT2, accept encoder_hidden_states, an additional feature that enables using them in composite models like these.

The documentation is indeed misleading and should really be updated: we can't guarantee that all pretrained language models can be used as a decoder (for the same reasons biogpt doesn't work). In fact, as this is a configurable composite model, we can't guarantee that all encoder-decoder pairings are compatible.
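
One quick way to see which decoders expose the hook (just a sketch that inspects the forward signature, not an official compatibility check):

import inspect
from transformers import BioGptForCausalLM, GPT2LMHeadModel

def accepts_encoder_hidden_states(model_cls):
    # VisionEncoderDecoderModel forwards the image features via this kwarg,
    # so the decoder's forward() has to expose it.
    return "encoder_hidden_states" in inspect.signature(model_cls.forward).parameters

print(accepts_encoder_hidden_states(GPT2LMHeadModel))    # True
print(accepts_encoder_hidden_states(BioGptForCausalLM))  # False (as of v4.40)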

Would you like to open a PR to add this feature to BioGPT?

@4th05
Author

4th05 commented May 15, 2024

Hey @amyeroberts, thank you so much for your reply!

Regarding the PR, BioGPT was just an example. My main goal is to connect my own medical ViT to BioMistral-7b, which encounters the same error as BioGPT (likely due to the lack of cross-attention, as you mentioned). I have embeddings from my custom ViT, making it straightforward to pass them as encoder_outputs to VisionEncoderDecoderModel with models like GPT2 and BERT, which accept encoder_hidden_states. However, this approach fails with domain-specific decoders like the ones mentioned.

Since I am not currently able to open that PR (I'm a beginner), do you think the following would work as a temporary workaround for my local project: fusing the image embeddings from my ViT with their associated text embeddings and passing this fusion to the decoder_inputs_embeds argument of the VisionEncoderDecoderModel, which essentially serves as the decoder's inputs_embeds? I know it "works" in the sense that it doesn't break the code, but is this approach conceptually valid?
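
What I have in mind is roughly the following, feeding the fused sequence to the decoder as its inputs_embeds (a rough sketch only; the projection layer and the random image features are placeholders, not code from my project):

import torch
from transformers import AutoTokenizer, BioGptForCausalLM

decoder = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

# Placeholder image features from a custom ViT: (batch, num_patches, vit_hidden)
vit_embeds = torch.randn(1, 197, 768)

text_ids = tokenizer(
    "an image of two cats chilling on a couch", return_tensors="pt"
).input_ids
text_embeds = decoder.get_input_embeddings()(text_ids)        # (1, T, decoder_hidden)

# Project the image features into the decoder's hidden size, then concatenate
# them in front of the text embeddings ("fusion").
proj = torch.nn.Linear(vit_embeds.size(-1), text_embeds.size(-1))
fused = torch.cat([proj(vit_embeds), text_embeds], dim=1)      # (1, 197 + T, decoder_hidden)

outputs = decoder(inputs_embeds=fused)   # plain self-attention over the fused sequence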

@amyeroberts
Collaborator

Since I am not currently able to open that PR (I'm a beginner), do you think the following would work as a temporary workaround for my local project: fusing the image embeddings from my ViT with their associated text embeddings and passing this fusion to the decoder_inputs_embeds argument of the VisionEncoderDecoderModel, which essentially serves as the decoder's inputs_embeds? I know it "works" in the sense that it doesn't break the code, but is this approach conceptually valid?

This is more of a question for the forums, as we try to reserve the github issues for feature requests and bug reports.

"Valid" is a bit undefined. You can certainly combine the embeddings and pass them to a decoder model, and you'd be surprised by what models can learn from their inputs. However, when you pass a tensors of tokens through to a model, the assumption is that they are all part of the same sequence and are in order.

The attention calculation is also different from what happens in the vision encoder-decoder. In that model, the decoder's text states provide the queries for cross-attention, while the image encoder's outputs provide the keys and values. In the fusion case, all of the image and text embeddings together create q, k and v for self-attention. With cross-attention it's not necessary for the two inputs to come from the same space (e.g. the same vocabulary), but that is assumed for self-attention.
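
In very rough tensor terms, the two setups look something like this sketch (shapes only; heads, masks and learned projections are ignored, and all sizes are made up for illustration):

import torch
import torch.nn.functional as F

hidden = 64
img = torch.randn(1, 197, hidden)   # vision encoder output (e.g. ViT patch features)
txt = torch.randn(1, 12, hidden)    # decoder text hidden states

# Cross-attention (VisionEncoderDecoderModel): text states are the queries,
# image states supply the keys and values.
cross = F.scaled_dot_product_attention(txt, img, img)            # (1, 12, hidden)

# Fusion + self-attention (the proposed workaround): the concatenated sequence
# supplies queries, keys and values alike, so image and text are treated as
# one sequence in the same input space.
fused = torch.cat([img, txt], dim=1)
self_out = F.scaled_dot_product_attention(fused, fused, fused)   # (1, 209, hidden)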
