Unexpected keyword argument 'encoder_hidden_states' in VisionEncoderDecoder models #30825
Comments
Hi @4th05, thanks for raising this issue! The reason this is failing for the BioGPT model is that it hasn't been implemented to be used as a decoder with cross-attention. For other decoder-only models, like GPT2, you'll see that they accept `encoder_hidden_states` in their forward method. The documentation is indeed misleading and should really be updated: we can't guarantee that all pretrained language models can be used as a decoder (for the same reasons BioGPT doesn't work). In fact, as this is a configurable composite model, we can't guarantee that all encoder-decoder pairings are compatible. Would you like to open a PR to add this feature to BioGPT?
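One quick way to see the difference (a sketch, not from the original thread: inspecting the forward signatures of the two decoder classes):

```python
import inspect
from transformers import BioGptForCausalLM, GPT2LMHeadModel

# GPT2's forward accepts encoder_hidden_states (cross-attention support) ...
print("encoder_hidden_states" in inspect.signature(GPT2LMHeadModel.forward).parameters)    # True
# ... while BioGPT's does not, hence the error inside VisionEncoderDecoderModel
print("encoder_hidden_states" in inspect.signature(BioGptForCausalLM.forward).parameters)  # False
```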
Hey @amyeroberts, thank you so much for your reply! Regarding the PR, BioGPT was just an example. My main goal is to connect my own medical ViT to BioMistral-7b, which encounters the same error as BioGPT (likely due to the lack of cross-attention, as you mentioned). I have embeddings from my custom ViT, making it straightforward to pass them as `encoder_hidden_states`. Since I am not currently able to add this PR (because I'm a beginner), as a temporary workaround for my current local project, do you think it would be valid to fuse the image embeddings from my ViT with their associated text embeddings and pass this fusion to the decoder?
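As a sketch of that workaround (GPT2 stands in for BioMistral-7b, and the ViT embeddings are random placeholders; both are assumptions, and the image embeddings must be projected to the decoder's hidden size):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in decoder with hidden size 768; any causal LM accepting
# inputs_embeds would work the same way.
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

image_embeds = torch.randn(1, 196, 768)  # placeholder for the custom ViT output
text_ids = tokenizer("chest x-ray, frontal view", return_tensors="pt").input_ids
text_embeds = decoder.get_input_embeddings()(text_ids)

# "Fusion" by concatenation along the sequence dimension, fed in as inputs_embeds:
fused = torch.cat([image_embeds, text_embeds], dim=1)
outputs = decoder(inputs_embeds=fused)
print(outputs.logits.shape)  # (1, 196 + text_len, vocab_size)
```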
This is more of a question for the forums, as we try to reserve the GitHub issues for feature requests and bug reports. "Valid" is a bit undefined. You can certainly combine the embeddings and pass them to a decoder model, and you'd be surprised by what models can learn from their inputs. However, when you pass a tensor of token embeddings through a model, the assumption is that they are all part of the same sequence and are in order. The attention calculation is also different from what is being done in the vision encoder-decoder: there, the image encoder outputs provide the keys and values for cross-attention, with the decoder states providing the queries. In this fusion case, all of the image and text embeddings are used to create q, k, v for self-attention. In the cross-attention case, it's not necessary for the two inputs to come from the same space, e.g. the same vocabulary, but that is assumed for self-attention.
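To make that distinction concrete, a minimal sketch (illustrative only: single head, no masking, shared projections for brevity):

```python
import torch
import torch.nn.functional as F

d = 64
q_proj, k_proj, v_proj = (torch.nn.Linear(d, d) for _ in range(3))

text_states = torch.randn(1, 10, d)    # decoder-side (token) states
image_states = torch.randn(1, 196, d)  # encoder-side (ViT patch) states

# Cross-attention (vision encoder-decoder): queries from the decoder states,
# keys/values from the encoder outputs -- the two streams can live in
# different spaces.
cross = F.scaled_dot_product_attention(
    q_proj(text_states), k_proj(image_states), v_proj(image_states)
)

# Fusion + self-attention: q, k and v all come from one concatenated
# sequence, so every position is treated as part of the same ordered input.
fused = torch.cat([image_states, text_states], dim=1)
self_attn = F.scaled_dot_product_attention(
    q_proj(fused), k_proj(fused), v_proj(fused)
)
print(cross.shape, self_attn.shape)  # torch.Size([1, 10, 64]) torch.Size([1, 206, 64])
```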
System Info

`transformers` version: 4.40.2

Who can help?

@amyeroberts, @ArthurZucker and @younesbelkada

Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
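The original snippet did not survive extraction; a sketch of the likely reproduction, adapting the Training example from the VisionEncoderDecoder docs to BioGPT (the caption string and start/pad token choices are assumptions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Pair a ViT encoder with BioGPT as the decoder, as in the docs' Training example
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "microsoft/biogpt"
)

model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

image = load_dataset("huggingface/cats-image")["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values
labels = tokenizer("an image of two cats chilling on a couch", return_tensors="pt").input_ids

outputs = model(pixel_values=pixel_values, labels=labels)  # raises the TypeError with BioGPT
```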
Expected behavior
From the given code (modified from the example in section Training here), we should receive a `torch.Tensor` with the loss value, something like `tensor(10.9159, grad_fn=<NllLossBackward0>)`. But instead, it returns a `TypeError` complaining about the unexpected keyword argument `encoder_hidden_states`.

This error is occurring because inside the `forward` method of the class `VisionEncoderDecoderModel`, the decoder model is being called as follows (see line 603 here):
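The quoted snippet was lost in extraction; condensed from `modeling_vision_encoder_decoder.py`, the call looks roughly like:

```python
decoder_outputs = self.decoder(
    input_ids=decoder_input_ids,
    attention_mask=decoder_attention_mask,
    encoder_hidden_states=encoder_hidden_states,  # <- rejected by BioGptForCausalLM.forward
    encoder_attention_mask=encoder_attention_mask,
    inputs_embeds=decoder_inputs_embeds,
    use_cache=use_cache,
    past_key_values=past_key_values,
    return_dict=return_dict,
    **kwargs_decoder,
)
```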
Where the parameter `encoder_hidden_states` comes from the vision encoder output (see line 585 here). But, of course, as the error is pointing out, this decoder has no `encoder_hidden_states` parameter.

In the documentation, we found that "The VisionEncoderDecoderModel can be used to initialize an image-to-text model with any pretrained Transformer-based vision model as the encoder (e.g. ViT, BEiT, DeiT, Swin) and any pretrained language model as the decoder (e.g. RoBERTa, GPT2, BERT, DistilBERT)."
As a result, I assumed that any decoder, including BioGPT, would work. Therefore, I believe this is a bug.