How should I use blip2 for vqa task training? #688

Open
WildLight opened this issue Apr 16, 2024 · 3 comments
Comments

@WildLight

Hello, thank you very much for open-sourcing this project. I found that for the BLIP-2 model there is only training code for the captioning task, and none for the VQA task. What should I do?

@Thomas2419

Apologies, my answer won't be the most in-depth, but I haven't seen any of the devs responding to questions like this, probably for the good reason that they're busy people.

So I will try my best to assist. Yes, there is only code for caption-task training, but the training config has a `task` field that is currently set to image captioning. If you have the time and patience to spend on it, you can check out the code for the VQA dataset loader to determine the expected input. My suggestion is to switch the model to flan_t5xl, since that's what the demos use for VQA (see the sketch below).
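For context, this is roughly how the official demos query the flan_t5xl variant with a question prompt. It is a minimal sketch based on the LAVIS README example; the image path is just a placeholder, and the exact `model_type` string may differ between LAVIS versions.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Pick a GPU if available; BLIP-2 with FlanT5-XL is large.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# "blip2_t5" + "pretrain_flant5xl" is the variant the demos use for VQA.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

# Placeholder image path -- replace with your own file.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# VQA-style prompting: the demos wrap the question in a "Question: ... Answer:" template.
answer = model.generate({"image": image, "prompt": "Question: what is in the picture? Answer:"})
print(answer)
```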

Then determine the changes needed to make FlanT5 work. In the tasks directory you can find VQA; I checked and compared some of the eval scripts against the available tasks, and I'm actually doing VQA training on my blip2_flant5xl model at this very moment (a quick way to inspect the registered tasks is sketched below). So I can try to go more in depth, but I'm not a software or deep-learning engineer by trade (though I've been a deep-learning hobbyist for the last four years), so it took me a couple of days of reverse engineering, testing, and evaluating output to figure out what I needed to change to get this working.
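As a quick sanity check that a VQA task is actually registered, something like the sketch below should work. It assumes the registry helpers I remember from LAVIS (`registry.list_tasks()` / `registry.get_task_class()`); the method and task names may differ in your version, so treat it as a starting point rather than a definitive recipe.

```python
# Minimal sketch: confirm LAVIS registers a VQA task and see which class
# a training config's `task:` field would resolve to.
# Assumes registry.list_tasks() / registry.get_task_class() exist in your LAVIS version.
import lavis.tasks  # noqa: F401  -- importing this package runs the @register_task decorators
from lavis.common.registry import registry

print(registry.list_tasks())               # expect something like ['captioning', 'vqa', ...]
vqa_task_cls = registry.get_task_class("vqa")
print(vqa_task_cls)                        # the task class you would point the config at
```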

Apologies, I don't fully remember all of the iterations and changes I had to make. I can say, though, that it is definitely worth the effort: for some reason most models like this do not use ViT-g/14 from EVA-CLIP, but BLIP-2 does, and if my memory is correct that is the highest-performing CLIP variant among models of a similar size. It therefore has a heightened visual capability compared to many other models.

@WildLight (Author)

Hi, thank you. I will try to finish it now.

@1832390030

Hello, have you completed this project yet? I would like to ask some questions. Could you give me your contact information? Thank you very much.
