How should I use blip2 for vqa task training? #688

Open
WildLight opened this issue Apr 16, 2024 · 3 comments
Comments

@WildLight

Hello, thank you very much for open-sourcing this project. I found that for the BLIP-2 model there is only training code for the captioning task, and none for the VQA task. What should I do?

@Thomas2419

Apologies, my answer won't be the most in-depth, but I haven't seen any of the devs responding to questions like this, probably for the good reason that they're busy people.

So I will try my best to assist. Yes, there is only code for caption-task training, but the training config has a `task` field that is currently set to image captioning. If you have the time and patience to spend on it, you can check out the code for the VQA dataset loader to determine the expected input. My suggestion is to switch the model to flan_t5xl, since that's what the demos use for VQA (see the sketch below).
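For context, this is roughly how the official demos query the flan_t5xl variant with a question prompt. It is a minimal sketch based on the LAVIS README example; the image path is just a placeholder, and the exact `model_type` string may differ between LAVIS versions.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Pick a GPU if available; BLIP-2 with FlanT5-XL is large.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# "blip2_t5" + "pretrain_flant5xl" is the variant the demos use for VQA.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

# Placeholder image path -- replace with your own file.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# VQA-style prompting: the demos wrap the question in a "Question: ... Answer:" template.
answer = model.generate({"image": image, "prompt": "Question: what is in the picture? Answer:"})
print(answer)
```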

Then determine the changes needed to make FlanT5 work. In the tasks directory you can find VQA; I checked and compared some of the eval scripts against the available tasks, and I'm actually doing VQA training on my blip2_flant5xl model at this very moment (a quick way to inspect the registered tasks is sketched below). So I can try to go more in depth, but I'm not a software or deep-learning engineer by trade (though I've been a deep-learning hobbyist for the last four years), so it took me a couple of days of reverse engineering, testing, and evaluating output to figure out what I needed to change to get this working.
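As a quick sanity check that a VQA task is actually registered, something like the sketch below should work. It assumes the registry helpers I remember from LAVIS (`registry.list_tasks()` / `registry.get_task_class()`); the method and task names may differ in your version, so treat it as a starting point rather than a definitive recipe.

```python
# Minimal sketch: confirm LAVIS registers a VQA task and see which class
# a training config's `task:` field would resolve to.
# Assumes registry.list_tasks() / registry.get_task_class() exist in your LAVIS version.
import lavis.tasks  # noqa: F401  -- importing this package runs the @register_task decorators
from lavis.common.registry import registry

print(registry.list_tasks())               # expect something like ['captioning', 'vqa', ...]
vqa_task_cls = registry.get_task_class("vqa")
print(vqa_task_cls)                        # the task class you would point the config at
```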

Apologies, I don't fully remember all of the iterations and changes I had to make. I can say, though, that it is definitely worth the effort: for some reason most models like this do not use ViT-g/14 from EVA-CLIP, but BLIP-2 does, and if my memory is correct that is the highest-performing CLIP variant among models of a similar size. It therefore has a heightened visual capability compared to many other models.

@WildLight (Author)

Hi, thank you. I will try to finish it now.

@1832390030

Hello, have you completed this project yet? I would like to ask some questions. Could you give me your contact information? Thank you very much.
