OPT2.7B underperforming & weird behavior compared to flant5xl on image captioning? #676
Comments
Could I ask how you fine-tuned BLIP-2? I have collected some pedestrian images and would like to use BLIP-2's image captioning to obtain text descriptions of them. I'd appreciate your guidance! |
Yes, my apologies; I had seen your other comment and meant to respond. I'll preface all of this by saying I am no machine learning engineer, just a machine learning hobbyist for the past six years. Sometimes I know how to get things to work but not why they work, or what I may be messing up when I get them to work. I've learned largely by asking questions of people who know how things work, so this feels like quite the role reversal for me.

There were a lot of edits I had to make, so I'll give an overview of what I assume is most applicable to you; without knowing whether you can run distributed training and such, there are a few preliminary steps I'm unsure you need to take. On my RTX 3090, and given certain oddities in the BLIP-2 implementation, I am totally unable to fully utilize my GPU VRAM. Two fundamental changes had to be made to make this work:

1. Setting the config to bf16, and editing eva_vit.py to cast the ViT to bf16 instead of fp16 (see the sketch after this comment). When I set vit_precision to fp16, I got NaN instead of a normal loss.
2. Editing the coco_caption config to point to my local dataset. This was admittedly a lazy fix; you could just as easily make a new dataset builder.

Some other miscellaneous fixes: I set the model_type to use the pretrained model, not the COCO-finetuned one, and used a 224 image size instead of 364. I'm unsure why that matters: I can fit a batch size of 6 at 224 but not even a batch size of 1 at 364, which doesn't make sense pixel-wise, since (364^2)/(224^2) is only about 2.6, not 6, so a batch size of 2 should in theory be achievable.

There may well be things I forgot; I found many of the fixes in reported issues, so please check, and if this doesn't work, let me know what the problem is. This also assumes your dataset is a folder containing matching image/text file pairs; edit the dataset loader if that is not the case. The files were all converted to .txt to be attachable here.

This all assumes you specifically want to use BLIP-2 for this; if any image captioning model will do, open_clip's CoCa model also works very well and was much easier to test.

Here's my edited coco_caption: Here's my edited eva_vit.py: |
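For reference, a minimal sketch of change (1): casting the ViT weights to bf16 rather than fp16. This is a hypothetical adaptation of the fp16 cast helper in LAVIS's eva_vit.py; the exact function name and structure may differ between releases.

```python
import torch
import torch.nn as nn

def convert_weights_to_bf16(model: nn.Module):
    """Cast applicable ViT parameters to bfloat16 instead of float16.

    Sketch adapted from LAVIS's fp16 conversion helper (assumption:
    your copy of eva_vit.py has an equivalent function to replace).
    """

    def _convert(layer):
        if isinstance(layer, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            layer.weight.data = layer.weight.data.to(torch.bfloat16)
            if layer.bias is not None:
                layer.bias.data = layer.bias.data.to(torch.bfloat16)

    model.apply(_convert)
```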
Thank you for your reply; it is very timely and important to me. Thank you again! |
@shams2023 I apologize; I realize I gave you an incorrect caption dataset for use with flant5xl. Here is what is required, since flant5xl needs both text_input and text_output. This goes at the end of the class CaptionDataset(BaseDataset, __DisplMixin) in the caption_datasets.py file: |
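As a rough illustration of that change (the actual edited file was attached as .txt and isn't reproduced here), a hedged sketch assuming LAVIS's CaptionDataset; the prompt string is a placeholder, not the exact one used in this thread:

```python
import os
from PIL import Image
from lavis.datasets.datasets.caption_datasets import CaptionDataset

class FlanT5CaptionDataset(CaptionDataset):
    """Hypothetical variant whose __getitem__ returns both the fields
    flant5xl needs: text_input (prompt) and text_output (caption)."""

    def __getitem__(self, index):
        ann = self.annotation[index]

        image_path = os.path.join(self.vis_root, ann["image"])
        image = self.vis_processor(Image.open(image_path).convert("RGB"))
        caption = self.text_processor(ann["caption"])

        return {
            "image": image,
            "text_input": "a photo of",  # prompt for Flan-T5 (placeholder)
            "text_output": caption,      # target caption
            "image_id": self.img_ids[ann["image_id"]],
        }
```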
You're right, bro! Thank you for your help! |
Hello! I was fine-tuning from the pretrained_flant5xl and pretrained_opt2.7b models, and much to my surprise, the flant5xl model is excelling at producing correct labels (my captions are actually strings of labels). My objective was to determine the feasibility of training these models on complicated, interconnected labels, some with subcategories and some without. Flan correctly produces all of the words with very high accuracy, while OPT starts to randomly use characters like ~ that aren't present anywhere in my dataset, and replaces a couple of sets of labels with "urchin". So anywhere it would presumably predict labels 1, 2, and 3, for example (which in my dataset are actually fantasy races), it says "~ urchin ~". This clearly indicates the model understands that the correct label should go in that spot, since it follows a sort of logic and only replaces certain labels.
This is on a custom dataset of image-text pairs that I implemented. It is fairly small, around 2104 images. The captions represent labels, some of which have sub-options: roughly 20 choices (averaging 4 subcategories each), 3 choices, 13 choices (averaging 7 subcategories each), 8 choices, 8 choices, and 4 choices.
A couple of notes: I am editing the dataset loader to provide text_input for OPT, as is normal in the LAVIS repository, and, for Flan, the prompt as text_input and the caption as text_output (a sketch follows this post). Both models are being trained with the ViT cast to bf16, which doesn't seem to have diminished quality. Also, the opt2.7b model's loss is substantially lower than the flant5 model's for some reason. Please let me know if anyone has any ideas about how I can fix this, or suggestions of things to try. Thanks for the help!
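To illustrate that loader difference, a minimal sketch assuming one folder of matching image/.txt caption pairs; the field names follow LAVIS conventions, and the Flan-T5 prompt string is a placeholder:

```python
import os
from PIL import Image

def load_sample(image_path, for_flan_t5, vis_processor, text_processor):
    """Build one training sample from an image and its sibling .txt caption.

    Hypothetical helper: OPT and Flan-T5 expect different text fields,
    so the returned dict differs per model.
    """
    image = vis_processor(Image.open(image_path).convert("RGB"))
    with open(os.path.splitext(image_path)[0] + ".txt") as f:
        caption = text_processor(f.read().strip())

    if for_flan_t5:
        # Flan-T5: prompt goes in text_input, caption in text_output.
        return {"image": image, "text_input": "a photo of", "text_output": caption}
    # OPT: the caption itself goes in text_input.
    return {"image": image, "text_input": caption}
```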