Too sensitive to prompting #23

Closed
cmgzy opened this issue May 9, 2024 · 21 comments · Fixed by #24
Comments

cmgzy commented May 9, 2024

I found that some VLMs are too sensitive to the prompt. For example, when I use mlx-community/llava-1.5-7b-4bit:
The image is: [image attached]

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "how many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

The response is correct:
There are nine dogs in the image.

But if I change the prompt to "How many dogs in the image?" (capital "H"):

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "How many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

The response is wrong:
There are seven dogs in the image.
I also tried llava-llama-3-8b-v1_1-8bit/llava-phi-3-mini-8bit/idefics2-8b-chatty-8bit with both "how..." and "How...", but the responses were wrong every time.
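
For anyone who wants to reproduce the comparison, here is a minimal sketch that just re-runs the CLI above with both capitalisations (model name and image path are placeholders, adjust as needed):

import subprocess

MODEL = "mlx-community/llava-1.5-7b-4bit"
IMAGE = "/Users/xxx/Pictures/xx.jpg"  # placeholder path, replace with a local image

for prompt in ("how many dogs in the image?", "How many dogs in the image?"):
    # Invoke the same CLI used above and capture its output for comparison
    result = subprocess.run(
        ["python", "-m", "mlx_vlm.generate",
         "--model", MODEL,
         "--prompt", prompt,
         "--image", IMAGE,
         "--max-tokens", "100",
         "--temp", "0.0"],
        capture_output=True, text=True,
    )
    print(f"prompt={prompt!r}\n{result.stdout.strip()}\n")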

Blaizzy commented May 9, 2024

Thanks for sharing!

I will look into this 👌🏽

cmgzy commented May 13, 2024

FYI, I found that https://huggingface.co/liuhaotian/llava-v1.6-34b does very well on the dog-counting test.
I tested it on this demo:
https://llava.hliu.cc/
I made several clips with varying numbers of dogs and prompted with "How many dogs are there in the image? Answer the question using a single word or phrase.", and it always answered correctly.

[screenshot attached]

Blaizzy commented May 13, 2024

Awesome!

That model is based on the LLaVA-NeXT architecture, which we don't support at the moment.

Would you like to make a PR to add it?

Blaizzy commented May 13, 2024

Did you test the transformers versions of the previous models you reported?

cmgzy commented May 13, 2024

Did you test the transformers versions of the previous models you reported?

Well, it's kind of difficult to do that since my laptop only has 16 GB of RAM. Transformers versions without heavy quantization run too slowly...

Blaizzy commented May 13, 2024

No worries. I will run those tests in a few

Blaizzy commented May 13, 2024

@cmgzy I ran some tests.

And it turns out the models give accurate answers if you run them in full precision or in 8-bit.

The problem is the mlx 4-bit quantisation. The latest mlx release (v0.13.0) fixes this, and the new 4-bit model answers correctly.

I'm uploading it and also adding an 8-bit version.

Please give it a try and let me know if you find any other issues.
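
Before re-testing, a quick way to confirm you're on the fixed release (a small sketch of mine, assuming the packaging library is available):

import mlx.core as mx
from packaging.version import Version

# The 4-bit quantisation fix mentioned above landed in mlx v0.13.0
assert Version(mx.__version__) >= Version("0.13.0"), (
    f"mlx {mx.__version__} predates the fix; run pip install -U mlx first."
)
print("mlx", mx.__version__, "OK")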

cmgzy commented May 14, 2024

@Blaizzy
For mlx-community/llava-1.5-7b-8bit, the model files are not ready yet.
For mlx-community/llava-1.5-7b-4bit, it cannot run with mlx==0.13.0 and mlx-vlm==0.0.4; error info:
Traceback (most recent call last):
  File "/Users/chenmi/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/chenmi/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/generate.py", line 107, in <module>
    main()
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/generate.py", line 92, in main
    output = generate(
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/utils.py", line 758, in generate
    logits, cache = model(input_ids, pixel_values)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/llava.py", line 134, in __call__
    input_embddings = self.get_input_embeddings(input_ids, pixel_values)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/llava.py", line 77, in get_input_embeddings
    *_, hidden_states = self.vision_tower(
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 220, in __call__
    return self.vision_model(x, output_hidden_states)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 193, in __call__
    x = self.embeddings(x)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 176, in __call__
    embeddings += self.position_embedding.weight
ValueError: Shapes (1,577,1024) and (577,128) cannot be broadcast.
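
(An aside, my own reading of the error rather than something stated here: the second shape looks like a 4-bit packed version of the first, i.e. the position embedding itself got quantised. A back-of-the-envelope check, assuming 4-bit values packed into 32-bit words:)

hidden_size = 1024
bits = 4
packed_cols = hidden_size * bits // 32  # = 128, matching the (577, 128) shape in the error
print(packed_cols)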

Blaizzy commented May 14, 2024

Could you install from source?

I'm working on this branch:

https://github.com/Blaizzy/mlx-vlm/tree/pc/quantise-irregular

Just clone it and:

pip install -e .

I haven't updated the PyPI package yet.

cmgzy commented May 14, 2024

Could you install from source?

I'm working on this branch:

https://github.com/Blaizzy/mlx-vlm/tree/pc/quantise-irregular

Just clone it and:

pip install -e .

I haven't updated the PyPI package yet.

It works for mlx-community/llava-1.5-7b-4bit.

Blaizzy commented May 14, 2024

Awesome!

I will update all 4bit models with the latest mlx core 👌🏽

Blaizzy commented May 14, 2024

And I will update the PyPI package today as well.

Is there anything else you want me to address?

cmgzy commented May 14, 2024

And I will update the PyPI package today as well.

Is there anything else you want me to address?

There is.

For example:
python -m mlx_vlm.generate --model mlx-community/llava-llama-3-8b-v1_1-8bit \
  --prompt "How many dogs are there in the image? Answer the question using a single word or phrase." \
  --image "/Users/chenmi/Pictures/xx.jpg" \
  --max-tokens 512 --temp 0.2

The response is weird...

Image: /Users/chenmi/Pictures/xx.jpg

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

How many dogs are there in the image? Answer the question using a single word or phrase.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

6<|eot_id|><|eot_id|><|eot_id|> … all the dogs are black and white<|eot_id|><|eot_id|> … except for 2<|eot_id|><|eot_id|><|eot_id|> … (the <|eot_id|> token then repeats for hundreds of tokens until the token limit is hit)
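
A quick way to check whether the stop token is the culprit (my own diagnostic sketch, not something from this repo; it assumes the Hub repo ships standard tokenizer files loadable by transformers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlx-community/llava-llama-3-8b-v1_1-8bit")
# If eos_token is not <|eot_id|>, generation will run straight past it,
# which would explain the wall of repeated tokens above.
print(tok.eos_token, tok.eos_token_id)
print(tok.convert_tokens_to_ids("<|eot_id|>"))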

Blaizzy commented May 14, 2024

Just patched the tokenizer

Can you redownload it from the hub and let me know if the issue persists?
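
(If the old files are cached locally, forcing a fresh copy should pick up the patched tokenizer — a sketch using huggingface_hub, not a command from this thread:)

from huggingface_hub import snapshot_download

# Re-fetch the repo so the patched tokenizer files replace any cached ones
snapshot_download("mlx-community/llava-llama-3-8b-v1_1-8bit", force_download=True)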

Blaizzy commented May 14, 2024

Regarding the answer the model gave:

I've yet to update the quantisation. I will ping you once I update it.

cmgzy commented May 14, 2024

Just patched the tokenizer

Can you redownload it from the hub and let me know if the issue persists?

It's fixed!

Blaizzy commented May 14, 2024

Fantastic!

I'm uploading LLaVA 8-bit, then I will patch the other models 👌🏽

cmgzy commented May 15, 2024

Did you test the transformers versions of the previous models you reported?

Could you share sample code for testing the transformers versions on Apple Silicon GPUs? That way I can help you test other models going forward. Thanks!

Blaizzy commented May 15, 2024

Here you go:

from transformers import AutoProcessor, AutoModelForPreTraining
from PIL import Image
import requests
import torch

model_id = "llava-hf/llava-1.5-7b-hf"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)  # load_image was undefined; use the PIL and requests imports above

model = AutoModelForPreTraining.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Ask the model to caption the image; LLaVA-1.5 HF checkpoints expect the <image> token in the prompt
prompt = "USER: <image>\nCaption this image. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)  # generation is already 1-D after slicing
print(decoded)
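
Since the question was about Apple Silicon GPUs: assuming a PyTorch build with MPS support, you can move the model and inputs to the "mps" device before the generate call, e.g.:

# Optional: run on the Apple Silicon GPU via Metal (falls back to CPU otherwise)
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)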

cmgzy commented May 16, 2024

Here you go:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "mlx-community/llava-1.5-7b-4bit"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)  # load_image was undefined; use the PIL and requests imports above

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Ask the model to caption the image
prompt = "Caption this image"
inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)  # generation is already 1-D after slicing
print(decoded)

Which model class should I use to load "mlx-community/llava-1.5-7b-4bit"? It probably isn't PaliGemmaForConditionalGeneration...

Blaizzy commented May 16, 2024

Sorry, fixed it ✅

Use AutoModelForPreTraining
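
Spelled out, matching the sample I shared above:

from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "llava-hf/llava-1.5-7b-hf"  # the HF checkpoint from the sample above
model = AutoModelForPreTraining.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)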
