Too sensitive to prompting #23

Closed
cmgzy opened this issue May 9, 2024 · 21 comments · Fixed by #24
Comments

cmgzy commented May 9, 2024

I found that some VLMs are too sensitive to the prompt. For example, when I use mlx-community/llava-1.5-7b-4bit:
The image is: [image attached]

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "how many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

The response is correct:
There are nine dogs in the image.

But if I change the prompt to "How many dogs in the image?" (capital "H"):

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "How many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

The response is wrong:
There are seven dogs in the image.
I also tried llava-llama-3-8b-v1_1-8bit/llava-phi-3-mini-8bit/idefics2-8b-chatty-8bit with both "how..." and "How...", but the responses were wrong every time.
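
For anyone who wants to reproduce the comparison, here is a minimal sketch that just re-runs the CLI above with both capitalisations (model name and image path are placeholders, adjust as needed):

import subprocess

MODEL = "mlx-community/llava-1.5-7b-4bit"
IMAGE = "/Users/xxx/Pictures/xx.jpg"  # placeholder path, replace with a local image

for prompt in ("how many dogs in the image?", "How many dogs in the image?"):
    # Invoke the same CLI used above and capture its output for comparison
    result = subprocess.run(
        ["python", "-m", "mlx_vlm.generate",
         "--model", MODEL,
         "--prompt", prompt,
         "--image", IMAGE,
         "--max-tokens", "100",
         "--temp", "0.0"],
        capture_output=True, text=True,
    )
    print(f"prompt={prompt!r}\n{result.stdout.strip()}\n")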

Blaizzy commented May 9, 2024

Thanks for sharing!

I will look into this 👌🏽

cmgzy commented May 13, 2024

FYI, I found that https://huggingface.co/liuhaotian/llava-v1.6-34b does very well on the dog-counting test.
I tested it on this demo:
https://llava.hliu.cc/
I made several clips with varying numbers of dogs and prompted with "How many dogs are there in the image? Answer the question using a single word or phrase.", and it always answered correctly.

[screenshot attached]

Blaizzy commented May 13, 2024

Awesome!

That model is based on the LLaVA-NeXT architecture, which we don't support at the moment.

Would you like to make a PR to add it?

Blaizzy commented May 13, 2024

Did you test the transformers versions of the previous models you reported?

cmgzy commented May 13, 2024

Did you test the transformers versions of the previous models you reported?

Well, it's kind of difficult to do that since my laptop only has 16 GB of RAM. Transformers versions without heavy quantization run too slowly...

Blaizzy commented May 13, 2024

No worries. I will run those tests in a few

Blaizzy commented May 13, 2024

@cmgzy I ran some tests.

And it turns out the models give accurate answers if you run them in full precision or in 8-bit.

The problem is the mlx 4-bit quantisation. The latest mlx release (v0.13.0) fixes this, and the new 4-bit model answers correctly.

I'm uploading it and also adding an 8-bit version.

Please give it a try and let me know if you find any other issues.
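
Before re-testing, a quick way to confirm you're on the fixed release (a small sketch of mine, assuming the packaging library is available):

import mlx.core as mx
from packaging.version import Version

# The 4-bit quantisation fix mentioned above landed in mlx v0.13.0
assert Version(mx.__version__) >= Version("0.13.0"), (
    f"mlx {mx.__version__} predates the fix; run pip install -U mlx first."
)
print("mlx", mx.__version__, "OK")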

cmgzy commented May 14, 2024

@Blaizzy
For mlx-community/llava-1.5-7b-8bit, the model files are not ready yet.
For mlx-community/llava-1.5-7b-4bit, it cannot run with mlx==0.13.0 and mlx-vlm==0.0.4; error info:
Traceback (most recent call last):
  File "/Users/chenmi/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/chenmi/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/generate.py", line 107, in <module>
    main()
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/generate.py", line 92, in main
    output = generate(
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/utils.py", line 758, in generate
    logits, cache = model(input_ids, pixel_values)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/llava.py", line 134, in __call__
    input_embddings = self.get_input_embeddings(input_ids, pixel_values)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/llava.py", line 77, in get_input_embeddings
    *_, hidden_states = self.vision_tower(
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 220, in __call__
    return self.vision_model(x, output_hidden_states)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 193, in __call__
    x = self.embeddings(x)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 176, in __call__
    embeddings += self.position_embedding.weight
ValueError: Shapes (1,577,1024) and (577,128) cannot be broadcast.
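
(An aside, my own reading of the error rather than something stated here: the second shape looks like a 4-bit packed version of the first, i.e. the position embedding itself got quantised. A back-of-the-envelope check, assuming 4-bit values packed into 32-bit words:)

hidden_size = 1024
bits = 4
packed_cols = hidden_size * bits // 32  # = 128, matching the (577, 128) shape in the error
print(packed_cols)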

Blaizzy commented May 14, 2024

Could you install from source?

I'm working on this branch:

https://github.com/Blaizzy/mlx-vlm/tree/pc/quantise-irregular

Just clone it and:

pip install -e .

I haven't updated the PyPI package yet.

cmgzy commented May 14, 2024

Could you install from source?

I'm working on this branch:

https://github.com/Blaizzy/mlx-vlm/tree/pc/quantise-irregular

Just clone it and:

pip install -e .

I haven't updated the PyPI package yet.

It works for mlx-community/llava-1.5-7b-4bit.

Blaizzy commented May 14, 2024

Awesome!

I will update all 4bit models with the latest mlx core 👌🏽

Blaizzy commented May 14, 2024

And I will update the PyPI package today as well.

Is there anything else you want me to address?

cmgzy commented May 14, 2024

And I will update the PyPI package today as well.

Is there anything else you want me to address?

There is.

For example:
python -m mlx_vlm.generate --model mlx-community/llava-llama-3-8b-v1_1-8bit \
  --prompt "How many dogs are there in the image? Answer the question using a single word or phrase." \
  --image "/Users/chenmi/Pictures/xx.jpg" \
  --max-tokens 512 --temp 0.2

The response is weird...

Image: /Users/chenmi/Pictures/xx.jpg

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

How many dogs are there in the image? Answer the question using a single word or phrase.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

6<|eot_id|><|eot_id|><|eot_id|> … all the dogs are black and white<|eot_id|><|eot_id|> … except for 2<|eot_id|><|eot_id|><|eot_id|> … (the <|eot_id|> token then repeats for hundreds of tokens until the token limit is hit)
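
A quick way to check whether the stop token is the culprit (my own diagnostic sketch, not something from this repo; it assumes the Hub repo ships standard tokenizer files loadable by transformers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlx-community/llava-llama-3-8b-v1_1-8bit")
# If eos_token is not <|eot_id|>, generation will run straight past it,
# which would explain the wall of repeated tokens above.
print(tok.eos_token, tok.eos_token_id)
print(tok.convert_tokens_to_ids("<|eot_id|>"))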

Blaizzy commented May 14, 2024

Just patched the tokenizer

Can you redownload it from the hub and let me know if the issue persists?
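
(If the old files are cached locally, forcing a fresh copy should pick up the patched tokenizer — a sketch using huggingface_hub, not a command from this thread:)

from huggingface_hub import snapshot_download

# Re-fetch the repo so the patched tokenizer files replace any cached ones
snapshot_download("mlx-community/llava-llama-3-8b-v1_1-8bit", force_download=True)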

Blaizzy commented May 14, 2024

Regarding the answer the model gave:

I've yet to update the quantisation. I will ping you once I update it.

cmgzy commented May 14, 2024

Just patched the tokenizer

Can you redownload it from the hub and let me know if the issue persists?

It's fixed!

Blaizzy commented May 14, 2024

Fantastic!

I'm uploading LLaVA 8-bit, then I will patch the other models 👌🏽

cmgzy commented May 15, 2024

Did you test the transformers versions of the previous models you reported?

Could you share sample code for testing the transformers versions on Apple Silicon GPUs? That way I can help you test other models going forward. Thanks!

Blaizzy commented May 15, 2024

Here you go:

from transformers import AutoProcessor, AutoModelForPreTraining
from PIL import Image
import requests
import torch

model_id = "llava-hf/llava-1.5-7b-hf"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)  # load_image was undefined; use the PIL and requests imports above

model = AutoModelForPreTraining.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Ask the model to caption the image; LLaVA-1.5 HF checkpoints expect the <image> token in the prompt
prompt = "USER: <image>\nCaption this image. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)  # generation is already 1-D after slicing
print(decoded)
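
Since the question was about Apple Silicon GPUs: assuming a PyTorch build with MPS support, you can move the model and inputs to the "mps" device before the generate call, e.g.:

# Optional: run on the Apple Silicon GPU via Metal (falls back to CPU otherwise)
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)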

cmgzy commented May 16, 2024

Here you go:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "mlx-community/llava-1.5-7b-4bit"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)  # load_image was undefined; use the PIL and requests imports above

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Ask the model to caption the image
prompt = "Caption this image"
inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)  # generation is already 1-D after slicing
print(decoded)

Which model class should I use to load "mlx-community/llava-1.5-7b-4bit"? It probably isn't PaliGemmaForConditionalGeneration...

Blaizzy commented May 16, 2024

Sorry, fixed it ✅

Use AutoModelForPreTraining
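
Spelled out, matching the sample I shared above:

from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "llava-hf/llava-1.5-7b-hf"  # the HF checkpoint from the sample above
model = AutoModelForPreTraining.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)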
