[Feature Request] LLaVA 1.6 LoRA fine-tuning example #605

Open
tctrautman opened this issue Mar 21, 2024 · 15 comments


@tctrautman

Building on the amazing work by @mzbac and @nkasmanoff in #461, I'd really love an example of how LLaVA 1.6 (aka llava next) can be fine-tuned with a LoRA.

I might be able to make progress on this myself, but it'll take me some time. Any help or thoughts on how to best approach this would be appreciated. (Especially from @mzbac and/or @nkasmanoff.)

@mzbac
Contributor

mzbac commented Mar 22, 2024

I was waiting for Llava 1.6 support to be merged into transformers so we'd have consistent model weight naming conventions. I haven't looked at the details of it, but the most challenging part would be in https://github.com/ml-explore/mlx-examples/blob/main/llava/llava.py#L104. Do you mind starting a draft PR for it? I'm happy to help in any way if you get stuck on the implementation.
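
Roughly, what that function has to do is splice the projected image features into the text embeddings at the position of the image placeholder token. A minimal sketch with illustrative names (not the actual llava.py code):

import mlx.core as mx

def merge_image_features(input_ids, text_embeds, image_features, image_token_id):
    # input_ids: (seq_len,) token ids; text_embeds: (seq_len, hidden);
    # image_features: (num_image_tokens, hidden) output of the multimodal projector
    positions = [i for i, t in enumerate(input_ids.tolist()) if t == image_token_id]
    pos = positions[0]  # assumes a single image placeholder in the prompt
    return mx.concatenate(
        [text_embeds[:pos], image_features, text_embeds[pos + 1 :]], axis=0
    )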

@tctrautman
Author

@mzbac sounds good! I'll take a look over the weekend and see how it goes -- a lot of this is going to be pretty new for me.

Quick question as I'm getting started: do you expect it to be complex to support both the Mistral and Vicuna models?

I ask because I see the line below where we've explicitly said we only support the Llama language model.

if self.model_type != "llama":

I'm personally most interested in getting the Mistral 7B version working, but I'd be open to working with Vicuna if you expect that to be a better place to get started.

@mzbac
Contributor

mzbac commented Mar 22, 2024

Mistral uses the same Llama architecture. If the merge-image-features-with-input-ids function works properly, the language model part should just work out of the box.
You can take a look at the model weight here -> https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf
Also, check out this ASCII diagram I created for the llava 1.5 model: https://gist.github.com/mzbac/00ebe60bb36fa4d8f65509f8e47350d5
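
So relaxing the check you quoted should be enough on the language-model side. A hypothetical sketch, not the actual llava.py code:

def check_language_model_type(model_type: str) -> None:
    # Mistral reuses the Llama block structure, so both can share the same code path
    supported = ("llama", "mistral")
    if model_type not in supported:
        raise ValueError(f"Model type {model_type} is not supported; expected one of {supported}.")

check_language_model_type("mistral")  # would no longer raise for the 1.6 Mistral weights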

@jrp2014

jrp2014 commented Mar 22, 2024

I couldn't get 1.6 to work either, as it said that only llama models (and not mistral) were supported.

On a further test, trying llava-hf/llava-v1.6-34b-hf, I got:

preprocessor_config.json: 100%|█████████████████████████████████████████████████████| 754/754 [00:00<00:00, 958kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████| 1.86k/1.86k [00:00<00:00, 4.99MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████| 1.03M/1.03M [00:00<00:00, 5.56MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████| 23.0/23.0 [00:00<00:00, 81.9kB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████| 748/748 [00:00<00:00, 3.56MB/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
config.json: 100%|█████████████████████████████████████████████████████████████| 1.41k/1.41k [00:00<00:00, 6.51MB/s]
Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/mytest.py", line 3, in <module>
    processor, model = load_model("llava-hf/llava-v1.6-34b-hf")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/generate.py", line 83, in load_model
    processor = AutoProcessor.from_pretrained(model_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/processing_auto.py", line 340, in from_pretrained
    raise ValueError(
ValueError: Unrecognized processing class in llava-hf/llava-v1.6-34b-hf. Can't instantiate a processor, a tokenizer, an image processor or a feature extractor for this model. Make sure the repository contains the files of at least one of those processing classes.

... updating the transformers library and installing protobuf seems to be more promising ...

@tctrautman
Author

@jrp2014 I ran into a similar processor error before I realized I hadn't updated HF transformers -- updating to 4.39.1 resolved it on my end.
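
If it helps, a quick sanity check you can drop at the top of the script (assuming 4.39.1 is the cutoff, as it was for me):

import transformers
from packaging import version  # packaging ships as a transformers dependency

assert version.parse(transformers.__version__) >= version.parse("4.39.1"), (
    "update transformers (and install protobuf) before loading the llava-v1.6 processors"
)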

@jrp2014

jrp2014 commented Mar 23, 2024

I now get

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Fetching 23 files: 100%|█████████████████████████████████████████████████████████| 23/23 [00:00<00:00, 28676.87it/s]
Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/mytest.py", line 3, in <module>
    processor, model = load_model("llava-hf/llava-v1.6-34b-hf")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/generate.py", line 84, in load_model
    model = LlavaModel.from_pretrained(model_path)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/llava.py", line 178, in from_pretrained
    model.load_weights(list(weights.items()))
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx/nn/layers/base.py", line 203, in load_weights
    raise ValueError(f"Received parameters not in model: {extras}.")
ValueError: Received parameters not in model: image_newline.
python mytest.py  1.41s user 3.73s system 171% cpu 3.002 total

I'll have a look to see whether there are any obvious fixes.
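
For reference, image_newline appears to be a new learned parameter in the LLaVA 1.6 checkpoints (a single hidden-size vector appended after each row of image patches), so one candidate fix is to declare it on the MLX model so load_weights has somewhere to put it. A minimal sketch with an illustrative class name and an assumed hidden size, not the actual llava.py layout:

import mlx.core as mx
import mlx.nn as nn

# In mlx, any mx.array attribute of an nn.Module is treated as a parameter,
# so declaring the extra weight is enough for load_weights to accept it.
class MultiModalCore(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # LLaVA 1.6's learned "newline" embedding, appended after each row of image patches
        self.image_newline = mx.zeros((hidden_size,))

stub = MultiModalCore(hidden_size=7168)  # 7168 assumed for the 34B text model
stub.load_weights([("image_newline", mx.zeros((7168,)))])  # no longer raises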

@nkasmanoff
Contributor

Hey great initiative! Excited to try out these new and improved LlaVA models. Agreed that using the standard transformers format makes it easier.

I've been thinking a bit about the fine tuning aspect, and can share some code snippets I have so far which might help the discussion.

There's a ton of variety in how you can fine-tune vision models, but to me the easiest direction would be to start by fine-tuning only the language model. If you want to do just that, I'm following the code from https://github.com/ml-explore/mlx-examples/tree/main/llms, where, at least to start, you can attach LoRA layers to the LLM. Here's how I have that for now.

# Helpers assumed to come from the mlx-examples LoRA tooling:
# linear_to_lora_layers (mlx_lm.tuner.utils) and a print_trainable_parameters
# reporting helper. Freeze everything first; linear_to_lora_layers then swaps
# in trainable LoRA adapters on the language model's attention projections.
model.vision_tower.freeze()
model.multi_modal_projector.freeze()
model.language_model.freeze()

lora_layers = 32  # number of transformer blocks to attach adapters to
lora_parameters = {
    "keys": ["self_attn.q_proj", "self_attn.v_proj", "self_attn.k_proj", "self_attn.out_proj"],
    "rank": 64,
    "alpha": 16.0,
    "scale": 10.0,
    "dropout": 0.0,
}

linear_to_lora_layers(model.language_model.model, lora_layers, lora_parameters)

print_trainable_parameters(model.vision_tower)
print_trainable_parameters(model.multi_modal_projector)
print_trainable_parameters(model.language_model)

# Trainable parameters: 0.000% (0.000M/303.506M)
# Trainable parameters: 0.000% (0.000M/20.980M)
# Trainable parameters: 0.747% (50.332M/6738.940M)

I'm still not sure what the best approach would be for the language-modeling piece. Maybe you mask everything except the answer? The fact that the shape of the inputs changes once you insert all the image embeddings makes this tough, and I haven't found any proper demos so far. The caveat being that if you want your vision tower to get better, maybe you fine-tune it separately, or start from a CLIP-like model that has already been fine-tuned for a different domain?

@jrp2014

jrp2014 commented Mar 23, 2024

So for 1.6 it seems that we need to produce an image_newline token at the end of each image row. (I have no idea whether it is each actual row or what the role of patches is …).

@tctrautman
Author

tctrautman commented Mar 23, 2024 via email

@jrp2014

jrp2014 commented Mar 23, 2024

Thanks. If I cheat and pass False to load_weights, so that it ignores the model parameter (image_newline) mismatch, I get:

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Fetching 23 files: 100%|███████████████████████████████████████████████| 23/23 [00:00<00:00, 319433.75it/s]
Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/mytest.py", line 13, in <module>
    reply = generate_text(
            ^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/generate.py", line 97, in generate_text
    logits, cache = model(input_ids, pixel_values)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/llava.py", line 135, in __call__
    input_embddings = self.get_input_embeddings(input_ids, pixel_values)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-examples/llava/llava.py", line 79, in get_input_embeddings
    pixel_values.transpose(0, 2, 3, 1), output_hidden_states=True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Transpose axes don't match array dimensions.

@mzbac
Contributor

mzbac commented Mar 24, 2024

We need to preprocess the image features before merging the input ids and the image features. You can find more information at https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L514-L553.

I personally feel it would be better to copy the existing llava 1.5 and create a new example for 1.6, as there have been quite a few changes introduced in llava 1.6.
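
For the image_newline part of that preprocessing, the idea is roughly the following. A sketch that assumes the patch features have already been reassembled into a (height, width, hidden) grid; names are illustrative, not the HF or llava.py API:

import mlx.core as mx

def append_image_newline(grid_features, image_newline):
    # grid_features: (height, width, hidden) grid of vision-tower patch features
    # image_newline: (hidden,) learned vector from the 1.6 checkpoint
    h, w, d = grid_features.shape
    newline = mx.broadcast_to(image_newline[None, None, :], (h, 1, d))
    with_newlines = mx.concatenate([grid_features, newline], axis=1)
    # flatten back to a token sequence; every row now ends with the newline embedding
    return with_newlines.reshape(h * (w + 1), d)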

@nkasmanoff
Contributor

I’m unfortunately busy and won’t be able to take a closer look again until later today, but re: fine-tuning demos, this video might be helpful: https://www.youtube.com/watch?v=eIziN2QUt8U

It did help, in that it confirmed that the only way to do this, it seems, is to train on what comes in the answer, i.e. after the image. With that in mind I have a quick implementation working here, which fine-tunes on a dataset where every token after the image token is trained on.

Since LlaVA works with the image token in arbitrary places, supporting that would be a cool enhancement, but it doesn't seem feasible with how this is set up, from what I could tell.
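
In case it's useful for the discussion, the masking idea boils down to something like this (a loose sketch with hypothetical names, not the actual implementation): compute the usual next-token cross-entropy, then zero out the loss on everything before the start of the answer so only the answer tokens are trained on.

import mlx.core as mx
import mlx.nn as nn

def answer_only_loss(logits, targets, answer_start):
    # logits: (seq_len, vocab_size); targets: (seq_len,) token ids;
    # answer_start: index of the first answer token after the inserted image embeddings
    token_losses = nn.losses.cross_entropy(logits, targets, reduction="none")
    mask = (mx.arange(targets.shape[0]) >= answer_start).astype(token_losses.dtype)
    return (token_losses * mask).sum() / mask.sum()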

@tctrautman
Author

Unfortunately, it's taking me longer than I had hoped to get up to speed -- I still might be able to get a draft PR up but it won't be for a while longer.

@ahmetkca

I'm really interested in this work, but I don't know where to start. Have you had a chance to make any progress, @tctrautman? I know you're doing this in your spare time and we're all grateful for that. Please let me know if I can help with anything.

@tctrautman
Author

@ahmetkca Unfortunately I haven't had time to make any progress on this, and I'm honestly not sure when I will 😞
