[BUG] Llama 3 8B Instruct - no_inject_fused_attention must be true or else errors out #646
Comments
Llama 3 is a new model, so maybe fused attention does not work with it. It can get enabled because the model identifies itself as a Llama model.
Llama 2 introduced grouped multi-query attention (GQA), and AutoGPTQ previously ran into the same error with it. I am happy to look into this if we can confirm the nature of the issue.
That PR does not fix this issue; someone just mentions the issue there. I don't know if it is supposed to be fixed elsewhere. Only the 70B model used GQA, so it did not come up that often. #237 should fix it, so I assume it isn't fixed yet.
Oh, I didn't realize #237 was there. It is also interesting that we have to disable fused attention when both exllama and act-order are activated. That PR has been open for over half a year now. Is help needed? I'm happy to help out, and I can contact the PR owner as well.
I'm trying to see if I can get it updated and working.
Okay, so the situation is the following: fused attention does not seem to work at all due to transformers changes, especially the cache. It might work without the cache, but that is not a good idea. I'm not yet sure I can figure out a solution, and fused attention and fused MLP seem to be somewhat abandoned functionality (#573). You could try the Marlin format to improve performance, but it only supports 4-bit. I would not recommend 8-bit AutoGPTQ anyway; bitsandbytes should work fine for that.
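For reference, loading a 4-bit GPTQ checkpoint with the Marlin kernel looks roughly like this. This is only a minimal sketch: the model path is a placeholder, and it assumes the checkpoint meets Marlin's constraints (4-bit only, as noted above).

```python
from auto_gptq import AutoGPTQForCausalLM

# Placeholder path to a 4-bit GPTQ checkpoint; Marlin does not support 8-bit.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/llama-3-8b-instruct-gptq-4bit",
    device="cuda:0",
    use_marlin=True,               # use the Marlin kernel for faster inference
    inject_fused_attention=False,  # keep fused attention off, per this issue
)
```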
Is there a reason why 8-bit AutoGPTQ is not recommended for an 8-bit configuration? Is it about inference speed/performance, or more a question of accuracy? Regardless, thanks for looking into this. I think fused attention may become more and more common with new model releases, so maybe it's worth revisiting at some point.
It's not that it would be worse, but it's a lot more hassle for no real benefit, and all of the kernels are focused on 4-bit. You can just load a model in 8-bit with bitsandbytes and it works fine; GPTQ at 8-bit doesn't support the fastest kernels, and you have to pre-quantize the model. The GPTQ checkpoint is a bit smaller for sure, but disk space is hardly an issue if you have the hardware to run the model at 8-bit.
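To illustrate the bitsandbytes route mentioned above: loading an unquantized checkpoint in 8-bit on the fly via transformers is roughly this (a minimal sketch; the model id is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example unquantized checkpoint; no pre-quantization step is needed.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights at load time
    device_map="auto",
)
```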
Describe the bug
I initially discovered the issue when testing the quantized model with oobabooga's text-generation-webui. When running inference on the GPTQ quant of Llama 3, I get the logs below.
Hardware details
Running the quantized model on an Nvidia T4; quantized using an A100 40G. I don't think hardware matters here, as I tested a few.
Software version
AutoGPTQ 0.7.1
To Reproduce
Load the quantized model with inject_fused_attention=True vs. with inject_fused_attention=False.
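A minimal reproduction sketch (the model path is a placeholder for the quantized Llama 3 8B Instruct checkpoint):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = "path/to/Meta-Llama-3-8B-Instruct-GPTQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Errors out during generation with fused attention injected:
# model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0", inject_fused_attention=True)

# Works with fused attention disabled:
model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0", inject_fused_attention=False)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```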
Expected behavior
No error should occur. Unless I misunderstood, I thought this issue was addressed back when TheBloke first raised it for Llama 2 70B?