[BUG] Llama 3 8B Instruct - no_inject_fused_attention must be true or else errors out #646

Open
davidgxue opened this issue Apr 20, 2024 · 8 comments
Labels: bug (Something isn't working)


@davidgxue

Describe the bug
I initially discovered the issue when testing the quantized model with oobabooga's text-generation-webui. When running inference on the GPTQ quant of Llama 3, I get the traceback below:

Traceback (most recent call last):
  File "/home/user/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/modules/text_generation.py", line 382, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/auto_gptq/modeling/_base.py", line 447, in generate
    return self.model.generate(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2791, in _sample
    outputs = self(
              ^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1208, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1018, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 741, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 56, in forward
    query_states, key_states, value_states = torch.split(qkv_states, self.hidden_size, dim=2)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 3, got 2)

Hardware details
Running the quantized model on an Nvidia T4; it was quantized on an A100 40G. I don't think hardware matters here, as I tested a few setups.

Software version
AutoGPTQ 0.7.1

To Reproduce

  1. Quantize Llama 3 8B Instruct using AutoGPTQ with Act Order = True, Group size = 32, Bits = 8.
  2. Load the quantized model with inject_fused_attention=True vs. with inject_fused_attention=False.
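
Roughly the code path I'm using (a condensed sketch, not my exact script; examples stands in for my calibration data and the save path is a placeholder):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Step 1: quantize (act order corresponds to desc_act in AutoGPTQ)
quantize_config = BaseQuantizeConfig(bits=8, group_size=32, desc_act=True)
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", quantize_config
)
model.quantize(examples)  # examples = list of tokenized calibration samples
model.save_quantized("llama-3-8b-instruct-gptq-8bit")

# Step 2: load. With inject_fused_attention=True, generation fails with the
# ValueError above; with inject_fused_attention=False it works.
model = AutoGPTQForCausalLM.from_quantized(
    "llama-3-8b-instruct-gptq-8bit",
    device="cuda:0",
    inject_fused_attention=True,
)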

Expected behavior
No error should occur. Unless I misunderstood, I thought this issue was addressed back when TheBloke first raised it for Llama 2 70B?

davidgxue added the bug label Apr 20, 2024
@LaaZa
Contributor

LaaZa commented Apr 20, 2024

Llama 3 is a new model, so maybe the fused attention does not work with it. It can be enabled because the model identifies itself as llama.
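
For illustration, both points show up in the checkpoint's config: it reports model_type "llama" (which is what allows fused attention to be injected), while using far fewer KV heads than query heads. A quick check, assuming the standard Llama 3 8B Instruct config:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(cfg.model_type)           # "llama" -> treated like earlier Llama models
print(cfg.num_attention_heads)  # 32 query heads
print(cfg.num_key_value_heads)  # 8 KV heads -> grouped-query attention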

@davidgxue
Author

Llama 2 (the 70B variant) introduced grouped-query attention (GQA), and AutoGPTQ previously ran into this same error with it (ValueError: not enough values to unpack (expected 3, got 2) when inject_fused_attention=True). Has that issue been fixed or not? The issue TheBloke raised, #210, is still open, but based on his PRs/commits it looks like it may have been addressed? If it has not been addressed, then this error makes sense. Can you confirm?
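
For reference, here is the shape arithmetic as I understand it (my own standalone sketch, not AutoGPTQ code):

import torch

# Llama 3 8B: hidden_size=4096, 32 query heads and 8 KV heads of dim 128,
# so the fused QKV projection outputs 4096 + 1024 + 1024 = 6144 features.
hidden_size, kv_dim = 4096, 1024
qkv_states = torch.randn(1, 16, hidden_size + 2 * kv_dim)

# torch.split with an int makes equal chunks of that size: 6144 splits into
# [4096, 2048], i.e. only 2 tensors, hence "expected 3, got 2" on unpacking.
chunks = torch.split(qkv_states, hidden_size, dim=2)
print(len(chunks))  # 2

# A GQA-aware split needs explicit sizes:
q, k, v = torch.split(qkv_states, [hidden_size, kv_dim, kv_dim], dim=2)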

I am happy to look into this if we can confirm the nature of the issue.

@LaaZa
Contributor

LaaZa commented Apr 22, 2024

That PR does not fix the issue; someone just mentions it there. I don't know if it is supposed to be fixed elsewhere. Only the 70B used GQA, so it did not come up that often.

#237 should fix it, so I assume it isn't fixed yet.

@davidgxue
Author

Oh, I didn't realize #237 was there. Also interesting that fused attention has to be disabled when both ExLlama and act-order are enabled.

Looks like that PR has been open for over half a year now. Is help needed? I'm happy to help out, and I can contact the PR owner as well.

@LaaZa
Contributor

LaaZa commented Apr 22, 2024

I'm trying to see if I can get it updated and working.

@LaaZa
Contributor

LaaZa commented Apr 23, 2024

Okay, so the situation is the following: fused attention does not seem to work at all due to transformers changes, especially around the cache. It might work without the cache, but that is not a good idea.
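
Roughly what changed on the transformers side (a sketch of my reading, not AutoGPTQ code): past_key_values used to be plain per-layer tuples, but recent versions pass Cache objects such as DynamicCache, which the old fused attention module was never written against.

import torch
from transformers import DynamicCache

cache = DynamicCache()
key_states = torch.randn(1, 8, 16, 128)    # (batch, kv_heads, seq_len, head_dim)
value_states = torch.randn(1, 8, 16, 128)

# New-style: each attention layer appends its K/V through cache.update(...)
keys, values = cache.update(key_states, value_states, layer_idx=0)

# The legacy tuple format can still be converted back and forth
legacy = cache.to_legacy_cache()
cache2 = DynamicCache.from_legacy_cache(legacy)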

I'm not yet sure I can figure out a solution, and it seems fused attention and MLP are somewhat abandoned functionality (#573).

You could try the Marlin format to improve performance, but it only supports 4-bit. I would not recommend 8-bit AutoGPTQ anyway; bitsandbytes should work fine for that.
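
For the Marlin route, loading would look roughly like this (a sketch; use_marlin is the flag I believe recent AutoGPTQ releases expose, and as far as I know Marlin also needs a symmetric 4-bit quant and an Ampere-or-newer GPU):

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path-to-a-4bit-gptq-model",  # placeholder, not a specific repo
    device="cuda:0",
    use_marlin=True,
)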

@davidgxue
Author

davidgxue commented Apr 25, 2024

Is there a reason why AutoGPTQ is not recommended for an 8-bit configuration? Is it about inference speed/performance, or more a question of accuracy? Regardless, thanks for looking into this. I think fused attention may become more and more common with new model releases, so it may be worth revisiting at some point.

@LaaZa
Contributor

LaaZa commented Apr 25, 2024

It's not that it would be worse, but it's a lot more hassle for no real benefit. Also, all of the kernels are focused on 4-bit. You can just load a model in 8-bit with bitsandbytes and it's fine; GPTQ at 8-bit doesn't support the fastest kernels, and you have to pre-quantize the model. The file is a bit smaller for sure, but disk space is hardly an issue if you have the hardware to run the model at 8-bit.
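
For comparison, the bitsandbytes path is just a load-time flag with no pre-quantization step (a minimal sketch using the transformers integration):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)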
