[BUG] Llama 3 8B Instruct - no_inject_fused_attention must be true or else errors out #646
Comments
Llama 3 is a new model, so maybe fused attention does not work with it. It can get enabled because the model identifies itself as a Llama model.
Llama 2 introduced grouped multi-query attention (GQA), and AutoGPTQ previously ran into the same error with it. I am happy to look into this if we can confirm the nature of the issue.
That PR does not fix this issue; someone just mentions the issue there. I don't know if it is supposed to be fixed elsewhere. Only the 70B model used GQA, so it did not come up that often. #237 should fix it, so I assume it isn't fixed yet.
Oh, I didn't realize #237 was there. It is also interesting that we have to disable fused attention when both exllama and act-order are activated. That PR has been open for over half a year now. Is help needed? I'm happy to help out, and I can contact the PR owner as well.
I'm trying to see if I can get it updated and working.
Okay, so the situation is the following: fused attention does not seem to work at all due to transformers changes, especially the cache. It might work without the cache, but that is not a good idea. I'm not yet sure I can figure out a solution, and fused attention and fused MLP seem to be somewhat abandoned functionality (#573). You could try the Marlin format to improve performance, but it only supports 4-bit. I would not recommend 8-bit AutoGPTQ anyway; bitsandbytes should work fine for that.
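For reference, loading a 4-bit GPTQ checkpoint with the Marlin kernel looks roughly like this. This is only a minimal sketch: the model path is a placeholder, and it assumes the checkpoint meets Marlin's constraints (4-bit only, as noted above).

```python
from auto_gptq import AutoGPTQForCausalLM

# Placeholder path to a 4-bit GPTQ checkpoint; Marlin does not support 8-bit.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/llama-3-8b-instruct-gptq-4bit",
    device="cuda:0",
    use_marlin=True,               # use the Marlin kernel for faster inference
    inject_fused_attention=False,  # keep fused attention off, per this issue
)
```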
Is there a reason why 8-bit AutoGPTQ is not recommended for an 8-bit configuration? Is it about inference speed/performance, or more a question of accuracy? Regardless, thanks for looking into this. I think fused attention may become more and more common with new model releases, so maybe it's worth revisiting at some point.
It's not that it would be worse, but it's a lot more hassle for no real benefit, and all of the kernels are focused on 4-bit. You can just load a model in 8-bit with bitsandbytes and it works fine; GPTQ at 8-bit doesn't support the fastest kernels, and you have to pre-quantize the model. The GPTQ checkpoint is a bit smaller for sure, but disk space is hardly an issue if you have the hardware to run the model at 8-bit.
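To illustrate the bitsandbytes route mentioned above: loading an unquantized checkpoint in 8-bit on the fly via transformers is roughly this (a minimal sketch; the model id is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example unquantized checkpoint; no pre-quantization step is needed.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights at load time
    device_map="auto",
)
```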
Describe the bug
I initially discovered the issue when testing the quantized model with oobabooga's text-generation-webui. When running inference on the GPTQ quant of Llama 3, I get the logs below.
Hardware details
Running the quantized model on an Nvidia T4; quantized using an A100 40G. I don't think hardware matters here, as I tested a few.
Software version
AutoGPTQ 0.7.1
To Reproduce
Load the quantized model with inject_fused_attention=True vs. with inject_fused_attention=False.
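A minimal reproduction sketch (the model path is a placeholder for the quantized Llama 3 8B Instruct checkpoint):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = "path/to/Meta-Llama-3-8B-Instruct-GPTQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Errors out during generation with fused attention injected:
# model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0", inject_fused_attention=True)

# Works with fused attention disabled:
model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0", inject_fused_attention=False)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```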
Expected behavior
No error should occur. Unless I misunderstood, I thought this issue was addressed back when TheBloke first raised it for Llama 2 70B?