
Llama-3 8B Instruct quantized to 8 Bit spits out gibberish in transformers model.generate() but works fine in vLLM? #657

Open
davidgxue opened this issue Apr 28, 2024 · 5 comments

@davidgxue

Problem description

Hi friends, hope someone can help out or point me in the right direction here. I feel like this may be an integration issue with transformers? I can't understand why this model spits out gibberish in transformers but works just fine in vLLM. I thought it might be decoding strategy/sampling related, but that doesn't feel right either, considering the following very odd observations:

  1. This 8 bit quantized Llama 3 model works perfectly fine with vLLM for inference: no gibberish with the exact same params and prompt (a rough sketch of the vLLM call is below this list). But I get gibberish with both Hugging Face transformers' model.generate() and the text generation pipeline.

  2. I re-quantized it again with different package versions. Same problem: gibberish if using transformers, but works fine with vLLM inference.

  3. I also made a 4 bit quant model using the same dataset, same environment, same script, same setup, yet that 4 bit model works fine with both vLLM and transformers. No gibberish when using transformers, basically no issues at all. I have listed the 4 bit quant config below as well. This is the part that confuses me the most...
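
For reference, this is roughly how I run the same prompt through vLLM (a minimal sketch of my setup, not the exact script; engine args may differ slightly):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Who made you?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# vLLM loads the same GPTQ checkpoint and samples with the same settings
llm = LLM(model=model_id, quantization="gptq")
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=100))
print(outputs[0].outputs[0].text)  # coherent output here, unlike transformers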

8 bit quant config (gibberish 8 bit model)

quantize_config = BaseQuantizeConfig(
        bits=8,
        group_size=32,
        desc_act=True,
        damp_percent=0.1,
        true_sequential=True,
        sym=True,
)

For comparison, the 4 bit model's quant config (the model that works fine without gibberish):

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=True,
        damp_percent=0.1,
        true_sequential=True,
        sym=True,
)

Quantization dataset

  • 500 rows of random pieces of wikitext
  • I don't think the dataset is the problem, because the 4 bit model that works fine with model.generate() was quantized with the exact same dataset (a rough sketch of the quantization flow is below)
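
For completeness, the quantization flow looks roughly like this (a minimal sketch assuming the standard AutoGPTQ quantize/save path; the wikitext sampling here is paraphrased, not my exact script):

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# ~500 random non-empty wikitext rows as calibration examples
rows = load_dataset("wikitext", "wikitext-2-raw-v1", split="train").shuffle(seed=42)
texts = [t for t in rows["text"] if t.strip()][:500]
examples = [tokenizer(t, return_tensors="pt") for t in texts]

# Same config as shown above (8 bit variant)
quantize_config = BaseQuantizeConfig(
    bits=8, group_size=32, desc_act=True, damp_percent=0.1, true_sequential=True, sym=True
)

model = AutoGPTQForCausalLM.from_pretrained(base_model_id, quantize_config)
model.quantize(examples)
model.save_quantized("Llama-3-8B-Instruct-GPTQ-8-Bit", use_safetensors=True)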

Inference script:

import torch
import transformers
print(transformers.__version__)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 8 bit GPTQ model directly through transformers
model = AutoModelForCausalLM.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", torch_dtype=torch.bfloat16, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Who made you?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda:0')
print(tokenizer.decode(tokenized_chat[0]))
output = model.generate(inputs=tokenized_chat, temperature=0.7, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Output:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Who made you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

_<? Shoreuros_<?_<? fø_<?�470_<?_<? тру_<?utzer�_<?_<?_<?utzeremmeimersonse_<?_<?_<?_<?ribaimersutzer_<?emouth_<?_<?_<?_<?utzer姫utzer_<?utzerregon snaimers469_<?_<?_<?_<?_<?_<?utzerregonutzerutzerutzerutzerutzer_<?utzer-reduxutzeronseutzerutzer_<?utzer_<?emmeimers_<?_<?_<?utzeronse_<?utzer_<?_<? snaimers cubutzer snaimers_<?_<?_<?_<?utzerutzerecessonse_<?_<?_<?_<?_<?_<?_<? Shore

I also tried loading the model with AutoGPTQ directly; that also produces gibberish:

from auto_gptq import AutoGPTQForCausalLM
gptq_model = AutoGPTQForCausalLM.from_quantized("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", device="cuda:0")
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda:0')
print(tokenizer.decode(tokenized_chat[0]))
output = gptq_model.generate(inputs=tokenized_chat, temperature=0.7, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Software versions:

  • transformers: tried both 4.38.2 and 4.40.0dev
  • auto_gptq: 0.7.1
  • optimum: 1.19.1

@Qubitium
Contributor

@davidgxue Use latest 4.40.1 or even latest release. They just fixed a llama generate issue regression that I encountered. This bug is specific to transformers and llama.

#614

@davidgxue
Author

davidgxue commented Apr 30, 2024

@Qubitium I tried 4.40.1; it has the same problem. I am also already installing directly from the transformers GitHub repo (4.41.0.dev0).

To add to the above: I get gibberish if I use bfloat16, but if I use float16 I get nan logits, with an error message like "probability tensor contains either inf, nan or element < 0". This is also noted by someone else here: https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit/discussions/5.
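
For reference, this is roughly how I confirmed the nan logits (a quick sketch reusing model and tokenized_chat from the inference script above, with the model loaded in torch.float16):

import torch

# Inspect the raw logits for nan/inf before any sampling happens
with torch.no_grad():
    logits = model(input_ids=tokenized_chat).logits
print(torch.isnan(logits).any().item(), torch.isinf(logits).any().item())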

What is very interesting: I was working on PR #651 to extend AutoGPTQ support to Phi-3, and I was asked to post perplexity results. For simplicity I used AutoGPTQ's benchmarking script (which uses float16 internally), and that script produces nan perplexity on the GPTQ 8 bit model (the original model is fine). After inserting some debugging code, I found that all the logits produced by this GPTQ quant are nan.
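
The debugging code I inserted was roughly along these lines (a hypothetical sketch; find_first_nan is my own helper name, not an AutoGPTQ or transformers API):

import torch

def find_first_nan(model, input_ids):
    # Register forward hooks on every module and record the names of the
    # first few modules whose output contains nan
    hits = []

    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and torch.isnan(out).any():
                hits.append(name)
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    with torch.no_grad():
        model(input_ids=input_ids)
    for h in handles:
        h.remove()
    return hits[:5]

print(find_first_nan(model, tokenized_chat))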

But given that huggingface/transformers#30380 has already been merged into the transformers code base, and I am building from the latest source, does this imply something else is broken between transformers and AutoGPTQ?

And to add to this: when I was working on the Phi-3 support PR, I was on transformers 4.41.0.dev0 and everything worked fine back then: no gibberish inference and no nan logits. I am on 4.41.0.dev0 right now as well (and having trouble with it), and if I run the exact same inference script it now produces nan logits or gibberish depending on the dtype.

@davidgxue
Author

davidgxue commented Apr 30, 2024

So I feel like this may not be related to huggingface/transformers#30380, since it's already merged? Or did something else get added that broke things?

So far we know transformers 4.38.2, 4.40.1, and 4.41.0.dev0 (the current dev version) are broken... just compiling things for reference.

By the way, Phi-3 uses the Llama 2 architecture, so this may still be a llama-family related problem...

@fxmarty
Collaborator

fxmarty commented Apr 30, 2024

Maybe related: huggingface/transformers#27179

"probability tensor contains either inf, nan or element < 0" is a common issue that I have even witnessed with unquantized models. I did not find the root cause though. Maybe some fp16/bf16/autocast behavior.

I'll try to have a look if I get the time.

@davidgxue
Author

davidgxue commented Apr 30, 2024

Yeah, so I looked into the nan logits issue... the thing is, most people get around it by loading in bfloat16, after which the behavior looks normal and correct, and those reports are mostly about base models. In our case the base model is fine, but the quantized model is broken.

Notably, in this scenario both dtypes seem to be broken: loading in float16 gives nan logits, and loading in bfloat16 gives gibberish. vLLM and other text generation engines (based on other people's reports) behave just fine... so yeah, it's definitely something with transformers, and it is very recent and consistent. I have a feeling it may not be related to the older posts about nan logits that we can find... Anyway, I will look into this as well. My knowledge of AutoGPTQ internals isn't as deep, so it will be slower, but I can at least say for sure that all 8 bit quants of llama family models are consistently affected...
