
Llama-3 8B Instruct quantized to 8 Bit spits out gibberish in transformers model.generate() but works fine in vLLM? #657

Open
davidgxue opened this issue Apr 28, 2024 · 5 comments

@davidgxue

Problem description

Hi friends, hope someone can help out or point me in the right direction here. I feel like this may be an integration issue with transformers? I can't understand why this model spits out gibberish in transformers but works just fine in vLLM. I thought it might be decoding strategy/sampling related, but that doesn't feel right either, considering the following very odd observations:

  1. This 8 bit quantized Llama 3 model works perfectly fine with vLLM for inference: no gibberish with the exact same params and prompt (a rough sketch of the vLLM call is below this list). But I get gibberish with both Hugging Face transformers' model.generate() and the text generation pipeline.

  2. I re-quantized it again with different package versions. Same problem: gibberish if using transformers, but works fine with vLLM inference.

  3. I also made a 4 bit quant model using the same dataset, same environment, same script, same setup, yet that 4 bit model works fine with both vLLM and transformers. No gibberish when using transformers, basically no issues at all. I have listed the 4 bit quant config below as well. This is the part that confuses me the most...
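
For reference, this is roughly how I run the same prompt through vLLM (a minimal sketch of my setup, not the exact script; engine args may differ slightly):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Who made you?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# vLLM loads the same GPTQ checkpoint and samples with the same settings
llm = LLM(model=model_id, quantization="gptq")
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=100))
print(outputs[0].outputs[0].text)  # coherent output here, unlike transformers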

8 bit quant config (gibberish 8 bit model)

quantize_config = BaseQuantizeConfig(
        bits=8,
        group_size=32,
        desc_act=True,
        damp_percent=0.1,
        true_sequential=True,
        sym=True,
)

For comparison, the 4 bit model's quant config (the model that works fine without gibberish):

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=True,
        damp_percent=0.1,
        true_sequential=True,
        sym=True,
)

Quantization dataset

  • 500 rows of random pieces of wikitext
  • I don't think the dataset is the problem, because the 4 bit model that works fine with model.generate() was quantized with the exact same dataset (a rough sketch of the quantization flow is below)
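
For completeness, the quantization flow looks roughly like this (a minimal sketch assuming the standard AutoGPTQ quantize/save path; the wikitext sampling here is paraphrased, not my exact script):

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# ~500 random non-empty wikitext rows as calibration examples
rows = load_dataset("wikitext", "wikitext-2-raw-v1", split="train").shuffle(seed=42)
texts = [t for t in rows["text"] if t.strip()][:500]
examples = [tokenizer(t, return_tensors="pt") for t in texts]

# Same config as shown above (8 bit variant)
quantize_config = BaseQuantizeConfig(
    bits=8, group_size=32, desc_act=True, damp_percent=0.1, true_sequential=True, sym=True
)

model = AutoGPTQForCausalLM.from_pretrained(base_model_id, quantize_config)
model.quantize(examples)
model.save_quantized("Llama-3-8B-Instruct-GPTQ-8-Bit", use_safetensors=True)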

Inference script:

import torch
import transformers
print(transformers.__version__)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 8 bit GPTQ model directly through transformers
model = AutoModelForCausalLM.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", torch_dtype=torch.bfloat16, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Who made you?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda:0')
print(tokenizer.decode(tokenized_chat[0]))
output = model.generate(inputs=tokenized_chat, temperature=0.7, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Output:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Who made you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

_<? Shoreuros_<?_<? fø_<?�470_<?_<? тру_<?utzer�_<?_<?_<?utzeremmeimersonse_<?_<?_<?_<?ribaimersutzer_<?emouth_<?_<?_<?_<?utzer姫utzer_<?utzerregon snaimers469_<?_<?_<?_<?_<?_<?utzerregonutzerutzerutzerutzerutzer_<?utzer-reduxutzeronseutzerutzer_<?utzer_<?emmeimers_<?_<?_<?utzeronse_<?utzer_<?_<? snaimers cubutzer snaimers_<?_<?_<?_<?utzerutzerecessonse_<?_<?_<?_<?_<?_<?_<? Shore

I also tried loading the model with AutoGPTQ directly; that also produces gibberish:

from auto_gptq import AutoGPTQForCausalLM
gptq_model = AutoGPTQForCausalLM.from_quantized("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", device="cuda:0")
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda:0')
print(tokenizer.decode(tokenized_chat[0]))
output = gptq_model.generate(inputs=tokenized_chat, temperature=0.7, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Software versions:

  • transformers: tried both 4.38.2 and 4.40.0dev
  • auto_gptq: 0.7.1
  • optimum: 1.19.1

@Qubitium
Contributor

@davidgxue Use latest 4.40.1 or even latest release. They just fixed a llama generate issue regression that I encountered. This bug is specific to transformers and llama.

#614

@davidgxue
Author

davidgxue commented Apr 30, 2024

@Qubitium I tried 4.40.1; it has the same problem. I am also already installing directly from the transformers GitHub repo (4.41.0.dev0).

To add to the above: I get gibberish if I use bfloat16, but if I use float16 I get nan logits, with an error message like "probability tensor contains either inf, nan or element < 0". This is also noted by someone else here: https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit/discussions/5.
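
For reference, this is roughly how I confirmed the nan logits (a quick sketch reusing model and tokenized_chat from the inference script above, with the model loaded in torch.float16):

import torch

# Inspect the raw logits for nan/inf before any sampling happens
with torch.no_grad():
    logits = model(input_ids=tokenized_chat).logits
print(torch.isnan(logits).any().item(), torch.isinf(logits).any().item())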

What is very interesting: I was working on PR #651 to extend AutoGPTQ support to Phi-3, and I was asked to post perplexity results. For simplicity I used AutoGPTQ's benchmarking script (which uses float16 internally), and that script produces nan perplexity on the GPTQ 8 bit model (the original model is fine). After inserting some debugging code, I found that all the logits produced by this GPTQ quant are nan.
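
The debugging code I inserted was roughly along these lines (a hypothetical sketch; find_first_nan is my own helper name, not an AutoGPTQ or transformers API):

import torch

def find_first_nan(model, input_ids):
    # Register forward hooks on every module and record the names of the
    # first few modules whose output contains nan
    hits = []

    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and torch.isnan(out).any():
                hits.append(name)
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    with torch.no_grad():
        model(input_ids=input_ids)
    for h in handles:
        h.remove()
    return hits[:5]

print(find_first_nan(model, tokenized_chat))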

But given that huggingface/transformers#30380 has already been merged into the transformers code base, and I am building from the latest source, does this imply something else is broken between transformers and AutoGPTQ?

And to add to this: when I was working on the Phi-3 support PR, I was on transformers 4.41.0.dev0 and everything worked fine back then: no gibberish inference and no nan logits. I am on 4.41.0.dev0 right now as well (and having trouble with it), and if I run the exact same inference script it now produces nan logits or gibberish depending on the dtype.

@davidgxue
Author

davidgxue commented Apr 30, 2024

So I feel like this may not be related to huggingface/transformers#30380, since it's already merged? Or did something else get added that broke things?

So far we know transformers 4.38.2, 4.40.1, and 4.41.0.dev0 (the current dev version) are broken... just compiling things for reference.

By the way, Phi-3 uses the Llama 2 architecture, so this may still be a llama-family related problem...

@fxmarty
Collaborator

fxmarty commented Apr 30, 2024

Maybe related: huggingface/transformers#27179

"probability tensor contains either inf, nan or element < 0" is a common issue that I have even witnessed with unquantized models. I did not find the root cause though. Maybe some fp16/bf16/autocast behavior.

I'll try to have a look if I get the time.

@davidgxue
Author

davidgxue commented Apr 30, 2024

Yeah, so I looked into the nan logits issue... the thing is, most people get around it by loading in bfloat16, after which the behavior looks normal and correct, and those reports are mostly about base models. In our case the base model is fine, but the quantized model is broken.

Notably, in this scenario both dtypes seem to be broken: loading in float16 gives nan logits, and loading in bfloat16 gives gibberish. vLLM and other text generation engines (based on other people's reports) behave just fine... so yeah, it's definitely something with transformers, and it is very recent and consistent. I have a feeling it may not be related to the older posts about nan logits that we can find... Anyway, I will look into this as well. My knowledge of AutoGPTQ internals isn't as deep, so it will be slower, but I can at least say for sure that all 8 bit quants of llama family models are consistently affected...
