
Cannot run llama3 8b instruct: AssertionError: Fail to convert pytorch model #1522

Open · N3RDIUM opened this issue Apr 30, 2024 · 17 comments

@N3RDIUM commented Apr 30, 2024

Hey there! I'm trying to run llama3-8b-instruct with intel extension for transformers.

Here's my code:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, 
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
    device_map="cpu"
)

messages = [
    {"role": "system", "content": "You are a JSON chatbot who always responds with JSON in the following format: {'message': 'your message here!'}"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

Here's the error:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-04-30 11:46:44 [INFO] cpu device is used.
2024-04-30 11:46:44 [INFO] Applying Weight Only Quantization.
2024-04-30 11:46:44 [INFO] Quantize model by Neural Speed with RTN Algorithm.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']
Loadding the model from HF.
Loading checkpoint shards:  25%|█████████████████████████████████████████████████▊                                                                                                                                                     | 1/4 [00:02<00:07,  2.67s/it]Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llm.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 604, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 182, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Fail to convert pytorch model
@kevinintel (Contributor)

Thanks for reporting it; we will check the issue.

@Zhenzhong1 (Collaborator) commented May 6, 2024

@N3RDIUM Hi, according to the errors:

Loadding the model from HF.
Loading checkpoint shards:  25%|█████████████████████████████████████████████████▊                                                                                                                                                     | 1/4 [00:02<00:07,  2.67s/it]Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llm.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It seems you didn't download the model successfully. Please download the model from HF to local disk and try again.

Just set the model_id to the local path:

model_id = "/home/model/llama3_8b_instruct-chat"
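
For reference, one way to fetch the full snapshot to local disk is huggingface_hub (a minimal sketch, not from this thread; the local_dir value reuses the example path above, and this gated repo needs an authorized HF login):

from huggingface_hub import snapshot_download

# Download all model files (weights, tokenizer, config) to a local directory,
# then pass that directory as model_id.
local_path = snapshot_download(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="/home/model/llama3_8b_instruct-chat",  # example path from this comment
)
print(local_path)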

Another issue is that the variable model.device you use isn't defined:

[screenshot]

@N3RDIUM (Author) commented May 17, 2024

I tried downloading the model again and using the local path as the model ID, but it gives me this error now:

2024-05-17 11:29:11 [INFO] cpu device is used.
2024-05-17 11:29:11 [INFO] Applying Weight Only Quantization.
2024-05-17 11:29:11 [INFO] Quantize model by Neural Speed with RTN Algorithm.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', '/home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/']
Loadding the model from the local path.
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1489, in <module>
    main()
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main
    vocab = load_vocab(vocab_dir, params.n_vocab)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1380, in load_vocab
    raise FileNotFoundError(
FileNotFoundError: Could not find tokenizer.model in /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298 or its parent; if it's in another directory,                 pass the directory as --vocab-dir
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 604, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 182, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Fail to convert pytorch model
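
The FileNotFoundError above suggests its own workaround: point the converter at a directory that does contain tokenizer.model via --vocab-dir. A hedged sketch mirroring the cmd list from the log (the vocab directory here is hypothetical; for Llama 3, tokenizer.model only ships in the repo's original/ folder):

import subprocess

snapshot = "/home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298"

# Re-run the Neural Speed converter by hand, telling it where tokenizer.model lives.
subprocess.run(
    [
        "python", "-m", "neural_speed.convert.convert_llama",
        "--outfile", "runtime_outs/ne_llama_f32.bin",
        "--outtype", "f32",
        "--model_hub", "huggingface",
        "--vocab-dir", "/path/to/dir/with/tokenizer.model",  # hypothetical path
        snapshot,
    ],
    check=True,
)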

@N3RDIUM (Author) commented May 17, 2024

Does this lib support *.pth models? I could go for the original/ dir: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main/original

@Zhenzhong1 (Collaborator) commented May 17, 2024

@N3RDIUM

Hi,

File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main vocab = load_vocab(vocab_dir, params.n_vocab)

The code you provided may be incompatible, which means your ITREX or Neural Speed version is a little bit old: https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py

I ran the code successfully the last time I replied to you~. Please try reinstalling the latest main branch of ITREX and Neural Speed from source~

@N3RDIUM (Author) commented May 17, 2024

Okay, will try. Thanks for the quick reply!

@N3RDIUM (Author) commented May 17, 2024

It's running out of memory on python -m neural_speed.convert.convert_llama --outfile runtime_outs/ne_llama_f16.bin --outtype f16 --model_hub huggingface meta-llama/Meta-Llama-3-8B-Instruct

N3RDIUM closed this as completed May 17, 2024
N3RDIUM reopened this May 17, 2024
@N3RDIUM (Author) commented May 17, 2024

Whoops! Closed it by mistake. Anyway, is there any way to reduce memory usage when loading the model from HF? I tried without itrex and it runs just fine :(

@N3RDIUM (Author) commented May 17, 2024

Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

@Zhenzhong1 (Collaborator) commented May 17, 2024

Hi, @N3RDIUM

reduce memory usage when loading the model from HF? I tried without itrex and it runs just fine :(

Everyone uses the same function to load the model from HF:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
    device_map="cpu"
)

The possible difference is at https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py#L1485

Please set low_cpu_mem_usage=False there before installation. According to my previous tests, it can sometimes reduce virtual memory.
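
If that flag is the standard transformers low_cpu_mem_usage kwarg, here is a minimal sketch of the load call it would affect (the local path is an example):

from transformers import AutoModelForCausalLM

# The converter loads the HF checkpoint roughly like this; per the advice above,
# forcing low_cpu_mem_usage=False sometimes reduced virtual memory in tests.
model = AutoModelForCausalLM.from_pretrained(
    "/home/model/llama3_8b_instruct-chat",  # example local path
    low_cpu_mem_usage=False,
)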

Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

No worries. Just set up a new conda env and reinstall requirements.txt plus ITREX + NS from source. These issues will disappear, I think. I have checked the installation pipeline again using the latest ITREX and NS branches. It works.

Convert: [screenshot]

Quant: [screenshot]

Inference: [screenshots]

Successful installation screenshots (check whether you installed successfully):

ITREX: [screenshot]

NS: [screenshot]

Version: [screenshot]

@N3RDIUM (Author) commented May 17, 2024

I have the same versions as you, yet it gives me the same error: AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

N3RDIUM closed this as completed May 17, 2024
N3RDIUM reopened this May 17, 2024
@N3RDIUM (Author) commented May 17, 2024

Oops, did it again, extremely sorry

@N3RDIUM (Author) commented May 17, 2024

I'm not using conda, just python venv. Does that have something to do with this?

@N3RDIUM (Author) commented May 17, 2024

Here is the error now:

(.venv) .venv ❯ /mnt/code/Code/jarvis/.venv/bin/python /mnt/code/Code/jarvis/llama3.py
_zsh_autosuggest_highlight_reset:3: maximum nested function level reached; increase FUNCNEST?
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-17 13:58:42 [INFO] cpu device is used.
2024-05-17 13:58:42 [INFO] Applying Weight Only Quantization.
2024-05-17 13:58:42 [INFO] Quantize model by Neural Speed with RTN Algorithm.
The model_type: Llama3.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']
Loadding the model from HF.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 19.01it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1526, in <module>
    main()
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1490, in main
    cache_path = Path(tokenizer.vocab_file).parent
                      ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 633, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 205, in init
    convert_model(model_name, fp32_bin, "f32", model_hub=model_hub)
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/__init__.py", line 55, in convert_model
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']' returned non-zero exit status 1.
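
For context on this traceback: the Llama 3 repo ships only tokenizer.json (a fast tokenizer), so AutoTokenizer returns a plain PreTrainedTokenizerFast, which has no vocab_file attribute; that attribute belongs to the slow SentencePiece-backed tokenizers the converter expects. A quick diagnostic sketch (gated repo, so HF access is required):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(type(tok).__name__)          # PreTrainedTokenizerFast, backed by tokenizer.json
print(hasattr(tok, "vocab_file"))  # False -- the attribute convert_llama.py reads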

@N3RDIUM (Author) commented May 17, 2024

Which versions of transformers and pytorch are you on?

@Zhenzhong1 (Collaborator)

AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
This error is probably related to transformers.

Try this:

[screenshot]
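
To compare environments precisely, a small version report helps (a sketch; the package names are the assumed PyPI ones):

from importlib.metadata import version

# Print the installed versions of the packages involved in this thread.
for pkg in ("torch", "transformers", "intel-extension-for-transformers", "neural-speed"):
    try:
        print(pkg, version(pkg))
    except Exception:
        print(pkg, "not installed")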

@Ujjawal-K-Panchal

Facing the same issue with the given Dockerfile.
