
Cannot run llama3 8b instruct: AssertionError: Fail to convert pytorch model #1522

Open · N3RDIUM opened this issue Apr 30, 2024 · 17 comments

@N3RDIUM commented Apr 30, 2024

Hey there! I'm trying to run llama3-8b-instruct with intel extension for transformers.

Here's my code:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, 
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
    device_map="cpu"
)

messages = [
    {"role": "system", "content": "You are a JSON chatbot who always responds with JSON in the following format: {'message': 'your message here!'}"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

Here's the error:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-04-30 11:46:44 [INFO] cpu device is used.
2024-04-30 11:46:44 [INFO] Applying Weight Only Quantization.
2024-04-30 11:46:44 [INFO] Quantize model by Neural Speed with RTN Algorithm.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']
Loadding the model from HF.
Loading checkpoint shards:  25%|█████████████████████████████████████████████████▊                                                                                                                                                     | 1/4 [00:02<00:07,  2.67s/it]Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llm.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 604, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 182, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Fail to convert pytorch model
@kevinintel (Contributor)

Thanks for reporting it; we will check the issue.

@Zhenzhong1 (Collaborator) commented May 6, 2024

@N3RDIUM Hi, according to the errors:

Loadding the model from HF.
Loading checkpoint shards:  25%|█████████████████████████████████████████████████▊                                                                                                                                                     | 1/4 [00:02<00:07,  2.67s/it]Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llm.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It seems you didn't download the model successfully. Please download the model from HF to local disk and try again.

Just set the model_id to the local path:

model_id = "/home/model/llama3_8b_instruct-chat"
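
For reference, one way to fetch the full snapshot to local disk is huggingface_hub (a minimal sketch, not from this thread; the local_dir value reuses the example path above, and this gated repo needs an authorized HF login):

from huggingface_hub import snapshot_download

# Download all model files (weights, tokenizer, config) to a local directory,
# then pass that directory as model_id.
local_path = snapshot_download(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="/home/model/llama3_8b_instruct-chat",  # example path from this comment
)
print(local_path)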

Another issue is that the variable model.device you use isn't defined:

[screenshot]

@N3RDIUM (Author) commented May 17, 2024

I tried downloading the model again and using the local path as the model ID, but it gives me this error now:

2024-05-17 11:29:11 [INFO] cpu device is used.
2024-05-17 11:29:11 [INFO] Applying Weight Only Quantization.
2024-05-17 11:29:11 [INFO] Quantize model by Neural Speed with RTN Algorithm.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', '/home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/']
Loadding the model from the local path.
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1489, in <module>
    main()
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main
    vocab = load_vocab(vocab_dir, params.n_vocab)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1380, in load_vocab
    raise FileNotFoundError(
FileNotFoundError: Could not find tokenizer.model in /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298 or its parent; if it's in another directory,                 pass the directory as --vocab-dir
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 604, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 182, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Fail to convert pytorch model
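
The FileNotFoundError above suggests its own workaround: point the converter at a directory that does contain tokenizer.model via --vocab-dir. A hedged sketch mirroring the cmd list from the log (the vocab directory here is hypothetical; for Llama 3, tokenizer.model only ships in the repo's original/ folder):

import subprocess

snapshot = "/home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298"

# Re-run the Neural Speed converter by hand, telling it where tokenizer.model lives.
subprocess.run(
    [
        "python", "-m", "neural_speed.convert.convert_llama",
        "--outfile", "runtime_outs/ne_llama_f32.bin",
        "--outtype", "f32",
        "--model_hub", "huggingface",
        "--vocab-dir", "/path/to/dir/with/tokenizer.model",  # hypothetical path
        snapshot,
    ],
    check=True,
)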

@N3RDIUM (Author) commented May 17, 2024

Does this lib support *.pth models? I could go for the original/ dir: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main/original

@Zhenzhong1 (Collaborator) commented May 17, 2024

@N3RDIUM

Hi,

File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main vocab = load_vocab(vocab_dir, params.n_vocab)

The code you provided may be incompatible, which means your ITREX or Neural Speed version is a little bit old: https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py

I ran the code successfully the last time I replied to you~. Please try reinstalling the latest main branch of ITREX and Neural Speed from source~

@N3RDIUM (Author) commented May 17, 2024

Okay, will try. Thanks for the quick reply!

@N3RDIUM (Author) commented May 17, 2024

It's running out of memory on python -m neural_speed.convert.convert_llama --outfile runtime_outs/ne_llama_f16.bin --outtype f16 --model_hub huggingface meta-llama/Meta-Llama-3-8B-Instruct

N3RDIUM closed this as completed May 17, 2024
N3RDIUM reopened this May 17, 2024
@N3RDIUM (Author) commented May 17, 2024

Whoops! Closed it by mistake. Anyway, is there any way to reduce memory usage when loading the model from HF? I tried without itrex and it runs just fine :(

@N3RDIUM (Author) commented May 17, 2024

Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

@Zhenzhong1 (Collaborator) commented May 17, 2024

Hi, @N3RDIUM

reduce memory usage when loading the model from HF? I tried without itrex and it runs just fine :(

Everyone uses the same function to load the model from HF:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
    device_map="cpu"
)

The possible difference is at https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py#L1485

Please set low_cpu_mem_usage=False there before installation. According to my previous tests, it can sometimes reduce virtual memory.
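
If that flag is the standard transformers low_cpu_mem_usage kwarg, here is a minimal sketch of the load call it would affect (the local path is an example):

from transformers import AutoModelForCausalLM

# The converter loads the HF checkpoint roughly like this; per the advice above,
# forcing low_cpu_mem_usage=False sometimes reduced virtual memory in tests.
model = AutoModelForCausalLM.from_pretrained(
    "/home/model/llama3_8b_instruct-chat",  # example local path
    low_cpu_mem_usage=False,
)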

Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

No worries. Just set up a new conda env and reinstall requirements.txt plus ITREX + NS from source. These issues will disappear, I think. I have checked the installation pipeline again using the latest ITREX and NS branches. It works.

Convert: [screenshot]

Quant: [screenshot]

Inference: [screenshots]

Successful installation screenshots (check whether you installed successfully):

ITREX: [screenshot]

NS: [screenshot]

Version: [screenshot]

@N3RDIUM (Author) commented May 17, 2024

I have the same versions as you, yet it gives me the same error: AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

N3RDIUM closed this as completed May 17, 2024
N3RDIUM reopened this May 17, 2024
@N3RDIUM (Author) commented May 17, 2024

Oops, did it again, extremely sorry

@N3RDIUM (Author) commented May 17, 2024

I'm not using conda, just python venv. Does that have something to do with this?

@N3RDIUM (Author) commented May 17, 2024

Here is the error now:

(.venv) .venv ❯ /mnt/code/Code/jarvis/.venv/bin/python /mnt/code/Code/jarvis/llama3.py
_zsh_autosuggest_highlight_reset:3: maximum nested function level reached; increase FUNCNEST?
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-17 13:58:42 [INFO] cpu device is used.
2024-05-17 13:58:42 [INFO] Applying Weight Only Quantization.
2024-05-17 13:58:42 [INFO] Quantize model by Neural Speed with RTN Algorithm.
The model_type: Llama3.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']
Loadding the model from HF.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 19.01it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1526, in <module>
    main()
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1490, in main
    cache_path = Path(tokenizer.vocab_file).parent
                      ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 633, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 205, in init
    convert_model(model_name, fp32_bin, "f32", model_hub=model_hub)
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/__init__.py", line 55, in convert_model
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']' returned non-zero exit status 1.
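
For context on this traceback: the Llama 3 repo ships only tokenizer.json (a fast tokenizer), so AutoTokenizer returns a plain PreTrainedTokenizerFast, which has no vocab_file attribute; that attribute belongs to the slow SentencePiece-backed tokenizers the converter expects. A quick diagnostic sketch (gated repo, so HF access is required):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(type(tok).__name__)          # PreTrainedTokenizerFast, backed by tokenizer.json
print(hasattr(tok, "vocab_file"))  # False -- the attribute convert_llama.py reads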

@N3RDIUM (Author) commented May 17, 2024

Which versions of transformers and pytorch are you on?

@Zhenzhong1 (Collaborator)

AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
This error is probably related to transformers.

Try this:

[screenshot]
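
To compare environments precisely, a small version report helps (a sketch; the package names are the assumed PyPI ones):

from importlib.metadata import version

# Print the installed versions of the packages involved in this thread.
for pkg in ("torch", "transformers", "intel-extension-for-transformers", "neural-speed"):
    try:
        print(pkg, version(pkg))
    except Exception:
        print(pkg, "not installed")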

@Ujjawal-K-Panchal

Facing the same issue with the given Dockerfile.
