
llama3 without gpu nor cuda #152

Open
HaShaWB opened this issue Apr 26, 2024 · 4 comments

Comments

HaShaWB commented Apr 26, 2024

I tried creating a CPU-only version of llama3 for a microprocessor. It seems to be working, but the latency is very high, and I frequently encounter blue screen issues on Windows. I'm not sure if this is due to a coding error or a resource issue.

I just modified the code in the following files. To upload them here, I renamed .py -> .txt; if you want to run this code, rename them back.

generate-cpu.txt
model-cpu.txt
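
Roughly, the change replaces the CUDA/NCCL-specific initialization with CPU equivalents, e.g. (a sketch of the general idea, not the exact contents of the attached files):

# Sketch of a CPU-only initialization (an assumed shape of the change,
# not the attached generate-cpu / model-cpu code).
import os
import torch
import torch.distributed

# Single-process defaults so init_process_group works without torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("gloo")  # CPU-capable backend instead of "nccl"

# No torch.cuda.set_device(...) call, and keep default tensors in fp32 on the CPU,
# since half precision is poorly supported for CPU matmuls.
torch.set_default_tensor_type(torch.FloatTensor)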

@Papapapapapaya

Hi, thank you for your work. May I ask what version of transformers you have and how you load the checkpoint? Mine keeps reporting a torch shape error for the checkpoint when using the CPU, because of GQA.

HaShaWB commented May 5, 2024

Hmm... I just did what it says in the README.md file; I installed everything with pip.

@Papapapapapaya

> Hmm... I just did what it says in the README.md file; I installed everything with pip.

Could you please confirm which version you have successfully run: 8B, 70B, or both?

mawilson1234 commented May 14, 2024

I've gotten both 8B and 70B (non-chat) running on a CPU. This will probably work for the chat models, but I haven't checked those. You will need at least ~64GB of RAM to run 8B on a CPU, and at least ~320GB of RAM to run 70B, with max_seq_len and max_batch_size set to relatively small values.
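
As a rough sanity check on those figures (an estimate, not a measurement): the loading code below builds the model in PyTorch's default fp32, so the weights alone cost about 4 bytes per parameter, and one checkpoint shard is held in memory alongside the model while copying.

# Back-of-envelope RAM estimate, assuming fp32 weights (PyTorch's default,
# as in the loading code below).
def fp32_weights_gb(n_params: float) -> float:
    return n_params * 4 / 1e9  # 4 bytes per fp32 parameter

print(fp32_weights_gb(8e9))   # ~32 GB for the 8B weights alone
print(fp32_weights_gb(70e9))  # ~280 GB for the 70B weights alone
# Peak usage is higher: the loop below also keeps a checkpoint shard in memory
# while copying, and the KV cache (sized by max_seq_len and max_batch_size)
# plus activations come on top at generation time.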

Below is the code to load the model and tokenizer, adapted from https://github.com/tloen/llama-int8/blob/main/example.py. There is one small but crucial difference from tloen's code, flagged in a comment.

import json
from pathlib import Path

import torch

import llama

tokenizer_path = '...' # replace with your local path to tokenizer.model
ckpt_dir = '...' # replace with your local path to the directory containing the model
max_seq_len = 4 # replace with whatever max seq len you want
max_batch_size = 1 # replace with whatever max batch size you want

tokenizer = llama.Tokenizer(model_path=tokenizer_path)
checkpoints = sorted(Path(ckpt_dir).glob('*.pth'))

with open(Path(ckpt_dir)/'params.json', 'r') as f:
    params = json.loads(f.read())

model_args = llama.ModelArgs(
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size,
    **params
)

model_args.vocab_size = tokenizer.n_words
model = llama.Transformer(model_args)

# Original copyright by tloen
# https://github.com/tloen/llama-int8/blob/main/example.py
key_to_dim = {
    "w1": 0,
    "w2": -1,
    "w3": 0,
    "wo": -1,
    "wq": 0,
    "wk": 0,
    "wv": 0,
    "output": 0,
    "tok_embeddings": 0, # This MUST be 0 for Llama 3, unlike LLaMA or Llama 2, which use -1
    "ffn_norm": None,
    "attention_norm": None,
    "norm": None,
    "rope": None,
}

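# Copy each checkpoint shard into the full model, concatenating sharded
# parameters along the dimension recorded in key_to_dim (None = not sharded,
# copied once from the first shard).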
for i, ckpt in enumerate(checkpoints):
    checkpoint = torch.load(ckpt, map_location='cpu')
    for parameter_name, parameter in model.named_parameters():
        short_name = parameter_name.split(".")[-2]
        if key_to_dim[short_name] is None and i == 0:
            parameter.data = checkpoint[parameter_name]
        elif key_to_dim[short_name] == 0:
            size = checkpoint[parameter_name].size(0)
            parameter.data[size * i: size * (i + 1), :] = checkpoint[
                parameter_name
            ]
        elif key_to_dim[short_name] == -1:
            size = checkpoint[parameter_name].size(-1)
            parameter.data[:, size * i: size * (i + 1)] = checkpoint[
                parameter_name
            ]
        del checkpoint[parameter_name]
    del checkpoint

model.to('cpu')

generator = llama.Llama(model, tokenizer)

generator is now your (non-Hugging Face) Llama 3 model!
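
A minimal usage sketch (assuming the repo's Llama.text_completion API; note that max_seq_len above must be large enough to hold the prompt plus max_gen_len, so raise it well above the tiny example value of 4):

results = generator.text_completion(
    ["The theory of relativity states that"],  # number of prompts must not exceed max_batch_size
    max_gen_len=32,
    temperature=0.6,
    top_p=0.9,
)
print(results[0]["generation"])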
