
llama3 without gpu nor cuda #152

Open
HaShaWB opened this issue Apr 26, 2024 · 4 comments

Comments

HaShaWB commented Apr 26, 2024

I tried creating a CPU-only version of llama3 for a microprocessor. It seems to be working, but the latency is very high, and I frequently encounter blue screen issues on Windows. I'm not sure if this is due to a coding error or a resource issue.

I just modified the code in the following files. To upload them here, I renamed .py -> .txt; if you want to run this code, rename them back.

generate-cpu.txt
model-cpu.txt
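
Roughly, the change replaces the CUDA/NCCL-specific initialization with CPU equivalents, e.g. (a sketch of the general idea, not the exact contents of the attached files):

# Sketch of a CPU-only initialization (an assumed shape of the change,
# not the attached generate-cpu / model-cpu code).
import os
import torch
import torch.distributed

# Single-process defaults so init_process_group works without torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("gloo")  # CPU-capable backend instead of "nccl"

# No torch.cuda.set_device(...) call, and keep default tensors in fp32 on the CPU,
# since half precision is poorly supported for CPU matmuls.
torch.set_default_tensor_type(torch.FloatTensor)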

@Papapapapapaya

Hi, thank you for your work. May I ask what version of transformers you have and how you load the checkpoint? Mine keeps reporting a torch shape error for the checkpoint when using the CPU, because of GQA.

HaShaWB commented May 5, 2024

Hmm... I just did what it says in the README.md file; I installed everything with pip.

@Papapapapapaya

> Hmm... I just did what it says in the README.md file; I installed everything with pip.

Could you please confirm which version you have successfully run: 8B, 70B, or both?

mawilson1234 commented May 14, 2024

I've gotten both 8B and 70B (non-chat) running on a CPU. This will probably work for the chat models, but I haven't checked those. You will need at least ~64GB of RAM to run 8B on a CPU, and at least ~320GB of RAM to run 70B, with max_seq_len and max_batch_size set to relatively small values.
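
As a rough sanity check on those figures (an estimate, not a measurement): the loading code below builds the model in PyTorch's default fp32, so the weights alone cost about 4 bytes per parameter, and one checkpoint shard is held in memory alongside the model while copying.

# Back-of-envelope RAM estimate, assuming fp32 weights (PyTorch's default,
# as in the loading code below).
def fp32_weights_gb(n_params: float) -> float:
    return n_params * 4 / 1e9  # 4 bytes per fp32 parameter

print(fp32_weights_gb(8e9))   # ~32 GB for the 8B weights alone
print(fp32_weights_gb(70e9))  # ~280 GB for the 70B weights alone
# Peak usage is higher: the loop below also keeps a checkpoint shard in memory
# while copying, and the KV cache (sized by max_seq_len and max_batch_size)
# plus activations come on top at generation time.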

Below is the code to load the model and tokenizer, adapted from https://github.com/tloen/llama-int8/blob/main/example.py. There is one small but crucial difference from tloen's code, flagged in a comment.

import json
from pathlib import Path

import torch

import llama

tokenizer_path = '...' # replace with your local path to tokenizer.model
ckpt_dir = '...' # replace with your local path to the directory containing the model
max_seq_len = 4 # replace with whatever max seq len you want
max_batch_size = 1 # replace with whatever max batch size you want

tokenizer = llama.Tokenizer(model_path=tokenizer_path)
checkpoints = sorted(Path(ckpt_dir).glob('*.pth'))

with open(Path(ckpt_dir)/'params.json', 'r') as f:
    params = json.loads(f.read())

model_args = llama.ModelArgs(
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size,
    **params
)

model_args.vocab_size = tokenizer.n_words
model = llama.Transformer(model_args)

# Original copyright by tloen
# https://github.com/tloen/llama-int8/blob/main/example.py
key_to_dim = {
    "w1": 0,
    "w2": -1,
    "w3": 0,
    "wo": -1,
    "wq": 0,
    "wk": 0,
    "wv": 0,
    "output": 0,
    "tok_embeddings": 0, # This MUST be 0 for Llama 3, unlike LLaMA or Llama 2, which use -1
    "ffn_norm": None,
    "attention_norm": None,
    "norm": None,
    "rope": None,
}

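# Copy each checkpoint shard into the full model, concatenating sharded
# parameters along the dimension recorded in key_to_dim (None = not sharded,
# copied once from the first shard).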
for i, ckpt in enumerate(checkpoints):
    checkpoint = torch.load(ckpt, map_location='cpu')
    for parameter_name, parameter in model.named_parameters():
        short_name = parameter_name.split(".")[-2]
        if key_to_dim[short_name] is None and i == 0:
            parameter.data = checkpoint[parameter_name]
        elif key_to_dim[short_name] == 0:
            size = checkpoint[parameter_name].size(0)
            parameter.data[size * i: size * (i + 1), :] = checkpoint[
                parameter_name
            ]
        elif key_to_dim[short_name] == -1:
            size = checkpoint[parameter_name].size(-1)
            parameter.data[:, size * i: size * (i + 1)] = checkpoint[
                parameter_name
            ]
        del checkpoint[parameter_name]
    del checkpoint

model.to('cpu')

generator = llama.Llama(model, tokenizer)

generator is now your (non-Hugging Face) Llama 3 model!
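
A minimal usage sketch (assuming the repo's Llama.text_completion API; note that max_seq_len above must be large enough to hold the prompt plus max_gen_len, so raise it well above the tiny example value of 4):

results = generator.text_completion(
    ["The theory of relativity states that"],  # number of prompts must not exceed max_batch_size
    max_gen_len=32,
    temperature=0.6,
    top_p=0.9,
)
print(results[0]["generation"])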
