llama3 without GPU or CUDA #152
Comments
Hi, thank you for your work. May I ask which version of transformers you have and how you load the checkpoint? Mine keeps reporting a torch shape error for the checkpoint when running on CPU, because of GQA.
Hmm... I just did what it says in the README.md file, and I installed it with pip.
I've gotten both 8B and 70B (non-chat) running on a CPU. This will probably work for the chat models, but I haven't checked those. You will need at least ~64GB of RAM to run 8B on a CPU, and at least ~320GB of RAM to run 70B. Below is the code to load the model and tokenizer, adapted from https://github.com/tloen/llama-int8/blob/main/example.py. There is a small but crucial difference from tloen's code in what's below.
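For anyone following along, here is a minimal sketch of that loading pattern for the single-shard 8B checkpoint. It is an illustration, not the poster's exact code: the paths, `max_seq_len`, and `max_batch_size` are placeholders, and it assumes the official repo's `ModelArgs`, `Transformer`, and `Tokenizer` with fairscale's model parallelism initialized for a single CPU process. The crucial difference from the GPU path is defaulting to bfloat16 (most fp16 kernels are unimplemented on CPU) and loading with `map_location="cpu"`:

```python
import json
import os
from pathlib import Path

import torch
import torch.distributed as dist
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

from llama.model import ModelArgs, Transformer
from llama.tokenizer import Tokenizer

ckpt_dir = "Meta-Llama-3-8B"        # placeholder checkpoint directory
tokenizer_path = "tokenizer.model"  # placeholder tokenizer path

# Single-process "distributed" setup so fairscale's parallel layers work
# on CPU (the gloo backend does not require a GPU).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
initialize_model_parallel(1)

# The small but crucial difference: default to bfloat16. The GPU path
# uses fp16, but most fp16 kernels are not implemented on CPU.
torch.set_default_dtype(torch.bfloat16)

with open(Path(ckpt_dir) / "params.json") as f:
    params = json.load(f)

tokenizer = Tokenizer(model_path=tokenizer_path)
model_args = ModelArgs(max_seq_len=2048, max_batch_size=1, **params)
if model_args.vocab_size == -1:
    model_args.vocab_size = tokenizer.n_words

model = Transformer(model_args)

# map_location="cpu" keeps every weight off the (absent) GPU;
# strict=False tolerates buffers the checkpoint does not carry.
checkpoint = torch.load(Path(ckpt_dir) / "consolidated.00.pth",
                        map_location="cpu")
model.load_state_dict(checkpoint, strict=False)
model.eval()
```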
I tried creating a CPU-only version of llama3 for a microprocessor. It seems to work, but the latency is very high, and I frequently hit blue screens on Windows. I'm not sure whether this is due to a coding error or a resource issue.
I just modified the code in the following files. To upload them, I changed the extension from .py to .txt; if you want to run this code, rename them back to .py. (A sketch of the typical device-handling change is shown after the attachments.)
generate-cpu.txt
model-cpu.txt
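For readers who want the gist without renaming the attachments, the core of a CPU-only port like this is usually replacing hard-coded `.cuda()` calls and CUDA defaults with an explicit device. A hypothetical sketch of that pattern (the helper name and values below are illustrative, not taken from the attached files):

```python
import torch

def pick_device() -> torch.device:
    # Hypothetical helper: fall back to CPU when no CUDA device exists.
    if torch.cuda.is_available():
        torch.cuda.set_device(0)
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()

# fp16 is effectively GPU-only; on CPU, bfloat16 (or fp32) avoids
# "not implemented for 'Half'" errors, at a cost in memory and speed.
dtype = torch.float16 if device.type == "cuda" else torch.bfloat16

# Inside the generation loop, tensors are created on `device` instead
# of being .cuda()'d, e.g. (illustrative shape and fill value):
tokens = torch.full((1, 2048), 0, dtype=torch.long, device=device)
```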