Replies: 8 comments 15 replies
-
That's definitely not normal on a 3090 Ti. You should be getting at least 20-30 tokens/s. My first thought is that your AutoGPTQ CUDA extension may not be compiled. Can you please show the full output you see when doing inference, including all messages printed by AutoGPTQ?
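A quick way to verify is to try importing the compiled extension directly. This is just a sketch; the `autogptq_cuda` module name matches the import used later in this thread, but it can differ across auto-gptq versions:

```python
# Sanity check: if the compiled CUDA extension is missing, AutoGPTQ falls back
# to a much slower pure-PyTorch path. Module name as used in this thread; it
# may vary across auto-gptq versions.
import torch

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

try:
    import autogptq_cuda
    print("autogptq_cuda extension found")
except ImportError:
    print("autogptq_cuda extension NOT found -- expect very slow generation")
```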
-
I am seeing something along these lines, on an Ubuntu deep-learning EC2 instance. Some other models are OK, but trying TheBloke/Llama-2-13B-chat-GPTQ, generation seemed slow, and then I noticed the CUDA-extension-not-installed error here. I don't get it for a vanilla Falcon 7B or Llama 7B, for example. I have CUDA toolkit 11.7 installed, but then I noticed nvidia-smi reports CUDA 11.6. Is it possible that this being 11.6 is a problem?
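For what it's worth, nvidia-smi's "CUDA Version" is the newest runtime the driver supports, not the toolkit version you compiled with, so a driver capped at 11.6 can indeed be a problem for extensions built against 11.7. A small sketch to compare the two:

```python
# Compare the CUDA toolkit PyTorch (and hence the extension) was built with
# against the driver's supported CUDA version as reported by nvidia-smi.
import subprocess

import torch

print("torch built with CUDA:", torch.version.cuda)
smi = subprocess.check_output(["nvidia-smi"], text=True)
print(next(line for line in smi.splitlines() if "CUDA Version" in line))
```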
-
GPU memory usage is around 13 GB, and volatile GPU utilization says 100% when I checked while it was generating, so it's a little confusing.
-
```
python3 -c 'import torch ; import auto_gptq ; import autogptq_cuda'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'autogptq_cuda'
```
-
Started another, newer machine that has CUDA up to 12.2 and installed from scratch:

```
Successfully built auto-gptq
Installing collected packages: auto-gptq
Successfully installed auto-gptq-0.3.2+cu118
(transormers) ubuntu@test:~/AutoGPTQ$ python3 -c 'import torch ; import auto_gptq ; import autogptq_cuda'
(transormers) ubuntu@test:~/AutoGPTQ$
```
-
I had to install from scratch by cloning the repo, though; a brand-new pip install didn't work.
-
I'm working on integrating xformers into auto-gptq (and paged attention in the near future), so hopefully speed will improve. Also, the next version should have more pre-built wheels.
-
I went with CUDA 11.8, FYI.
-
I'm using the TheBloke/WizardCoder-15B-1.0-GPTQ model, and the whole model fits on the graphics card (a 3090 Ti with 24 GB, if that matters), but it runs very slowly. A request takes about a minute to process, while the exact same request takes about 30 seconds with the TheBloke/WizardLM-13B-V1.1-GGML model, and the difference between the two models isn't big enough to justify being this slow.
Is there any way to speed up the model?
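In case it helps, here is a minimal sketch of the loading options that tend to matter for GPTQ speed, assuming the auto-gptq ~0.3.x API seen earlier in this thread; whether fused injection is actually implemented for WizardCoder's architecture (GPT-BigCode rather than Llama) is an assumption to verify:

```python
# Sketch of speed-relevant knobs when loading a GPTQ model with auto-gptq
# ~0.3.x. Flag values are suggestions to experiment with, not guaranteed wins.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/WizardCoder-15B-1.0-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,         # TheBloke repos ship .safetensors weights
    use_triton=False,             # True requires the triton kernels
    inject_fused_attention=True,  # fused injection is only implemented for
                                  # some model types (e.g. llama)
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```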