Replies: 8 comments 15 replies
-
That's definitely not normal on a 3090 Ti. You should be getting at least 20-30 tokens/s. My first thought is that your AutoGPTQ CUDA extension may not be compiled. Can you please show the full output you see when doing inference, including all messages printed by AutoGPTQ?
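A quick way to verify is to try importing the compiled extension directly. This is just a sketch; the `autogptq_cuda` module name matches the import used later in this thread, but it can differ across auto-gptq versions:

```python
# Sanity check: if the compiled CUDA extension is missing, AutoGPTQ falls back
# to a much slower pure-PyTorch path. Module name as used in this thread; it
# may vary across auto-gptq versions.
import torch

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

try:
    import autogptq_cuda
    print("autogptq_cuda extension found")
except ImportError:
    print("autogptq_cuda extension NOT found -- expect very slow generation")
```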
-
I am seeing something along these lines, on an Ubuntu deep-learning EC2 instance. Some other models are OK, but trying TheBloke/Llama-2-13B-chat-GPTQ, generation seemed slow, and then I noticed the CUDA-extension-not-installed error here. I don't get it for a vanilla Falcon 7B or Llama 7B, for example. I have CUDA toolkit 11.7 installed, but then I noticed nvidia-smi reports CUDA 11.6. Is it possible that this being 11.6 is a problem?
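For what it's worth, nvidia-smi's "CUDA Version" is the newest runtime the driver supports, not the toolkit version you compiled with, so a driver capped at 11.6 can indeed be a problem for extensions built against 11.7. A small sketch to compare the two:

```python
# Compare the CUDA toolkit PyTorch (and hence the extension) was built with
# against the driver's supported CUDA version as reported by nvidia-smi.
import subprocess

import torch

print("torch built with CUDA:", torch.version.cuda)
smi = subprocess.check_output(["nvidia-smi"], text=True)
print(next(line for line in smi.splitlines() if "CUDA Version" in line))
```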
-
GPU memory usage is around 13 GB, and volatile GPU utilization says 100% when I checked while it was generating, so it's a little confusing.
-
```
python3 -c 'import torch ; import auto_gptq ; import autogptq_cuda'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'autogptq_cuda'
```
-
Started another, newer machine that has CUDA up to 12.2 and installed from scratch:

```
Successfully built auto-gptq
Installing collected packages: auto-gptq
Successfully installed auto-gptq-0.3.2+cu118
(transormers) ubuntu@test:~/AutoGPTQ$ python3 -c 'import torch ; import auto_gptq ; import autogptq_cuda'
(transormers) ubuntu@test:~/AutoGPTQ$
```
-
I had to install from scratch by cloning the repo, though; a brand-new pip install didn't work.
-
I'm working on integrating xformers into auto-gptq (and paged attention in the near future), so hopefully speed will improve. Also, the next version should have more pre-built wheels.
-
I went with CUDA 11.8, FYI.
-
I'm using the TheBloke/WizardCoder-15B-1.0-GPTQ model, and the whole model fits on the graphics card (a 3090 Ti with 24 GB, if that matters), but it runs very slowly. A request takes about a minute to process, while the exact same request takes about 30 seconds with the TheBloke/WizardLM-13B-V1.1-GGML model, and the difference between the two models isn't big enough to justify being this slow.
Is there any way to speed up the model?
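In case it helps, here is a minimal sketch of the loading options that tend to matter for GPTQ speed, assuming the auto-gptq ~0.3.x API seen earlier in this thread; whether fused injection is actually implemented for WizardCoder's architecture (GPT-BigCode rather than Llama) is an assumption to verify:

```python
# Sketch of speed-relevant knobs when loading a GPTQ model with auto-gptq
# ~0.3.x. Flag values are suggestions to experiment with, not guaranteed wins.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/WizardCoder-15B-1.0-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,         # TheBloke repos ship .safetensors weights
    use_triton=False,             # True requires the triton kernels
    inject_fused_attention=True,  # fused injection is only implemented for
                                  # some model types (e.g. llama)
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```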