Describe the bug
Hi all, I have recently been experimenting with different quantization methods. I came across TheBloke's repo on Hugging Face but couldn't find my model in his quantized list, which is why I decided to quantize it myself.
I also looked at #291.
I used GPTQ to quantize the Llama-2-70B model with the following setup:
Hardware details
4 x A100 (80GB each) = 320GB VRAM
32 CPU cores
240 GB RAM
Software version
CUDA 12.2
pip install auto-gptq==0.7.1 --no-build-isolation
pip install transformers==4.38.2
To Reproduce
Steps to reproduce the behavior:
Use the following config:
    from auto_gptq import BaseQuantizeConfig

    quantize_config = BaseQuantizeConfig(
        bits=4,            # quantize model to 4-bit
        group_size=32,     # also tried 64 and 128
        damp_percent=0.01,
        desc_act=True,     # False can significantly speed up inference, but perplexity may be slightly worse
    )
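For completeness, the full flow I ran is essentially the standard AutoGPTQ recipe (a minimal sketch; the model id, output dir, and the single calibration example below are placeholders, in practice I used a larger calibration set):

    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    pretrained_model_dir = "meta-llama/Llama-2-70b-hf"   # placeholder
    quantized_model_dir = "llama-2-70b-gptq"             # placeholder

    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    # Calibration data: a list of tokenized examples (use a real calibration set)
    examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]

    quantize_config = BaseQuantizeConfig(bits=4, group_size=32, damp_percent=0.01, desc_act=True)

    # Spread the fp16 weights over the 4 A100s at load time
    max_memory = {i: "80GiB" for i in range(4)}
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, max_memory=max_memory)
    model.quantize(examples)
    model.save_quantized(quantized_model_dir, use_safetensors=True)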
When I monitored the process with the top and nvidia-smi commands on Linux:
In the quantization phase, everything seemed normal:
In the beginning, since Llama-2-70B is ~140GB in fp16, each of the 4 GPUs received ~35GB. That seemed correct and normal.
Then, the GPUs were used one after another depending on which layers they held; e.g., when quantization reached the layers on a particular GPU, that GPU's utilization went high.
This took about 4 hours.
But after it finished layer 80, I saw it copy all of the weights back to the CPU: RAM usage climbed slowly to 220GB. This copy from GPU to CPU alone took about 1 hour.
In the end, my process was killed with:
    ...
    INFO - Quantizing mlp.down_proj in layer 80/80...
    You shouldn't move a model that is dispatched using accelerate hooks.
    Killed
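For what it's worth: "You shouldn't move a model that is dispatched using accelerate hooks." appears to be the warning accelerate patches onto .to()/.cuda() when a model has been dispatched with a device_map, and the final "Killed" looks like the Linux OOM killer firing as RAM approached the 240GB limit. A tiny repro of the same warning, using a small model as a stand-in for Llama-2-70B:

    from transformers import AutoModelForCausalLM

    # device_map="auto" makes accelerate attach dispatch hooks to the modules
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto")

    # Calling .to() on a dispatched model logs the warning quoted above
    model.to("cpu")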
Lastly, I would like to understand how AutoAWQ's model-loading approach differs from AutoGPTQ's.
For reference, when I quantized with AutoAWQ, it took about 1 hour and could run on an instance with 90GB RAM + 80GB VRAM.
With GPTQ it is much slower (5 hours compared to 1 hour) and needs roughly 4x the VRAM or RAM.
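For comparison, the AutoAWQ run was essentially the standard AutoAWQ recipe (again a sketch; paths are placeholders):

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "meta-llama/Llama-2-70b-hf"   # placeholder
    quant_path = "llama-2-70b-awq"             # placeholder
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)

    # AutoAWQ moves one decoder layer at a time onto the GPU while quantizing,
    # which is presumably why it fits in ~80GB VRAM + ~90GB RAM
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)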
Were you able to fix this? I am getting the same error.
I am using the example script quant_with_alpaca.py, and it looks like the model gets quantized; it even runs inference with the quantized model as a test. However, the model never gets saved, and I see this line in the output: "You shouldn't move a model that is dispatched using accelerate hooks."
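For anyone else hitting this, one thing that might be worth trying (untested, just a guess based on the warning) is detaching accelerate's dispatch hooks once quantization is done, before saving, so nothing attempts to move a dispatched model:

    # Hypothetical workaround, not verified: strip accelerate's dispatch hooks
    # from the underlying HF model before saving the quantized weights.
    from accelerate.hooks import remove_hook_from_submodules

    remove_hook_from_submodules(model.model)  # AutoGPTQ wraps the HF model as .model
    model.save_quantized("llama-2-70b-gptq", use_safetensors=True)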