01 Mar 13:14

fxmarty

b8b4127

v0.7.1: patch release Latest

Latest

Support loading sharded quantized checkpoints

Sharded checkpoints can now be loaded in the from_quantized method.

Support loading sharded quantized checkpoints. by @LaaZa in #425

Gemma GPTQ quantization

Gemma model can be quantized with AutoGPTQ.

Add support for Gemma models. by @LaaZa in #561

Other changes and fixes

Add back missing import by @fxmarty in #553
Fix bias materialization for Marlin by @fxmarty in #554
Fix shape check marlin by @fxmarty in #557
Explicitely check compute capability in marlin's QLinear by @fxmarty in #567
Compatibility with latest transformers by @fxmarty in #573

Full Changelog: v0.7.0...v0.7.1

Contributors

LaaZa and fxmarty

Assets 2

16 Feb 13:10

fxmarty

v0.7.0

bb4c06f

v0.7.0: Marlin int4*fp16 kernel, AWQ checkpoints loading

Marlin efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoints loading

@efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication, with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when using batching.

This kernel can be used in AutoGPTQ when loading models with the use_marlin=True argument. Using this flag will repack the quantized weights as the Marlin kernel expects a different layout. The repacked weight is then saved locally so as to avoid the need to repack again. Example:

import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.

A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark

Visual tables coming soon.

add marlin kernel by @qwopqwop200 in #514
updated marlin serialization by @rib-2 in #522
Marlin repacking CUDA kernel by @fxmarty in #539
Marlin kernel can be built against any compute capability by @fxmarty in #540

Ability to load AWQ checkpoints in AutoGPTQ

Note: The AWQ checkpoints repacking step is currently slow, and a faster implementation can be implemented.

AWQ's original implementation adopted a serialization format different than the one expected by current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happen to be the same. We allow loading AWQ checkpoints in AutoGPTQ to leverage exllama/exllamav2 kernels that may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).

Example:

import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00,  1.18s/it]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.

Support inference with AWQ models by @fxmarty in #484

Qwen2, LongLLaMA, Deci_lm models support

These models can be quantized with AutoGPTQ.

Add qwen2 by @JustinLin610 in #519
Change deci_lm model type to deci by @LaaZa in #491
Support for LongLLaMA models. by @LaaZa in #442

Other changes and bugfixes

Update version & install instructions by @fxmarty in #485
fix the support of Qwen by @hzhwcmhf in #495
rocm6.0 compatible exllama by @seungrokj in #515
Untie weights for safetensors serialization by @fxmarty in #536
marlin update version 0.1.1 and fix marlin bug by @qwopqwop200 in #524
Use ruff for linting by @fxmarty in #537
Fix wheels build for torch==2.2.0 by @fxmarty in #541
Fix repo owners in workflows by @fxmarty in #542
Disable peft compatibility by @fxmarty in #543
Improve README by @fxmarty in #544
Add ROCm dockerfile by @fxmarty in #545
Make all tests pass by @fxmarty in #546
Fix cuda wheel build workflows by @fxmarty in #547
Use bash in workflows by @fxmarty in #548
Dissociate Windows & Linux CUDA build by @fxmarty in #549* Add more guards on compute capability in Marlin kernel by @fxmarty in #550

New Contributors

@hzhwcmhf made their first contribution in #495
@rib-2 made their first contribution in #522
@seungrokj made their first contribution in #515

Full Changelog: v0.6.0...v0.7.0

Contributors

hzhwcmhf, LaaZa, and 6 other contributors

Assets 2

15 Dec 06:50

fxmarty

v0.6.0

323950b

v0.6.0: Mixtral, StableLM, DeciLM, Yi support, Transformers 4.36 compatibility

What's Changed

Precise PyTorch version by @fxmarty in #421
Fix triton unexpected keyword by @LaaZa in #423
Add support for Yi models. by @LaaZa in #413
Add support for Xverse models. by @LaaZa in #417
Allow fp32 input to GPTQ linear by @fxmarty in #437
Fix typos in tests by @fxmarty in #438
Update _base.py - Remote (.bin) model load fix by @Shades-en in #465
make build successful on Jetson device(L4T) by @mikeshi80 in #470
Add option to disable qigen at build by @fxmarty in #471
Stop trying to convert a list to int in setup.py when trying to retrieve cores_info by @wemoveon2 in #474
Only make_quant on inside_layer_modules. by @LaaZa in #479
Add support for DeciLM models. by @LaaZa in #481
Support for StableLM Epoch models. by @LaaZa in #444
Add support for Mixtral models. by @LaaZa in #480
Fix compatibility with transformers 4.36 by @fxmarty in #483

New Contributors

@Shades-en made their first contribution in #465
@mikeshi80 made their first contribution in #470
@wemoveon2 made their first contribution in #474

Full Changelog: v0.5.1...v0.6.0

Contributors

mikeshi80, LaaZa, and 3 other contributors

Assets 2

09 Nov 14:55

fxmarty

v0.5.1

f4c9681

v0.5.1: Patch release

Mainly fixes Windows support.

What's Changed

Update README and version following 0.5.0 release by @fxmarty in #397
Fix windows support by @fxmarty in #407
Fix quantize method with None mask by @fxmarty in #408
Improve message about buffer size in exllama v1 backend by @fxmarty in #410
Fix windows (no triton) and cpu-only support by @fxmarty in #411
Fix workflows to use pip instead of conda by @fxmarty in #419

Full Changelog: v0.5.0...v0.5.1

Contributors

fxmarty

Assets 2

02 Nov 22:16

fxmarty

v0.5.0

c70d29f

v0.5.0: Exllama v2 GPTQ kernels, RoCm 5.6/5.7 support, many bugfixes

Exllama v2 GPTQ kernel support

The more performant GPTQ kernels from @turboderp's exllamav2 library are now available directly in AutoGPTQ, and are the default backend choice.

A comprehensive benchmark is available here.

exllamav2 integration by @SunMarc in #349

CPU inference support

This is experimental.

Add AutoGPTQ's cpu kernel. by @qwopqwop200 in #245

Loading from safetensors is now the default

Allow using a model with basename model, use_safetensors defaults to True by @fxmarty in #383

Falcon, Mistral support

Add support for Falcon as part of Transformers 4.33.0, including new Falcon 180B by @TheBloke in #326
Add support for Mistral models. by @LaaZa in #362

Other changes and bugfixes

Fix setuptools classifier by @fxmarty in #285
Update install instructions by @fxmarty in #286
Install skip qigen(windows) by @qwopqwop200 in #309
fix model type changed after calling .to() method by @PanQiWei in #310
Update qwen.py for Qwen-VL by @JustinLin610 in #303
fix typo in max_input_length by @SunMarc in #311
Use adapter_name for get_gptq_peft_model with train_mode=True by @alex4321 in #347
Ignore unknown parameters in quantize_config.json by @z80maniac in #335
fix bug(breaking change) remove (zeors -= 1) by @qwopqwop200 in #325
Revert "fix bug(breaking change) remove (zeors -= 1)" by @PanQiWei in #354
import exllama QuantLinear instead of exllamav2's in pack_model by @PanQiWei in #355
Modify qlinear_cuda for tracing the GPTQ model by @vivekkhandelwal1 in #367
Fix QiGen kernel generation by @fxmarty in #379
Improve RoCm support by @fxmarty in #382
PEFT initialization fix by @alex4321 in #361
Pin to accelerate>=0.22 by @fxmarty in #384
Fix overflow in exllama with act-order by @fxmarty in #386
Default to exllama kernel when exllama v2 is disabled by @fxmarty in #387
Error out on exllama_set_max_input_length call without exllama backend by @fxmarty in #389
Add fix for CPU Inference by @vivekkhandelwal1 in #385
Fix dtype issues and add relevant tests by @fxmarty in #393
Patch accelerate to use correct dtype by @fxmarty in #394
Fixed missing cstdint include by @kodai2199 in #388
Update RoCm workflow to build for RoCm 5.7 by @fxmarty in #395
Fix Windows build by @fxmarty in #396

New Contributors

@JustinLin610 made their first contribution in #303
@SunMarc made their first contribution in #311
@alex4321 made their first contribution in #347
@vivekkhandelwal1 made their first contribution in #367
@kodai2199 made their first contribution in #388

Full Changelog: v0.4.2...v0.5.0

Contributors

TheBloke, alex4321, and 10 other contributors

Assets 2

24 Aug 19:05

fxmarty

v0.4.2

ef442d9

v0.4.2: Patch release

Major bugfix: exllama backend with arbitrary input length

This patch release includes a major bugfix to have the exllama backend work with input length > 2048 through a reconfigurable buffer size:

from auto_gptq import exllama_set_max_input_length

...
model = exllama_set_max_input_length(model, 4096)

Expose a function to update exllama max input length by @fxmarty in #281

Exllama kernels support in Windows wheels

This patch tentatively includes the exllama kernels in the wheels for Windows.

Add PyPI build workflow, tentatively fix exllama on windows by @fxmarty in #282

What's Changed

Build wheels on ubuntu 20.04 by @fxmarty in #272
Free disk space for rocm build by @fxmarty in #273
Use focal for RoCm build by @fxmarty in #274
Use conda incubator for rocm build by @fxmarty in #276
Update install instructions by @fxmarty in #275
Use --extra-index-url to resolve dependencies by @fxmarty in #277
Fix python version for rocm build by @fxmarty in #278
Fix powershell in workflow by @fxmarty in #284

Full Changelog: v0.4.1...v0.4.2

Contributors

fxmarty

Assets 22

13 Aug 09:29

PanQiWei

v0.4.1

eea67b7

v0.4.1: Patch Fix

Overview

Fix typo so not only pytorch==2.0.0 but also pytorch>=2.0.0 can be used for llama fused attention.
Patch exllama QuantLinear to avoid modifying the state dict to make the integration with transformers smoother.

Change Log

What's Changed

Patch exllama QuantLinear to avoid modifying the state dict by @fxmarty in #243

Full Changelog: v0.4.0...v0.4.1

Contributors

fxmarty

Assets 22

09 Aug 11:10

PanQiWei

v0.4.0

69cdfe8

v0.4.0

Overview

New platform: support ROCm platform (5.4.2 for now, and will extend to 5.5 and 5.6 as soon as pytorch officially release 2.1.0).
New kernels: support exllama q4 kernels to get at least 1.3x inference speedup.
New quantization strategy: support to specify static_groups=True on quantization which can futher improve quantized model's performance and close the gap of PPL again un-quantized model.
New model: qwen

Full Change Log

What's Changed

Add RoCm support by @fxmarty in #214
Fix revision used to load the quantization config by @fxmarty in #220
[General Quant Linear] Register quant params of general quant linear for friendly post process. by @LeiWang1999 in #226
Add exllama q4 kernel by @fxmarty in #219
Suppprt static groups and fix bug by @qwopqwop200 in #236
support qwen by @qwopqwop200 in #240

New Contributors

@fxmarty made their first contribution in #214
@LeiWang1999 made their first contribution in #226

Full Changelog: v0.3.2...v0.4.0

Contributors

fxmarty, LeiWang1999, and qwopqwop200

Assets 19

26 Jul 11:25

PanQiWei

v0.3.2

a7167b1

v0.3.2: Patch Fix

Overview

Fix CUDA kernel bug that cause desc_act and group_size can't be used together
Improve user experience of manually installation
Improve user experience of loading quantized model
Add perplexity_utils.py to gracefully calculate PPL so that the result can be used to compare with other libraries fairly
Remove save_dir argument from from_quantized model, and now only model_name_or_path argument is supported in this method

Full Change Log

What's Changed

Fix cuda bug by @qwopqwop200 in #202
Fix revision and other huggingface_hub kwargs in .from_quantized() by @TheBloke in #205
Change the install script so it attempts to build the CUDA extension in all cases by @TheBloke in #206
Add a central version number by @TheBloke in #207
Add Safetensors metadata saving, with some values saved to each .safetensor file by @TheBloke in #208
[FEATURE] Implement perplexity metric to compare against llama.cpp by @casperbh96 in #166
Fix error raised when CUDA kernels are not installed by @PanQiWei in #209
Fix build on non-CUDA machines after #206 by @casperbh96 in #212

New Contributors

@casperbh96 made their first contribution in #166

Full Changelog: v0.3.0...v0.3.2

Contributors

TheBloke, casper-hansen, and 2 other contributors

Assets 14

16 Jul 08:11

PanQiWei

v0.3.0

45576f0

v0.3.0

Overview

CUDA kernels improvement: support models whose hidden_size can only divisible by 32/64 instead of 256.
Peft integration: support training and inference using LoRA, AdaLoRA, AdaptionPrompt, etc.
New models: BaiChuan, InternLM.
Other updates: see 'Full Change Log' below for details.

Full Change Log

What's Changed

Pytorch qlinear by @qwopqwop200 in #116
Specify UTF-8 encoding for README.md in setup.py by @EliEron in #132
Support cuda 64dim by @qwopqwop200 in #126
Support 32dim by @qwopqwop200 in #125
Peft integration by @PanQiWei in #102
Support setting inject_fused_attention and inject_fused_mlp to False by @TheBloke in #134
Add transpose operator when replace Conv1d with qlinear_cuda_old by @geekinglcq in #140
Add support for BaiChuan model by @LaaZa in #164
Fix error message by @AngainorDev in #141
Add support for InternLM by @cczhong11 in #189
Fix stale documentation by @MarisaKirisame in #158

New Contributors

@EliEron made their first contribution in #132
@geekinglcq made their first contribution in #140
@AngainorDev made their first contribution in #141
@cczhong11 made their first contribution in #189
@MarisaKirisame made their first contribution in #158

Full Changelog: v0.2.1...v0.3.0

Contributors

TheBloke, MarisaKirisame, and 7 other contributors

Assets 16

Releases: AutoGPTQ/AutoGPTQ

v0.7.1: patch release

Support loading sharded quantized checkpoints

Gemma GPTQ quantization

Other changes and fixes

Contributors

v0.7.0: Marlin int4*fp16 kernel, AWQ checkpoints loading

Marlin efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoints loading

Ability to load AWQ checkpoints in AutoGPTQ

Qwen2, LongLLaMA, Deci_lm models support

Other changes and bugfixes

New Contributors

Contributors

v0.6.0: Mixtral, StableLM, DeciLM, Yi support, Transformers 4.36 compatibility

What's Changed

New Contributors

Contributors

v0.5.1: Patch release

What's Changed

Contributors

v0.5.0: Exllama v2 GPTQ kernels, RoCm 5.6/5.7 support, many bugfixes

Exllama v2 GPTQ kernel support

CPU inference support

Loading from safetensors is now the default

Falcon, Mistral support

Other changes and bugfixes

New Contributors

Contributors

v0.4.2: Patch release

Major bugfix: exllama backend with arbitrary input length

Exllama kernels support in Windows wheels

What's Changed

Contributors

v0.4.1: Patch Fix

Overview

Change Log

What's Changed

Contributors

v0.4.0

Overview

Full Change Log

What's Changed

New Contributors

Contributors

v0.3.2: Patch Fix

Overview

Full Change Log

What's Changed

New Contributors

Contributors

v0.3.0

Overview

Full Change Log

What's Changed

New Contributors

Contributors