Error when trying to quantize the JAIS model. #632

Open
Mohammad-Faris opened this issue Apr 4, 2024 · 10 comments
Comments

@Mohammad-Faris

Mohammad-Faris commented Apr 4, 2024

I'm trying to quantize the JAIS model, but I received the message TypeError: JAIS isn't supported yet. This is my code:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "/sdb-disk/LlmsModels/jais-30b-chat-v3"
quantized_model_dir = "/sdb-disk/LlmsModels/jais-30b-chat-v3-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting to False can significantly speed up inference, but perplexity may be slightly worse
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, trust_remote_code=True)

model.quantize(examples)

Is there a way to quantize it? I followed the README's Customize Model section, but that didn't work either.

@LaaZa
Contributor

LaaZa commented Apr 6, 2024

Without modifying AutoGPTQ code you can try this:

import torch, auto_gptq
from transformers import AutoTokenizer
from auto_gptq import BaseQuantizeConfig
from auto_gptq.modeling._base import BaseGPTQForCausalLM
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

# define model
auto_gptq.modeling._base.SUPPORTED_MODELS = ["jais"]

class JAISLMHeadModelGPTQ(BaseGPTQForCausalLM):
    layer_type = "JAISBlock"
    layers_block_name = "transformer.h"
    outside_layer_modules = ["transformer.ln_f", "transformer.relative_pe", "transformer.wte"]
    inside_layer_modules = [
        ["attn.c_attn"],
        ["attn.c_proj"],
        ["mlp.c_fc", "mlp.c_fc2"],
        ["mlp.c_proj"],
    ]
#############


pretrained_model_dir = "/sdb-disk/LlmsModels/jais-30b-chat-v3"
quantized_model_dir = "/sdb-disk/LlmsModels/jais-30b-chat-v3-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, trust_remote_code=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting to False can significantly speed up inference, but perplexity may be slightly worse
)

model = JAISLMHeadModelGPTQ.from_pretrained(pretrained_model_dir, quantize_config, trust_remote_code=True)

model.quantize(examples)

model.save_quantized(quantized_model_dir, use_safetensors=True)

I have not tested this. To run the quantized model, you need the model definition code in that script too. For serious quantization, you should use a proper dataset that better matches the model's training data.
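
For loading the result afterwards, something roughly like this should work with AutoGPTQ's from_quantized (again untested; it reuses the JAISLMHeadModelGPTQ class and the quantized_model_dir path from the script above):

from transformers import AutoTokenizer

# Load the quantized checkpoint back with the same custom class; the class
# definition from the script above must be present in this script as well.
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
model = JAISLMHeadModelGPTQ.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_safetensors=True,
    trust_remote_code=True,
)

inputs = tokenizer("auto-gptq is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))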

@Mohammad-Faris
Author

Thank you @LaaZa for reaching out. I followed your code and it finally started quantizing, but at the final step, "Packing model...", I got this error: AssertionError.

INFO - Quantizing mlp.c_proj in layer 40/40...
2024-04-06 12:34:11 INFO [auto_gptq.quantization.gptq] duration: 6.95334005355835
2024-04-06 12:34:11 INFO [auto_gptq.quantization.gptq] avg loss: 731976.4375
INFO - Packing model...
2024-04-06 12:34:12 INFO [auto_gptq.modeling._utils] Packing model...
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 46
     37 quantize_config = BaseQuantizeConfig(
     38     damp_percent=0.1,
     39     bits=4, # quantize model to 4-bit
     40     group_size=128, # it is recommended to set the value to 128
     41     desc_act=False, # set to False can significantly speed up inference but the perplexity may slightly bad
     42 )
     44 model = JAISLMHeadModelGPTQ.from_pretrained(pretrained_model_dir, quantize_config, torch_dtype=torch.float16, trust_remote_code=True)
---> 46 model.quantize(examples)
     48 model.save_quantized(quantized_model_dir, use_safetensors=True)

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /sdb-disk/AutoGPTQ/auto_gptq/modeling/_base.py:377, in BaseGPTQForCausalLM.quantize(self, examples, batch_size, use_triton, use_cuda_fp16, autotune_warmup_after_quantized, cache_examples_on_gpu)
    374     layer_inputs, layer_outputs = layer_outputs, []  # TODO: is it really OK to cache only the first positional argument?
    375     torch.cuda.empty_cache()
--> 377 pack_model(
    378     model=self.model,
    379     quantizers=quantizers,
    380     bits=self.quantize_config.bits,
    381     group_size=self.quantize_config.group_size,
    382     use_triton=use_triton,
    383     use_cuda_fp16=use_cuda_fp16,
    384     desc_act=self.quantize_config.desc_act,
    385     warmup_triton=autotune_warmup_after_quantized,
    386     force_layer_back_to_cpu=force_layer_back_to_cpu,
    387     use_marlin=self.quantize_config.checkpoint_format == CHECKPOINT_FORMAT.MARLIN,
    388 )
    389 if device_map:
    390     self.model = remove_hook_from_module(self.model, recurse=True)

File /sdb-disk/AutoGPTQ/auto_gptq/modeling/_utils.py:286, in pack_model(model, quantizers, bits, group_size, use_triton, use_cuda_fp16, desc_act, warmup_triton, force_layer_back_to_cpu, use_marlin, use_tritonv2)
    284 layers = find_layers(model)
    285 layers = {n: layers[n] for n in quantizers}
--> 286 make_quant(
    287     model,
    288     quantizers,
    289     bits,
    290     group_size,
    291     use_triton=use_triton,
    292     use_cuda_fp16=use_cuda_fp16,
    293     desc_act=desc_act,
    294     disable_exllama=False,
    295     disable_exllamav2=True,
    296     use_marlin=use_marlin,
    297 )
    298 qlayers = find_layers(model, [QuantLinear])
    300 pbar = tqdm(qlayers.keys(), leave=True)

File /sdb-disk/AutoGPTQ/auto_gptq/modeling/_utils.py:126, in make_quant(module, names, bits, group_size, name, use_triton, use_marlin, disable_exllama, disable_exllamav2, use_qigen, use_cuda_fp16, desc_act, trainable, use_tritonv2)
    119 bias = submodule.bias is not None
    120 if (
    121     (not (desc_act) or group_size == -1)
    122     and not use_triton
    123     and not use_qigen
    124     and not use_tritonv2
    125 ):
--> 126     new_layer = QuantLinear(
    127         bits,
    128         group_size,
    129         in_features,
    130         out_features,
    131         bias,
    132         use_cuda_fp16=use_cuda_fp16,
    133         trainable=trainable,
    134         weight_dtype=submodule.weight.dtype,
    135     )
    136 else:
    137     new_layer = QuantLinear(
    138         bits,
    139         group_size,
   (...)
    144         weight_dtype=submodule.weight.dtype,
    145     )

File /sdb-disk/AutoGPTQ/auto_gptq/nn_modules/qlinear/qlinear_exllama.py:71, in QuantLinear.__init__(self, bits, group_size, infeatures, outfeatures, bias, trainable, **kwargs)
     68 self.trainable = trainable
     69 self.maxq = 2**self.bits - 1
---> 71 assert infeatures % 32 == 0
     72 assert infeatures % self.group_size == 0
     73 assert outfeatures % 32 == 0

AssertionError: 

@LaaZa
Contributor

LaaZa commented Apr 6, 2024

Okay, I can't see the module shapes because the model is not in safetensors format. Try updating to the very latest auto_gptq from git; it should have a fix for the padding, which might resolve the issue for you. However, that avg loss looks really bad, though that might just be your limited examples.

@Mohammad-Faris
Author

Mohammad-Faris commented Apr 6, 2024

@LaaZa I built auto_gptq directly from the source using:

pip install -vvv --no-build-isolation -e .

This is the version of auto_gptq: 0.8.0.dev0, and for transformers: 4.39.3.

These are all the packages installed:

accelerate==0.28.0
aiohttp==3.9.3
aiosignal==1.3.1
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
async-timeout==4.0.3
attrs==23.2.0
**-e git+https://github.com/AutoGPTQ/AutoGPTQ.git@b4b801c6d37cbd210a2f36579fc09d2915b72f22#egg=auto_gptq**
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
bitsandbytes==0.43.0
certifi==2024.2.2
charset-normalizer==3.3.2
coloredlogs==15.0.1
comm @ file:///croot/comm_1671231121260/work
datasets==2.18.0
debugpy @ file:///croot/debugpy_1690905042057/work
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
dill==0.3.8
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.2.0
gekko==1.1.0
huggingface-hub==0.22.2
humanfriendly==10.0
idna==3.6
importlib-metadata @ file:///croot/importlib-metadata_1678997070253/work
ipykernel @ file:///croot/ipykernel_1691121631942/work
ipython @ file:///croot/ipython_1691532092695/work
jedi @ file:///tmp/build/80754af9/jedi_1644315233700/work
Jinja2==3.1.3
jupyter_client @ file:///croot/jupyter_client_1699455897726/work
jupyter_core @ file:///croot/jupyter_core_1698937308754/work
MarkupSafe==2.1.5
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
nest-asyncio @ file:///croot/nest-asyncio_1672387112409/work
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
**optimum==1.16.0**
packaging @ file:///croot/packaging_1693575174725/work
pandas==2.0.3
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
peft==0.10.0
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
pillow==10.3.0
platformdirs @ file:///croot/platformdirs_1692205439124/work
prompt-toolkit @ file:///croot/prompt-toolkit_1672387306916/work
protobuf==5.26.1
psutil @ file:///opt/conda/conda-bld/psutil_1656431268089/work
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
pyarrow==15.0.2
pyarrow-hotfix==0.6
Pygments @ file:///croot/pygments_1684279966437/work
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
pytz==2024.1
PyYAML==6.0.1
pyzmq @ file:///croot/pyzmq_1686601365461/work
regex==2023.12.25
requests==2.31.0
rouge==1.0.1
safetensors==0.4.2
sentencepiece==0.2.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
sympy==1.12
tokenizers==0.15.2
torch==2.2.2
torchvision==0.17.2
tornado @ file:///croot/tornado_1696936946304/work
tqdm==4.66.2
traitlets @ file:///croot/traitlets_1671143879854/work
**transformers==4.39.3**
triton==2.2.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
xxhash==3.4.1
yarl==1.9.4
zipp @ file:///croot/zipp_1672387121353/work

Regarding the average loss: yes, I used only a few examples, which is why it's so bad. However, I don't think this is the main issue. I will switch to real data once I'm sure the quantization is working fine, along the lines of the sketch below.
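
Just a sketch of what I have in mind for the real-data run, reusing the tokenizer from the script above (wikitext is only a placeholder here; an Arabic/English mix that matches JAIS's training data would be a better fit):

from datasets import load_dataset

# Placeholder calibration corpus; swap in data closer to the model's training distribution.
texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"]
texts = [t for t in texts if len(t.split()) > 64][:128]  # keep ~128 non-trivial samples
examples = [tokenizer(t, truncation=True, max_length=2048) for t in texts]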

@LaaZa
Contributor

LaaZa commented Apr 6, 2024

Do you have commit b4b801c?

@Mohammad-Faris
Author

@LaaZa Yes, this snippet is from my qlinear_exllama.py.

[Screenshot of qlinear_exllama.py, 2024-04-06 4:08 PM]

@LaaZa
Contributor

LaaZa commented Apr 6, 2024

Oh, I think that fix only affected outfeatures. Some of the modules do not have infeatures divisible by 32. It seems to be c_fc and c_fc2, which is pretty bad because they are huge and should be quantized.

You can try removing ["mlp.c_fc", "mlp.c_fc2"], but the quantization is going to be limited.
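
To double-check which dimension is tripping those asserts, you could print it from the config before quantizing, something like this (not tested, and the attribute names are just my assumption based on the GPT-2-style config JAIS uses):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(pretrained_model_dir, trust_remote_code=True)
# c_fc / c_fc2 map n_embd -> n_inner and c_proj maps n_inner -> n_embd, so the
# exllama QuantLinear asserts need n_inner divisible by 32 and by group_size.
print("n_embd:", cfg.n_embd, "% 32 =", cfg.n_embd % 32)
print("n_inner:", cfg.n_inner, "% 32 =", cfg.n_inner % 32, ", % 128 =", cfg.n_inner % 128)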

@Mohammad-Faris
Author

Thank you, @LaaZa. I tried removing them, but it's still not working. It works fine when I set this:

class JAISLMHeadModelGPTQ(BaseGPTQForCausalLM):
    layer_type = "JAISBlock"
    layers_block_name = "transformer.h"
    outside_layer_modules = ["transformer.ln_f", "transformer.relative_pe", "transformer.wte"]
    inside_layer_modules = [
        ["attn.c_attn"],
        # ["attn.c_proj"],
        # ["mlp.c_fc", "mlp.c_fc2"],
        # ["mlp.c_proj"],
    ]

However, I still have an issue using it with vLLM. vLLM already supports the JAIS model, and the original model works fine, but when I use the new quantized model, it fails.

This is the code:

from vllm import LLM, SamplingParams
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

user_input = "طفلة عمرها سنة زكمت قبل بخار ومشروبات دافية ولا خفت"

conversation_history = """
    ### Instruction: استخرج 10 اسئله طبيه يمكن ان اسألها للمريض هنا تساعدني في تشخيص حالته اكثر
    ### Input: [|Human|] {user_input} 
    ### Response: [|AI|]
"""

conversation_history = conversation_history.format(user_input=user_input)

prompts = []
prompts.append(conversation_history)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="/sdb-disk/LlmsModels/jais-13b-chat-4bit", trust_remote_code=True, 
          tensor_parallel_size=1, dtype='float16',
          enforce_eager=False, gpu_memory_utilization=0.9, max_model_len=2048, quantization='gptq')

And this is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 22
     17 prompts = [
     18 ].append(conversation_history)
     20 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
---> 22 llm = LLM(model="/sdb-disk/LlmsModels/jais-13b-chat-4bit", trust_remote_code=True, 
     23           tensor_parallel_size=1,
     24           enforce_eager=False
     25           ,gpu_memory_utilization=0.9 ,max_model_len=2048, quantization='gptq')

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/entrypoints/llm.py:112, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, disable_custom_all_reduce, **kwargs)
     93     kwargs["disable_log_stats"] = True
     94 engine_args = EngineArgs(
     95     model=model,
     96     tokenizer=tokenizer,
   (...)
    110     **kwargs,
    111 )
--> 112 self.llm_engine = LLMEngine.from_engine_args(
    113     engine_args, usage_context=UsageContext.LLM_CLASS)
    114 self.request_counter = Counter()

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/engine/llm_engine.py:196, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    193     executor_class = GPUExecutor
    195 # Create the LLM engine.
--> 196 engine = cls(
    197     *engine_configs,
    198     executor_class=executor_class,
    199     log_stats=not engine_args.disable_log_stats,
    200     usage_context=usage_context,
    201 )
    202 return engine

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/engine/llm_engine.py:110, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, lora_config, vision_language_config, executor_class, log_stats, usage_context)
    107 self.detokenizer = Detokenizer(self.tokenizer)
    108 self.seq_counter = Counter()
--> 110 self.model_executor = executor_class(model_config, cache_config,
    111                                      parallel_config, scheduler_config,
    112                                      device_config, lora_config,
    113                                      vision_language_config)
    115 # If usage stat is enabled, collect relevant info.
    116 if is_usage_stats_enabled():

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/executor/gpu_executor.py:37, in GPUExecutor.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, lora_config, vision_language_config)
     34 self.vision_language_config = vision_language_config
     36 # Instantiate the worker and load the model to GPU.
---> 37 self._init_worker()
     39 # Profile the memory usage and initialize the cache.
     40 self._init_cache()

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/executor/gpu_executor.py:66, in GPUExecutor._init_worker(self)
     52 self.driver_worker = Worker(
     53     self.model_config,
     54     self.parallel_config,
   (...)
     63     is_driver_worker=True,
     64 )
     65 self.driver_worker.init_device()
---> 66 self.driver_worker.load_model()

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/worker/worker.py:107, in Worker.load_model(self)
    106 def load_model(self):
--> 107     self.model_runner.load_model()

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/worker/model_runner.py:95, in ModelRunner.load_model(self)
     93 def load_model(self) -> None:
     94     with CudaMemoryProfiler() as m:
---> 95         self.model = get_model(
     96             self.model_config,
     97             self.device_config,
     98             lora_config=self.lora_config,
     99             vision_language_config=self.vision_language_config,
    100             parallel_config=self.parallel_config,
    101             scheduler_config=self.scheduler_config)
    103     self.model_memory_usage = m.consumed_memory
    104     logger.info(f"Loading model weights took "
    105                 f"{self.model_memory_usage / float(2**30):.4f} GB")

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/model_loader.py:91, in get_model(model_config, device_config, **kwargs)
     89 else:
     90     if model_class not in _VISION_MODEL_CLASSES:
---> 91         model = model_class(model_config.hf_config, linear_method)
     92     else:
     93         model = model_class(model_config.hf_config,
     94                             vision_language_config, linear_method)

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/models/jais.py:270, in JAISLMHeadModel.__init__(self, config, linear_method)
    268 self.config = config
    269 self.linear_method = linear_method
--> 270 self.transformer = JAISModel(config, linear_method)
    271 self.lm_head_weight = self.transformer.wte.weight
    272 if hasattr(config, "width_scale"):

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/models/jais.py:230, in JAISModel.__init__(self, config, linear_method)
    228 else:
    229     self.embeddings_scale = config.mup_embeddings_scale
--> 230 self.h = nn.ModuleList([
    231     JAISBlock(config, linear_method)
    232     for _ in range(config.num_hidden_layers)
    233 ])
    234 self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/models/jais.py:231, in <listcomp>(.0)
    228 else:
    229     self.embeddings_scale = config.mup_embeddings_scale
    230 self.h = nn.ModuleList([
--> 231     JAISBlock(config, linear_method)
    232     for _ in range(config.num_hidden_layers)
    233 ])
    234 self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/models/jais.py:183, in JAISBlock.__init__(self, config, linear_method)
    181 self.attn = JAISAttention(config, linear_method)
    182 self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
--> 183 self.mlp = JAISMLP(inner_dim, config, linear_method)

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/models/jais.py:137, in JAISMLP.__init__(self, intermediate_size, config, linear_method)
    135 hidden_size = config.hidden_size
    136 self.swiglu = config.activation_function == "swiglu"
--> 137 self.c_fc = ColumnParallelLinear(
    138     hidden_size,
    139     intermediate_size,
    140     bias=True,
    141     linear_method=linear_method,
    142 )
    143 self.c_fc2 = (ColumnParallelLinear(
    144     hidden_size,
    145     intermediate_size,
    146     bias=True,
    147     linear_method=linear_method,
    148 ) if self.swiglu else None)
    149 self.c_proj = RowParallelLinear(
    150     intermediate_size,
    151     hidden_size,
    152     bias=True,
    153     linear_method=linear_method,
    154 )

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py:181, in ColumnParallelLinear.__init__(self, input_size, output_size, bias, gather_output, skip_bias_add, params_dtype, linear_method)
    179     linear_method = UnquantizedLinearMethod()
    180 self.linear_method = linear_method
--> 181 self.linear_weights = self.linear_method.create_weights(
    182     self.input_size, self.output_size_per_partition, self.input_size,
    183     self.output_size, self.params_dtype)
    184 for name, weight in self.linear_weights.items():
    185     if isinstance(weight, torch.Tensor):

File ~/anaconda3/envs/testquant/lib/python3.8/site-packages/vllm/model_executor/layers/quantization/gptq.py:106, in GPTQLinearMethod.create_weights(***failed resolving arguments***)
    100     raise ValueError(
    101         "The input size is not aligned with the quantized "
    102         "weight shape. This can be caused by too large "
    103         "tensor parallel size.")
    104 if (output_size_per_partition % self.quant_config.pack_factor.numerator
    105         != 0):
--> 106     raise ValueError(
    107         "The output size is not aligned with the quantized "
    108         "weight shape. This can be caused by too large "
    109         "tensor parallel size.")
    111 if self.quant_config.group_size != -1:
    112     group_size = self.quant_config.group_size

ValueError: The output size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.


@LaaZa
Contributor

LaaZa commented Apr 6, 2024

You are now quantizing only a very small portion of the model, making it almost pointless.
The model seems to have a lot of issues, especially with quantization: https://huggingface.co/haouarin/jais-13b-chat-GPTQ-4bits

The vLLM issue happens in their code, so I can't help with that; evidently the model isn't very compatible with quantization.

I have to give up. The model is ultimately quite niche, and the devs have not worked to solve these issues or to get it implemented in transformers yet.

@Mohammad-Faris
Author

Thank you, @LaaZa, for your effort. I really appreciate it.
