[WIP] [WORKING] dbrx (mod) support #625
base: main
Conversation
I don't think norm_1 and 2 or router helps here. The MoE architecture used is very unusual. https://huggingface.co/databricks/dbrx-instruct/discussions/10
@LaaZa still trying to get quant to complete.
Status update: the quantize() stage is currently skipping over the following 3 important layers:
and I am almost out of time today to work on this.
You need to have an index for the experts if you use the modified model. For now, you need to duplicate the mlp lines for each expert from 0-15. Also the correct order is [w1, v1], [w2], e.g. ffn.experts.mlp.0.w1
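For anyone following along, a minimal sketch of building those per-expert module groups (the module paths and the assumption of 16 experts come from the converted checkpoint discussed here and are not confirmed elsewhere):
expert_modules = []
for i in range(16):  # assumption: 16 experts per transformer layer
    # order per the note above: [w1, v1] together, then [w2]
    expert_modules.append([f"ffn.experts.mlp.{i}.w1", f"ffn.experts.mlp.{i}.v1"])
    expert_modules.append([f"ffn.experts.mlp.{i}.w2"])
# expert_modules would then be appended to the model definition's per-layer module list.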
@LaaZa Thanks for the fix. It's now slowly going down the layers. Hopefully this does the trick until the next bug.
@LaaZa Btw, why is quantizing the norm1, norm2, and router layers not helpful? I have too little experience in the model layer code to infer the reason. Thanks. If we exclude the norm1, norm2 layers, are their weights retained in full float in the quantized weights? If that is the case, I might want to test weeding out the Wqkv layer too due to the massive 300+ loss values on almost every layer.
Well, in general I would skip anything with the wrong shape. Normalization modules are usually skipped, and in this model they have the shape [6144], so they are one dimensional and we need both infeatures and outfeatures to be divisible by 32. router.layer has the shape [16, 6144], so the outfeatures are too small.
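To illustrate the shape rule, here is a rough sketch of a filter one could run over a layer's modules before quantization (a minimal illustration of the idea, not AutoGPTQ's actual selection logic):
import torch.nn as nn

def quantizable(module: nn.Module) -> bool:
    # Only 2-D Linear-style weights qualify; norms like LayerNorm([6144]) are 1-D.
    if not isinstance(module, nn.Linear):
        return False
    out_features, in_features = module.weight.shape
    # Both dimensions must be divisible by 32; router.layer is [16, 6144],
    # so its out_features (16) fail this check.
    return in_features % 32 == 0 and out_features % 32 == 0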
Thanks. That makes perfect sense. I hacked the exllama (v1) code which is used in quant to pad to the correct outfeatures. autogptq exllama v2 had padding (but likely broken due to non-assignment #626) so maybe it will work? We shall see. Anyway, testing 4 quant tasks with various layers disabled at the same time to see if we hit the pot at the end of the rainbow.
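The padding hack mentioned above amounts to zero-padding out_features up to the next multiple of 32 before quantizing, along these lines (a rough sketch of the idea, not the actual exllama kernel change):
import torch
import torch.nn.functional as F

def pad_out_features(weight: torch.Tensor, multiple: int = 32) -> torch.Tensor:
    # weight is [out_features, in_features]; e.g. the router's [16, 6144]
    out_features = weight.shape[0]
    pad = (-out_features) % multiple  # rows of zeros needed to reach the next multiple
    if pad == 0:
        return weight
    # F.pad pads the last dim first: (in_left, in_right, out_top, out_bottom)
    return F.pad(weight, (0, 0, 0, pad))  # [16, 6144] -> [32, 6144]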
I wonder if splitting the Wqkv layer would help resolve the problem with the massive loss.
So far quant is stuck on packing. It doesn't error but just takes 1 core at 100% CPU in an apparent loop on something. Going to stop testing until there are more updates on the model.
@fahadh4ilyas Will restart and test quantization based on your new split Wqkv layers. |
Confirmed v2 converted has sane/normal losses for the split q,k,v layers! Waiting it for to finish now to check if inf is ok. |
Update: We have a problem. Packing of the v1, w1, w2 layers is extremely slow:
I have started a second quant with norm_1/2 + router removed. Will test inference on both to validate. The first one with all layers is 70% complete.
Test quant for all layers (including norm_1/2 + router) finished the quantize stage but is again stuck on super slow packing:
I have never quantized such a large model and am unsure whether the massive slowdown in packing is normal. The problem is not quantizing, but packing at the moment.
@LaaZa @fahadh4ilyas @fxmarty @qwopqwop200 The current quant progress of dbrx-base-converted-v2 is hitting a roadblock of ungodly slow packing.
Take a look at #439 if that helps. It's going to be slow though as it happens on the cpu.
@Xu-Chen Please help by testing quantizing using dbrx-base-converted-v2. You only need 1 GPU for the quantize stage, but make sure the CPU is not used by others since the packing stage is pure CPU. My current test script:
import os
max_cpu_threads = "24"  # change this to 1/2 of the number of cores shown by the OS
os.environ["OMP_NUM_THREADS"] = max_cpu_threads
os.environ["OPENBLAS_NUM_THREADS"] = max_cpu_threads
os.environ["MKL_NUM_THREADS"] = max_cpu_threads
os.environ["VECLIB_MAXIMUM_THREADS"] = max_cpu_threads
os.environ["NUMEXPR_NUM_THREADS"] = max_cpu_threads
os.environ["NUMEXPR_MAX_THREADS"] = max_cpu_threads
import numpy as np
import torch
import os
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
logging.basicConfig(
format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
level=logging.INFO,
datefmt="%Y-%m-%d %H:%M:%S",
)
pretrained_model_dir = "/monster/data/model/dbrx-base-converted-v2/"
quantized_model_dir = os.path.join(pretrained_model_dir, "quant")
quantized_model_dir = os.path.join(quantized_model_dir, "4bit-v10")
os.makedirs(quantized_model_dir, exist_ok=True)
print("pretrained_model_dir", pretrained_model_dir)
print("quantized_model_dir", quantized_model_dir)
def get_wikitext2(nsamples, seed, seqlen, model):
    from datasets import load_dataset
    traindata = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
    trainenc = tokenizer("\n\n".join(traindata["text"]), return_tensors="pt")
    testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
    import random
    random.seed(seed)
    np.random.seed(0)
    torch.random.manual_seed(0)
    # sample nsamples random windows of seqlen tokens from the tokenized training text
    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({"input_ids": inp, "attention_mask": attention_mask})
    tokenizer.save_pretrained(quantized_model_dir)
    return traindataset, testenc
traindataset, testenc = get_wikitext2(128, 0, 2048, pretrained_model_dir)
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # desc_act and group size only works on triton
    damp_percent=0.005,
    quant_method="gptq",
    checkpoint_format="gptq",
)
# load un-quantized model, the model will always be force loaded into cpu
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, trust_remote_code=True, device_map="auto")
# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
# with value under torch.LongTensor type.
model.quantize(traindataset, use_triton=False)
# save quantized model using safetensors
model.save_quantized(quantized_model_dir)
print(quantized_model_dir)
Thank you, I will try. The following is the log of dbrx-instruct-converted-v2. Downloading dbrx-base-converted-v2.
Success! I will push the quantized model for you guys to test. It will need about 68GB of VRAM to run, so 1x A100 80GB will do. EDIT: now the issue is that HF has a limit of 50GB per file and the quant tensor file is 67GB. UPDATE: uploading... into 2 split files, with script/shell commands to recombine them: https://huggingface.co/LnL-AI/dbrx-base-converted-v2-4bit-gptq-marlin
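In case the recombine step is unclear, joining the split halves is just binary concatenation of the parts in order; a minimal Python equivalent (the part filenames below are placeholders, check the HF repo for the actual names and commands):
import shutil

# Placeholder filenames; the actual split part names are listed in the repo.
parts = ["model.safetensors.part-aa", "model.safetensors.part-ab"]
with open("model.safetensors", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)  # append each part in order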
Thank you. I use convert_v2.py to convert dbrx-instruct as dbrx-instruct-converted-v2. But the avg loss is small at the beginning and gradually increases, as shown below. Is there any problem?
Yes. The increase in loss as the layers progress is also observed on my end. For now, we don't have much to go on. Calibration may need to be tweaked to optimize for dbrx. Right now generation differs from bfloat16 in my limited testing but is still coherent. However, there may be an EOS problem where it is not stopping. These are minor issues that will surface as more users start to quant and find the quirks.
We need to add a feature to save_quantized into multiple (sharded) files.
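As a rough illustration of what sharded saving involves (a hypothetical sketch, not the save_quantized API): the state dict just needs to be split into chunks under a size budget and each chunk written to its own indexed file.
import torch

def shard_state_dict(state_dict, max_shard_bytes=40 * 1024**3):
    # Greedily group tensors into shards no larger than max_shard_bytes each.
    shards, current, current_size = [], {}, 0
    for name, tensor in state_dict.items():
        size = tensor.numel() * tensor.element_size()
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = tensor
        current_size += size
    if current:
        shards.append(current)
    return shards

# Each shard would then be saved as its own file plus an index json mapping
# tensor names to shard files, similar to how transformers shards checkpoints.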
@LaaZa @fahadh4ilyas @Xu-Chen If you have the vram, please test the 4bit marlin inference with https://huggingface.co/LnL-AI/dbrx-base-converted-v2-4bit-gptq-marlin and let me know if you are getting coherent responses. Note the loading time is quite long.
I think the calibration code is broken for MoE-style dbrx, based on the massive escalation of error loss as we progress through layers 1-40.
@Qubitium My current test script:
Nice! Can you post some of the layer losses from beginning/middle/end? To confirm, you made two changes:
Are you using the above two values? Also use cores / 3 for max threads for max pack performance; 1/2 may still lead to a slowdown.
@LaaZa Already made this shard loading code. I just need to shard on save.
We need to reimplement #364. Saving works but it is obviously outdated, and there was a weird issue where sharded saving with it breaks the fused modules.
Yes, and I used samples from HuggingFaceH4/ultrachat_200k and applied apply_chat_template, while the model is dbrx-instruct-converted-v2 (using convert_v2.py to convert dbrx-instruct to dbrx-instruct-converted-v2). The log has been overwritten by new output, but the final avg loss is less than 2. With 4096 samples, here is some of the log, but it is too slow and runs out of memory.
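For reference, building instruct-style calibration samples from ultrachat_200k roughly looks like this (a minimal sketch; the split name, "messages" column, sample count, and model path are assumptions):
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/dbrx-instruct-converted-v2", trust_remote_code=True)
# assumption: the train_sft split exposes a "messages" column of chat turns
data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(128))

calib = []
for row in data:
    # render the chat turns into a single prompt string, then tokenize
    text = tokenizer.apply_chat_template(row["messages"], tokenize=False)
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    calib.append({"input_ids": enc.input_ids, "attention_mask": enc.attention_mask})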
Two new quants have started based on the calibration fix from @Xu-Chen: Marlin + non-Marlin.
The next step should be to deploy the model using vllm.
Our vllm PR for dbrx-converted-v2 should be ready soon.
Unfortunately dbrx finetuning has failed my internal quality metrics, so I will not be spending time to port dbrx-converted to vllm. But if you plan to do so, there are two paths: 1) reverse the convert_v2.py splits so the existing code works, or 2) use the new inference code but load the weights split 1/tp_size and keep the tp_rank slice (see the sketch below).
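For option 2, the idea is that each tensor-parallel rank keeps only its 1/tp_size slice of each weight instead of the full tensor; a rough sketch of the slicing (the dimension choice is an assumption and depends on whether the weight is column- or row-parallel in vllm):
import torch

def tp_slice(weight: torch.Tensor, tp_rank: int, tp_size: int, dim: int = 0) -> torch.Tensor:
    # Keep only this rank's contiguous 1/tp_size chunk along the parallel dimension.
    assert weight.shape[dim] % tp_size == 0
    chunk = weight.shape[dim] // tp_size
    return weight.narrow(dim, tp_rank * chunk, chunk).contiguous()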
Sorry for the delay, I have been busy.
I never had a look at the implementation; it is probably suboptimal. Either I need to implement a CUDA kernel for packing, or there's a better implementation possible in PyTorch. Let me have a look next week.
Dbrx proper support is finally merged into transformers: huggingface/transformers#29921. The whole quant code path needs to be visited again. Maybe they fixed the two issues that plagued the pre-transformers-merge code: 1) training is not possible due to OOM of the fused layers, 2) quant is not possible/feasible due to the fused layers. I no longer have the time to work on this. If you want to take over this WIP PR, you can fork it or let me know and I will add you to the branch permissions.
Well, #623 is meant to support the unmodified model if possible. I was waiting for a resolution on the transformers implementation to see if it will be possible then. Do you think the modified version of the model is still needed?
Attempting to hack 4bit quant using modified model code from https://huggingface.co/databricks/dbrx-instruct/discussions/10 written by https://huggingface.co/fahadh4ilyas
As the title implies: this is a pure HACK/mod using a converted model with a different layout of weights:
This PR (now working!) requires a converted v2 model, below:
dbrx-base (databricks original)
model: https://huggingface.co/databricks/dbrx-base
dbrx-base-converted v2
model: https://huggingface.co/LnL-AI/dbrx-base-converted-v2
converted-v2 4bit quants:
Quant Script: