Replies: 2 comments 4 replies
-
I don't really get the point here: a Q4 param, for example, can be stored using 4 bits. With this format each param is stored together with its ID, so a model with 7B params needs roughly 33-bit IDs just to address them, which already dwarfs the 4-bit values. Storing it like this also comes with the cost of having to copy data before each matrix multiplication. Yes, you can kind of "decompress" when loading the model into memory, but that defeats the goal of quantizing: making the model fit into a GPU with a low amount of VRAM. Also, in llama.cpp we currently store quantized params in buckets, called blocks. Each block has its own scale, which is what makes Q4_K so effective. See: #397
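To put rough numbers on it (back-of-the-envelope only; the 4.5 bits/param figure assumes a simplified Q4_0-style block of 32 weights plus an fp16 scale, not the exact Q4_K layout):

```cpp
#include <cstdio>

int main() {
    const double n_params = 7e9;

    // Block storage (simplified Q4_0-style layout): 32 weights per block,
    // 4 bits each, plus one fp16 scale per block -> 4.5 bits per param.
    const double bits_per_param_blocks = 4.0 + 16.0 / 32.0;

    // Bucket-of-IDs storage: every param needs an index wide enough to address
    // all 7B params, i.e. ceil(log2(7e9)) = 33 bits, before any value data.
    const double bits_per_param_ids = 33.0;

    std::printf("block format  : %.1f GB\n", n_params * bits_per_param_blocks / 8 / 1e9);
    std::printf("id-list format: %.1f GB\n", n_params * bits_per_param_ids   / 8 / 1e9);
    return 0;
}
```

That's roughly 3.9 GB vs 28.9 GB for the same 7B Q4 model.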
So that means dequantizing Q4 to Q6 or Q8, right? AFAIK dequantizing comes with a performance cost. Most backends implement kernels that do the calculations directly on the quantized values. There's always a trade-off among model size, performance and quality. It's better to check with @ikawrakow, I think.
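For reference, this is roughly what dequantizing a 4-bit block means per element (a simplified block with illustrative names, not ggml's real structs); the fast kernels fuse this unpacking into the dot product instead of materializing the floats first:

```cpp
#include <cstdint>

// Simplified 4-bit block: 32 quantized weights sharing one scale.
// (Illustrative only; the real ggml block types look different.)
struct block_q4 {
    float   scale;
    uint8_t qs[16]; // two 4-bit values packed per byte
};

// Unpack one block into floats. A separate dequantize pass like this costs
// extra time and memory compared to kernels that consume the quantized data
// directly inside the dot product.
static void dequantize_block(const block_q4 & b, float * out) {
    for (int i = 0; i < 16; ++i) {
        const int lo = (b.qs[i] & 0x0F) - 8; // low nibble, centered around 0
        const int hi = (b.qs[i] >>   4) - 8; // high nibble
        out[2*i + 0] = b.scale * lo;
        out[2*i + 1] = b.scale * hi;
    }
}
```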
-
It's not that big of a cost; imho it's just like accessing hash elements. However, memory is the main issue when running models; on my computer, for example, I can't even run big models. With this thing we may run huge models on nearly any computer, since there is no need to "unpack" anything.
-
Hi,
I was thinking about reducing the memory size of the models. I'm terribly sorry if these obvious things are already implemented... :)
What I suggest is: IF the weights are currently stored per parameter (which the file sizes hint at, at least), we can instead use a format with buckets of parameters per value. It has a lot of advantages, which I will explain below. (Be patient, please, if it starts with obvious things :)).
The bucket format is Value:[param_id,..].
for example:
1:[123,3211,12315,1231,12313,546346,456464,1232,..]
7:[45,676575,234234,...]
(etc.; here 2,3,4,5,6 are weight values absent from the model)
This way we save a lot of space and memory.
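To illustrate, a rough sketch of what the storage and the "unpack to dense" step could look like (made-up names in plain C++, not anything that exists in llama.cpp):

```cpp
#include <cstdint>
#include <cstddef>
#include <map>
#include <vector>

// Proposed bucket layout: quantized value -> ids of the params that hold it.
using bucket_model = std::map<uint8_t, std::vector<uint64_t>>;

// Expanding the buckets back into a flat weight array, e.g. before a matmul.
std::vector<uint8_t> to_dense(const bucket_model & m, std::size_t n_params) {
    std::vector<uint8_t> w(n_params, 0);
    for (const auto & [value, ids] : m) {
        for (const uint64_t id : ids) {
            w[id] = value;
        }
    }
    return w;
}
```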
Real neurons are never noise-proof and constantly change their conductivity, making weights non-constant yet stable. We can use this to our advantage by adding a small random value when reading the weights before multiplication, during inference.
This way we can virtually increase the precision during the calculation, upgrading our 4-bit data to, say, 6-bit or 8-bit. We just need to take care that the new, improved value does not cross the borders of the original 4-bit bin.
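A tiny sketch of the jitter idea, assuming a uniform 4-bit grid with a per-block scale (names are made up, just for illustration):

```cpp
#include <cstdint>
#include <random>

// Dequantize one 4-bit level with random jitter, staying inside its bin.
// Level q covers [q - 0.5, q + 0.5) on the quantized grid, so any jitter in
// (-0.5, 0.5) "upgrades" the precision without crossing the 4-bit borders.
float dequantize_with_jitter(uint8_t q, float scale, std::mt19937 & rng) {
    std::uniform_real_distribution<float> jitter(-0.5f, 0.5f);
    const float centered = (float) q - 8.0f; // map 0..15 to -8..7
    return scale * (centered + jitter(rng));
}
```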
Doesn't sound so fascinating yet, does it? Now the final part :).
To do that, before saving the trained quantized model, we can simply sort the parameter ids within the storage buckets in an order reflecting the original unquantized values they had.
So, if before quantization we had (param id : weight value):
1:8, 2:4, 3:2, 4:8, 5:10, 6:5, 7:9, .......
After our QLoRA quantization they were all shrunk to value=1, fitting into bucket 1 (covering values 1-10 in terms of the original model).
Our stored model format in that case would be:
(weight value: [param ids])
1:[3,2,6,4,1,7,5]
Then, at the inference step, we can "dequantize" and "restore" the spread of these parameters across the bucket's value range (1-10) according to each param's position within the bucket.
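A rough sketch of what the "restore the spread" step could look like, assuming a simple uniform spread over the bucket range (made-up names again):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Spread one bucket's params evenly across the bucket's original value range
// [lo, hi], using each param's position in the pre-sorted id list as a proxy
// for its original unquantized value (e.g. bucket 1 -> range 1..10).
void restore_spread(const std::vector<uint64_t> & ids, float lo, float hi,
                    std::vector<float> & weights) {
    const std::size_t n = ids.size();
    for (std::size_t rank = 0; rank < n; ++rank) {
        const float t = n > 1 ? (float) rank / (float) (n - 1) : 0.5f;
        weights[ids[rank]] = lo + t * (hi - lo);
    }
}
```

With the example bucket 1:[3,2,6,4,1,7,5] and range 1-10, the restored values come out as 1, 2.5, 4, 5.5, 7, 8.5, 10.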
Of course it is a loss compared to the original thing, but it is MUCH more precise than the plain quantized thing, even with QLoRA.
And the beauty is that this happens only during the calculation phase of inference, so the model weights in memory still keep their small footprint :).
bonus: not that necessary step 4.
:)