Replies: 2 comments 4 replies
-
I don't really get the point here: a Q4 param, for example, can be stored using 4 bits. With this format each param is stored together with its ID, so a model with 7B params needs roughly 33-bit IDs just to address them, which already dwarfs the 4-bit values. Storing it like this also comes with the cost of having to copy data before each matrix multiplication. Yes, you can kind of "decompress" when loading the model into memory, but that defeats the goal of quantizing: making the model fit into a GPU with a low amount of VRAM. Also, in llama.cpp we currently store quantized params in buckets, called blocks. Each block has its own scale, which is what makes Q4_K so effective. See: #397
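To put rough numbers on it (back-of-the-envelope only; the 4.5 bits/param figure assumes a simplified Q4_0-style block of 32 weights plus an fp16 scale, not the exact Q4_K layout):

```cpp
#include <cstdio>

int main() {
    const double n_params = 7e9;

    // Block storage (simplified Q4_0-style layout): 32 weights per block,
    // 4 bits each, plus one fp16 scale per block -> 4.5 bits per param.
    const double bits_per_param_blocks = 4.0 + 16.0 / 32.0;

    // Bucket-of-IDs storage: every param needs an index wide enough to address
    // all 7B params, i.e. ceil(log2(7e9)) = 33 bits, before any value data.
    const double bits_per_param_ids = 33.0;

    std::printf("block format  : %.1f GB\n", n_params * bits_per_param_blocks / 8 / 1e9);
    std::printf("id-list format: %.1f GB\n", n_params * bits_per_param_ids   / 8 / 1e9);
    return 0;
}
```

That's roughly 3.9 GB vs 28.9 GB for the same 7B Q4 model.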
So that means dequantizing Q4 to Q6 or Q8, right? AFAIK dequantizing comes with a performance cost. Most backends implement kernels that do the calculations directly on the quantized values. There's always a trade-off among model size, performance and quality. It's better to check with @ikawrakow, I think.
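For reference, this is roughly what dequantizing a 4-bit block means per element (a simplified block with illustrative names, not ggml's real structs); the fast kernels fuse this unpacking into the dot product instead of materializing the floats first:

```cpp
#include <cstdint>

// Simplified 4-bit block: 32 quantized weights sharing one scale.
// (Illustrative only; the real ggml block types look different.)
struct block_q4 {
    float   scale;
    uint8_t qs[16]; // two 4-bit values packed per byte
};

// Unpack one block into floats. A separate dequantize pass like this costs
// extra time and memory compared to kernels that consume the quantized data
// directly inside the dot product.
static void dequantize_block(const block_q4 & b, float * out) {
    for (int i = 0; i < 16; ++i) {
        const int lo = (b.qs[i] & 0x0F) - 8; // low nibble, centered around 0
        const int hi = (b.qs[i] >>   4) - 8; // high nibble
        out[2*i + 0] = b.scale * lo;
        out[2*i + 1] = b.scale * hi;
    }
}
```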
-
It's not that big of a cost; imho it's just like accessing hash elements. However, memory is the main issue when running models; on my computer, for example, I can't even run big models. With this thing we may run huge models on nearly any computer, since there is no need to "unpack" anything.
-
Hi,
I was thinking about reducing the memory size of the models. I'm terribly sorry if these obvious things are already implemented... :)
What I suggest is: IF the weights are currently stored per parameter (which the file sizes hint at, at least), we can instead use a format with buckets of parameters per value. It has a lot of advantages, which I will explain below. (Be patient, please, if it starts with obvious things :)).
The bucket format is Value:[param_id,..].
for example:
1:[123,3211,12315,1231,12313,546346,456464,1232,..]
7:[45,676575,234234,...]
(etc.; here 2,3,4,5,6 are weight values absent from the model)
This way we save a lot of space and memory.
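To illustrate, a rough sketch of what the storage and the "unpack to dense" step could look like (made-up names in plain C++, not anything that exists in llama.cpp):

```cpp
#include <cstdint>
#include <cstddef>
#include <map>
#include <vector>

// Proposed bucket layout: quantized value -> ids of the params that hold it.
using bucket_model = std::map<uint8_t, std::vector<uint64_t>>;

// Expanding the buckets back into a flat weight array, e.g. before a matmul.
std::vector<uint8_t> to_dense(const bucket_model & m, std::size_t n_params) {
    std::vector<uint8_t> w(n_params, 0);
    for (const auto & [value, ids] : m) {
        for (const uint64_t id : ids) {
            w[id] = value;
        }
    }
    return w;
}
```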
Real neurons are never noise-proof and constantly change their conductivity, making weights non-constant yet stable. We can use this to our advantage by adding a small random value when reading the weights before multiplication, during inference.
This way we can virtually increase the precision during the calculation, upgrading our 4-bit data to, say, 6-bit or 8-bit. We just need to take care that the new, improved value does not cross the borders of the original 4-bit bin.
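A tiny sketch of the jitter idea, assuming a uniform 4-bit grid with a per-block scale (names are made up, just for illustration):

```cpp
#include <cstdint>
#include <random>

// Dequantize one 4-bit level with random jitter, staying inside its bin.
// Level q covers [q - 0.5, q + 0.5) on the quantized grid, so any jitter in
// (-0.5, 0.5) "upgrades" the precision without crossing the 4-bit borders.
float dequantize_with_jitter(uint8_t q, float scale, std::mt19937 & rng) {
    std::uniform_real_distribution<float> jitter(-0.5f, 0.5f);
    const float centered = (float) q - 8.0f; // map 0..15 to -8..7
    return scale * (centered + jitter(rng));
}
```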
Doesn't sound so fascinating yet, does it? Now the final part :).
To do that, before saving the trained quantized model, we can simply sort the parameter ids within the storage buckets in an order reflecting the original unquantized values they had.
So, if before quantization we had (param id : weight value):
1:8, 2:4, 3:2, 4:8, 5:10, 6:5, 7:9, .......
After our QLoRA quantization they were all shrunk to value=1, fitting into bucket 1 (covering values 1-10 in terms of the original model).
Our stored model format in that case would be:
(weight value: [param ids])
1:[3,2,6,4,1,7,5]
Then, at the inference step, we can "dequantize" and "restore" the spread of these parameters across the bucket's value range (1-10) according to each param's position within the bucket.
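A rough sketch of what the "restore the spread" step could look like, assuming a simple uniform spread over the bucket range (made-up names again):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Spread one bucket's params evenly across the bucket's original value range
// [lo, hi], using each param's position in the pre-sorted id list as a proxy
// for its original unquantized value (e.g. bucket 1 -> range 1..10).
void restore_spread(const std::vector<uint64_t> & ids, float lo, float hi,
                    std::vector<float> & weights) {
    const std::size_t n = ids.size();
    for (std::size_t rank = 0; rank < n; ++rank) {
        const float t = n > 1 ? (float) rank / (float) (n - 1) : 0.5f;
        weights[ids[rank]] = lo + t * (hi - lo);
    }
}
```

With the example bucket 1:[3,2,6,4,1,7,5] and range 1-10, the restored values come out as 1, 2.5, 4, 5.5, 7, 8.5, 10.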
Of course it is a loss compared to the original thing, but it is MUCH more precise than the plain quantized thing, even with QLoRA.
And the beauty is that this happens only during the calculation phase of inference, so the model weights in memory still keep their small footprint :).
bonus: not that necessary step 4.
:)