
Faster AVX2 prompt processing for k-quants and IQ4_XS #394

Merged: 1 commit into Mozilla-Ocho:main on May 7, 2024

Conversation

@ikawrakow (Contributor) commented on May 3, 2024

As discussed elsewhere, here is a PR that improves AVX2 prompt processing for k-quants and IQ4_XS by a large margin. I did not manage to get the speed gains via tinyBLAS, so I instead added a call in llamafile_sgemm() to a separate function that performs the matrix multiplication.
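
For readers curious about the shape of the change, here is a minimal sketch of the dispatch described above. It is not the actual llamafile source: the signature of llamafile_sgemm() is simplified, and the names iqk_mul_mat(), is_iqk_type(), and tinyblas_sgemm() are illustrative stand-ins for the real entry points.

```cpp
#include <cstdint>

// Quant type tags (subset, for illustration; the real code uses ggml's enum).
enum ggml_type { GGML_TYPE_Q2_K, GGML_TYPE_Q3_K, GGML_TYPE_Q4_K,
                 GGML_TYPE_Q5_K, GGML_TYPE_Q6_K, GGML_TYPE_IQ4_XS };

// Hypothetical: the dedicated k-quant/IQ4_XS GEMM added by this PR.
bool iqk_mul_mat(int64_t m, int64_t n, int64_t k, int Atype, const void *A,
                 const void *B, float *C, int64_t ldc, int ith, int nth);

// Hypothetical: the existing tinyBLAS path.
bool tinyblas_sgemm(int64_t m, int64_t n, int64_t k, const void *A,
                    const void *B, float *C, int64_t ldc, int ith, int nth,
                    int Atype, int Btype);

// True for the quant types that this PR accelerates.
static bool is_iqk_type(int t) {
    switch (t) {
    case GGML_TYPE_Q2_K: case GGML_TYPE_Q3_K: case GGML_TYPE_Q4_K:
    case GGML_TYPE_Q5_K: case GGML_TYPE_Q6_K: case GGML_TYPE_IQ4_XS:
        return true;
    default:
        return false;
    }
}

bool llamafile_sgemm(int64_t m, int64_t n, int64_t k, const void *A,
                     const void *B, float *C, int64_t ldc, int ith, int nth,
                     int Atype, int Btype) {
#if defined(__AVX2__)
    // New in this PR: k-quants and IQ4_XS bypass tinyBLAS and go to the
    // dedicated AVX2 matrix multiplication.
    if (is_iqk_type(Atype))
        return iqk_mul_mat(m, n, k, Atype, A, B, C, ldc, ith, nth);
#endif
    // Everything else keeps using the existing code path.
    return tinyblas_sgemm(m, n, k, A, B, C, ldc, ith, nth, Atype, Btype);
}
```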

The table below compares prompt processing speed on master and with this PR. Since the llama-bench tool is not available here and I did not know of a better way to measure performance, I simply used the perplexity tool to time a batch of 512 tokens and converted that to tokens per second (see the sketch after the table). Tested on a 16-core Ryzen-7950X CPU with a 7B LLaMA model.

| Quants | PP-512 (PR), t/s | PP-512 (master), t/s | Speedup |
|--------|------------------|----------------------|---------|
| Q2_K_S | 157.5 | 111.5 | 1.412 |
| Q3_K_S | 169.8 | 81.3  | 2.089 |
| Q4_K_S | 161.0 | 105.1 | 1.531 |
| Q5_K_S | 146.7 | 72.7  | 2.017 |
| Q6_K   | 171.8 | 74.4  | 2.308 |
| IQ4_XS | 147.6 | 74.1  | 1.992 |
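
In case it helps, a small sketch of the arithmetic behind the table, assuming the perplexity tool reports the wall time per 512-token batch: throughput is tokens divided by time, and the speedup column is the ratio of the two throughputs. The batch times below are back-derived from the Q6_K row for illustration; they are not raw measurements.

```cpp
#include <cstdio>

int main() {
    const double batch_tokens = 512.0;
    const double secs_pr      = 2.980;  // assumed seconds per batch with this PR
    const double secs_master  = 6.882;  // assumed seconds per batch on master
    const double pp_pr     = batch_tokens / secs_pr;      // ~171.8 t/s
    const double pp_master = batch_tokens / secs_master;  // ~74.4 t/s
    std::printf("PP-512 (PR) = %.1f t/s, PP-512 (master) = %.1f t/s, "
                "speedup = %.3f\n", pp_pr, pp_master, pp_pr / pp_master);
    return 0;  // prints speedup = 2.309, matching the table up to rounding
}
```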

For reference, here is what I measure on my system for fp16 and quants not affected by this PR:

| Quants | PP-512 (master), t/s |
|--------|----------------------|
| fp16 | 136.5 |
| Q8_0 | 129.6 |
| Q4_0 | 101.4 |
| Q4_1 | 66.8  |
| Q5_0 | 56.5  |
| Q5_1 | 54.2  |

That is, all k-quants and IQ4_XS are now faster than fp16!

The speedups in this PR are in most cases better than what I reported here, thanks to some additional refinements added since that post, but a few percent lower than what I get in my private llama.cpp fork (Q2_K_S shows the most noticeable difference: I get 178 t/s there). Being new to llamafile, I'm not sure what causes these performance differences for the exact same matrix multiplication implementation.

The same approach also yields huge performance gains for the other i-quants (IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S). But since I have modified these quants in my repository in ways that make them incompatible with mainline llama.cpp i-quants, I am leaving that part for a future PR.

The Ryzen-7950X implements various parts of the AVX512 specification. To make sure that this PR also provides a speedup on CPUs without AVX512, I additionally tested on an older 32-core Ryzen-5975WX. There I get the following performance for fp16 and unaffected quants:

| Quants | PP-512 (master), t/s |
|--------|----------------------|
| fp16 | 108.7 |
| Q8_0 | 135.1 |
| Q4_0 | 118.8 |

For k-quants and IQ4_XS we have:

| Quants | PP-512 (PR), t/s | PP-512 (master), t/s | Speedup |
|--------|------------------|----------------------|---------|
| Q2_K_S | 199.2 | 153.7 | 1.296 |
| Q3_K_S | 191.8 | 115.3 | 1.663 |
| Q4_K_S | 182.2 | 140.6 | 1.296 |
| Q5_K_S | 171.8 | 102.8 | 1.671 |
| Q6_K   | 187.5 | 107.8 | 1.739 |
| IQ4_XS | 191.8 | 110.6 | 1.734 |

@jart (Collaborator) commented on May 7, 2024

This is a remarkable change @ikawrakow. I'm very happy to see that the best quantized formats will now go the fastest. For prompt processing, I'm consistently seeing speedups between 1.2x and 2.0x on x86-64 machines. You even managed to make token generation go faster (which I've found much more difficult), in some cases by as much as 1.33x! Here are my measurements, on three different computers, for three different models.

Iwan Kawrakow's new GEMM function for K-quants

Before: 89c189e
After: ddb9a8c

Prompt evaluation speed (a.k.a. prefill), in tokens per second:

| Model | Quant | Microprocessor | Before | After | Speedup |
|-------|-------|----------------|--------|-------|---------|
| TinyLLaMA 1.1B | Q2_K   | Intel i9-9900   | 204  | 340  | 1.66x |
| TinyLLaMA 1.1B | Q3_K_S | Intel i9-9900   | 160  | 317  | 1.98x |
| TinyLLaMA 1.1B | Q3_K_M | Intel i9-9900   | 174  | 309  | 1.77x |
| TinyLLaMA 1.1B | Q4_0   | Intel i9-9900   | 167  | -    | -     |
| TinyLLaMA 1.1B | Q5_K_M | Intel i9-9900   | 147  | 280  | 1.90x |
| TinyLLaMA 1.1B | Q8_0   | Intel i9-9900   | 219  | -    | -     |
| TinyLLaMA 1.1B | F16    | Intel i9-9900   | 251  | -    | -     |
| TinyLLaMA 1.1B | BF16   | Intel i9-9900   | 222  | -    | -     |
| TinyLLaMA 1.1B | Q2_K   | Intel i9-14900K | 300  | 600  | 2.00x |
| TinyLLaMA 1.1B | Q3_K_S | Intel i9-14900K | 289  | 606  | 2.10x |
| TinyLLaMA 1.1B | Q3_K_M | Intel i9-14900K | 316  | 606  | 1.92x |
| TinyLLaMA 1.1B | Q4_0   | Intel i9-14900K | 418  | -    | -     |
| TinyLLaMA 1.1B | Q5_K_M | Intel i9-14900K | 275  | 570  | 2.07x |
| TinyLLaMA 1.1B | Q8_0   | Intel i9-14900K | 467  | -    | -     |
| TinyLLaMA 1.1B | F16    | Intel i9-14900K | 405  | -    | -     |
| TinyLLaMA 1.1B | BF16   | Intel i9-14900K | 97   | -    | -     |
| TinyLLaMA 1.1B | Q2_K   | Ryzen 7995WX    | 1350 | 1667 | 1.23x |
| TinyLLaMA 1.1B | Q3_K_S | Ryzen 7995WX    | 1181 | 1648 | 1.39x |
| TinyLLaMA 1.1B | Q3_K_M | Ryzen 7995WX    | 1248 | 1636 | 1.31x |
| TinyLLaMA 1.1B | Q4_0   | Ryzen 7995WX    | 1379 | -    | -     |
| TinyLLaMA 1.1B | Q5_K_M | Ryzen 7995WX    | 961  | 1626 | 1.69x |
| TinyLLaMA 1.1B | F16    | Ryzen 7995WX    | 1230 | -    | -     |
| TinyLLaMA 1.1B | BF16   | Ryzen 7995WX    | 1800 | -    | -     |
| LLaMA 3 8B     | Q4_0   | Intel i9-9900   | 27   | -    | -     |
| LLaMA 3 8B     | Q4_K_M | Intel i9-9900   | 28   | 41   | 1.46x |
| LLaMA 3 8B     | Q4_0   | Intel i9-14900K | 62   | -    | -     |
| LLaMA 3 8B     | Q4_K_M | Intel i9-14900K | 57   | 90   | 1.57x |
| LLaMA 3 8B     | F16    | Intel i9-14900K | 59   | -    | -     |
| LLaMA 3 8B     | Q3_K_S | Ryzen 7995WX    | 225  | 416  | 1.84x |
| LLaMA 3 8B     | Q4_0   | Ryzen 7995WX    | 278  | -    | -     |
| LLaMA 3 8B     | Q4_K_S | Ryzen 7995WX    | 188  | 386  | 2.05x |
| LLaMA 3 8B     | F16    | Ryzen 7995WX    | 357  | -    | -     |
| LLaMA 3 8B     | BF16   | Ryzen 7995WX    | 508  | -    | -     |
| LLaMA 3 70B    | Q2_K   | Ryzen 7995WX    | 31   | 51   | 1.65x |
| LLaMA 3 70B    | Q3_K_S | Ryzen 7995WX    | 23   | 44   | 1.91x |
| LLaMA 3 70B    | Q4_0   | Ryzen 7995WX    | 31   | -    | -     |
| LLaMA 3 70B    | F16    | Ryzen 7995WX    | 42   | -    | -     |
| LLaMA 3 70B    | BF16   | Ryzen 7995WX    | 65   | -    | -     |

Text generation speed (a.k.a. prediction), in tokens per second:

| Model | Quant | Microprocessor | Before | After | Speedup |
|-------|-------|----------------|--------|-------|---------|
| TinyLLaMA 1.1B | Q2_K   | Intel i9-9900   | 48  | 57  | 1.18x |
| TinyLLaMA 1.1B | Q3_K_S | Intel i9-9900   | 44  | 50  | 1.13x |
| TinyLLaMA 1.1B | Q3_K_M | Intel i9-9900   | 42  | 47  | 1.11x |
| TinyLLaMA 1.1B | Q4_0   | Intel i9-9900   | 34  | -   | -     |
| TinyLLaMA 1.1B | Q5_K_M | Intel i9-9900   | 32  | 35  | 1.09x |
| TinyLLaMA 1.1B | Q8_0   | Intel i9-9900   | 25  | -   | -     |
| TinyLLaMA 1.1B | F16    | Intel i9-9900   | 15  | -   | -     |
| TinyLLaMA 1.1B | BF16   | Intel i9-9900   | 15  | -   | -     |
| TinyLLaMA 1.1B | Q2_K   | Intel i9-14900K | 102 | 129 | 1.26x |
| TinyLLaMA 1.1B | Q3_K_S | Intel i9-14900K | 99  | 125 | 1.26x |
| TinyLLaMA 1.1B | Q3_K_M | Intel i9-14900K | 96  | 113 | 1.17x |
| TinyLLaMA 1.1B | Q4_0   | Intel i9-14900K | 86  | -   | -     |
| TinyLLaMA 1.1B | Q5_K_M | Intel i9-14900K | 74  | 83  | 1.12x |
| TinyLLaMA 1.1B | Q8_0   | Intel i9-14900K | 64  | -   | -     |
| TinyLLaMA 1.1B | F16    | Intel i9-14900K | 41  | -   | -     |
| TinyLLaMA 1.1B | BF16   | Intel i9-14900K | 68  | -   | -     |
| TinyLLaMA 1.1B | Q2_K   | Ryzen 7995WX    | 129 | 160 | 1.24x |
| TinyLLaMA 1.1B | Q3_K_S | Ryzen 7995WX    | 123 | 158 | 1.28x |
| TinyLLaMA 1.1B | Q3_K_M | Ryzen 7995WX    | 122 | 160 | 1.31x |
| TinyLLaMA 1.1B | Q4_0   | Ryzen 7995WX    | 129 | -   | -     |
| TinyLLaMA 1.1B | Q5_K_M | Ryzen 7995WX    | 109 | 147 | 1.34x |
| TinyLLaMA 1.1B | F16    | Ryzen 7995WX    | 88  | -   | -     |
| TinyLLaMA 1.1B | BF16   | Ryzen 7995WX    | 79  | -   | -     |
| LLaMA 3 8B     | Q4_0   | Intel i9-9900   | 6   | -   | -     |
| LLaMA 3 8B     | Q4_K_M | Intel i9-9900   | 6   | 6   | 1.00x |
| LLaMA 3 8B     | Q4_0   | Intel i9-14900K | 16  | -   | -     |
| LLaMA 3 8B     | Q4_K_M | Intel i9-14900K | 15  | 16  | 1.06x |
| LLaMA 3 8B     | F16    | Intel i9-14900K | 6   | -   | -     |
| LLaMA 3 8B     | Q3_K_S | Ryzen 7995WX    | 34  | 46  | 1.35x |
| LLaMA 3 8B     | Q4_0   | Ryzen 7995WX    | 37  | -   | -     |
| LLaMA 3 8B     | Q4_K_S | Ryzen 7995WX    | 32  | 42  | 1.31x |
| LLaMA 3 8B     | F16    | Ryzen 7995WX    | 19  | -   | -     |
| LLaMA 3 8B     | BF16   | Ryzen 7995WX    | 20  | -   | -     |
| LLaMA 3 70B    | Q2_K   | Ryzen 7995WX    | 6   | 8   | 1.33x |
| LLaMA 3 70B    | Q3_K_S | Ryzen 7995WX    | 6   | 7   | 1.16x |
| LLaMA 3 70B    | Q4_0   | Ryzen 7995WX    | 5   | -   | -     |
| LLaMA 3 70B    | F16    | Ryzen 7995WX    | 2   | -   | -     |
| LLaMA 3 70B    | BF16   | Ryzen 7995WX    | 2   | -   | -     |

@jart (Collaborator) left a review comment

Looks good to me. Once I get a release out, how would you like to announce it to the world? I would like to write a blog post. If you write your own, then I'm happy to tweet that.

@jart merged commit e6532f7 into Mozilla-Ocho:main on May 7, 2024. 1 check passed.
@mrdomino (Contributor) commented on May 7, 2024

[GIF: Bill and Ted going "whoa"]

@stlhood (Collaborator) commented on May 7, 2024

@ikawrakow thank you for this major contribution to the project!

@ikawrakow (Contributor, Author) replied:

> Looks good to me. Once I get a release out, how would you like to announce it to the world? I would like to write a blog post. If you write your own, then I'm happy to tweet that.

I'm not much into blogging, so if you like writing about this, please go ahead.
