LLM Quantisations for Local Models
Old Quants
Configuration | Block Size (weights) | Bits per Weight | Scale Value | Bias (min) Value | Size per Weight |
---|---|---|---|---|---|
q4_0 | 32 | 4 | 32-bit float | N/A | 5 bits |
q4_1 | 32 | 4 | 32-bit float | 32-bit float | 6 bits |
q4_2 | 16 | 4 | 16-bit float | N/A | 5 bits |
q4_3 | 16 | 4 | 16-bit float | 16-bit float | 6 bits |
q5_0 | 32 | 5 | 16-bit float | N/A | 5.5 bits |
q5_1 | 32 | 5 | 16-bit float | 16-bit float | 6 bits |
q8_0 | 32 | 8 | 32-bit float | N/A | 9 bits |
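The last column follows directly from the block layout: each block stores its quantized weights plus one scale (and, for the `_1`/`_3` variants, a bias/min), and that overhead is amortised over the block, e.g. q4_0 costs 4 + 32/32 = 5 bits per weight. Below is a minimal NumPy sketch of a q4_1-style block quantizer and of the bits-per-weight arithmetic; it is illustrative only, not the actual ggml kernels, and the function names are made up.

```python
import numpy as np

def quantize_block_q4_1(block: np.ndarray):
    """q4_1-style block quantization (illustrative sketch, not the ggml kernel):
    32 weights -> 4-bit codes plus one float scale and one float bias (min)."""
    assert block.size == 32
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15.0 or 1.0          # 4 bits -> 16 levels (0..15)
    codes = np.round((block - lo) / scale).astype(np.uint8)
    return codes, np.float32(scale), np.float32(lo)

def dequantize_block_q4_1(codes, scale, bias):
    return codes.astype(np.float32) * scale + bias

def bits_per_weight(block_size, weight_bits, scale_bits, bias_bits=0):
    """Payload bits plus the scale/bias overhead amortised over the block."""
    return weight_bits + (scale_bits + bias_bits) / block_size

# Reproduce the "Size per Weight" column:
print(bits_per_weight(32, 4, 32))      # q4_0 -> 5.0
print(bits_per_weight(32, 4, 32, 32))  # q4_1 -> 6.0
print(bits_per_weight(16, 4, 16))      # q4_2 -> 5.0
print(bits_per_weight(16, 4, 16, 16))  # q4_3 -> 6.0
print(bits_per_weight(32, 5, 16))      # q5_0 -> 5.5
print(bits_per_weight(32, 5, 16, 16))  # q5_1 -> 6.0
print(bits_per_weight(32, 8, 32))      # q8_0 -> 9.0
```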
“With 7B, use Q5_1. With 13B and above, Q4_2 is a great compromise between speed and quality if you don’t want to go with the slower, resource-heavier Q5_1.
Q4_0 is only relevant for compatibility and should be avoided when possible.”
K-Quants
https://github.com/ggerganov/llama.cpp/pull/1684
Model | Measure | F16 | Q2_K | Q3_K_M | Q4_K_S | Q5_K_S | Q6_K |
---|---|---|---|---|---|---|---|
7B | perplexity | 5.9066 | 6.7764 | 6.1503 | 6.0215 | 5.9419 | 5.9110 |
7B | file size | 13.0G | 2.67G | 3.06G | 3.56G | 4.33G | 5.15G |
7B | ms/tok @ 4 threads, M2 Max | 116 | 56 | 69 | 50 | 70 | 75 |
7B | ms/tok @ 8 threads, M2 Max | 111 | 36 | 36 | 36 | 44 | 51 |
7B | ms/tok @ 4 threads, RTX-4080 | 60 | 15.5 | 17.0 | 15.5 | 16.7 | 18.3 |
7B | ms/tok @ 4 threads, Ryzen | 214 | 57 | 61 | 68 | 81 | 93 |
13B | perplexity | 5.2543 | 5.8545 | 5.4498 | 5.3404 | 5.2785 | 5.2568 |
13B | file size | 25.0G | 5.13G | 5.88G | 6.80G | 8.36G | 9.95G |
13B | ms/tok @ 4 threads, M2 Max | 216 | 103 | 148 | 95 | 132 | 142 |
13B | ms/tok @ 8 threads, M2 Max | 213 | 67 | 77 | 68 | 81 | 95 |
13B | ms/tok @ 4 threads, RTX-4080 | - | 25.3 | 29.3 | 26.2 | 28.6 | 30.0 |
13B | ms/tok @ 4 threads, Ryzen | 414 | 109 | 118 | 130 | 156 | 180 |
Q4_K_S looks like the best speed/perplexity trade-off: it stays within roughly 2% of F16 perplexity while matching or beating the smaller quants in ms/token on most of the hardware above.
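To back up that reading, the 7B row can be reduced to relative perplexity increase over F16 plus an effective bits-per-weight figure derived from the file sizes. The snippet below only re-uses the table values; the parameter count (~6.7B for LLaMA 7B) and the treatment of the sizes as GiB are assumptions, and nothing new is measured.

```python
# 7B numbers copied from the k-quants table above.
f16_ppl = 5.9066
quants = {
    #          perplexity  size (GiB)  ms/tok @ 4 threads, M2 Max
    "Q2_K":   (6.7764,     2.67,       56),
    "Q3_K_M": (6.1503,     3.06,       69),
    "Q4_K_S": (6.0215,     3.56,       50),
    "Q5_K_S": (5.9419,     4.33,       70),
    "Q6_K":   (5.9110,     5.15,       75),
}

params = 6.7e9  # rough LLaMA-7B parameter count (assumption)
for name, (ppl, size_gib, ms_tok) in quants.items():
    ppl_delta = 100.0 * (ppl - f16_ppl) / f16_ppl   # % worse than F16
    bpw = size_gib * 1024**3 * 8 / params           # effective bits per weight
    print(f"{name}: +{ppl_delta:.2f}% perplexity, ~{bpw:.2f} bpw, {ms_tok} ms/tok")
```

On those derived numbers, Q4_K_S lands at roughly 4.5 bits per weight with about a 2% perplexity increase, while the cheaper Q2_K/Q3_K_M give up noticeably more perplexity for their size savings.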