LLM Quantisations for Local Models

Old Quants

| Configuration | Chunk size (weights) | Bits per weight | Scale value | Bias value | Effective size per weight |
|---------------|----------------------|-----------------|-------------|------------|---------------------------|
| q4_0 | 32 | 4 | 32-bit float | N/A | 5 bits |
| q4_1 | 32 | 4 | 32-bit float | 32-bit float | 6 bits |
| q4_2 | 16 | 4 | 16-bit float | N/A | 5 bits |
| q4_3 | 16 | 4 | 16-bit float | 16-bit float | 6 bits |
| q5_0 | 32 | 5 | 16-bit float | N/A | 5.5 bits |
| q5_1 | 32 | 5 | 16-bit float | 16-bit float | 6 bits |
| q8_0 | 32 | 8 | 32-bit float | N/A | 9 bits |
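
To make the "effective size per weight" column concrete: a chunk of 32 four-bit weights plus one 32-bit scale and one 32-bit bias costs 4 + 64/32 = 6 bits per weight, which is where the q4_1 figure comes from. Below is a minimal NumPy sketch of that q4_1-style scheme, as an illustration only; it is not the actual llama.cpp layout, which packs two 4-bit values per byte and uses its own rounding rules.

```python
import numpy as np

def quantize_q4_1_block(block: np.ndarray):
    """Quantize one chunk of 32 weights to 4-bit ints plus a scale and a bias.

    Illustrative sketch of the q4_1 idea only -- not the real llama.cpp kernel.
    """
    assert block.size == 32
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15.0 or 1.0           # 4 bits => 16 quantisation levels
    q = np.clip(np.round((block - lo) / scale), 0, 15).astype(np.uint8)
    return q, np.float32(scale), np.float32(lo)

def dequantize_q4_1_block(q, scale, bias):
    return q.astype(np.float32) * scale + bias

# Storage per chunk: 32 weights * 4 bits + 32-bit scale + 32-bit bias,
# i.e. 4 + (32 + 32) / 32 = 6 bits per weight, matching the q4_1 row.
print(4 + (32 + 32) / 32)

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)
q, s, b = quantize_q4_1_block(w)
print(np.abs(w - dequantize_q4_1_block(q, s, b)).max())   # worst-case error
```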

“With 7B, use Q5_1. With 13B and above, Q4_2 is a great compromise between speed and quality if you don’t want to go with the slower, resource-heavier Q5_1.
Q4_0 is only relevant for compatibility and should be avoided when possible.”
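
Those effective bit counts also give a quick estimate of how much disk space and RAM a quantised model needs, which is what the advice above trades against quality. A rough sketch, assuming round 7B/13B parameter counts and ignoring unquantised tensors and container overhead:

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Very rough model size: parameter count * effective bits per weight.

    Ignores metadata, unquantised tensors (e.g. norms) and file-format
    overhead, so real files come out somewhat larger.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Assumed round parameter counts; real "7B"/"13B" models differ slightly.
print(approx_size_gb(7e9, 6))    # q5_1 at 6 bits/weight  -> ~5.25 GB
print(approx_size_gb(13e9, 5))   # q4_2 at 5 bits/weight  -> ~8.1 GB
```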

K-Quants

https://github.com/ggerganov/llama.cpp/pull/1684

| Model | Measure | F16 | Q2_K | Q3_K_M | Q4_K_S | Q5_K_S | Q6_K |
|-------|---------|-----|------|--------|--------|--------|------|
| 7B | perplexity | 5.9066 | 6.7764 | 6.1503 | 6.0215 | 5.9419 | 5.9110 |
| 7B | file size | 13.0G | 2.67G | 3.06G | 3.56G | 4.33G | 5.15G |
| 7B | ms/tok @ 4 threads (M2 Max) | 116 | 56 | 69 | 50 | 70 | 75 |
| 7B | ms/tok @ 8 threads (M2 Max) | 111 | 36 | 36 | 36 | 44 | 51 |
| 7B | ms/tok @ 4 threads (RTX-4080) | 60 | 15.5 | 17.0 | 15.5 | 16.7 | 18.3 |
| 7B | ms/tok @ 4 threads (Ryzen) | 214 | 57 | 61 | 68 | 81 | 93 |
| 13B | perplexity | 5.2543 | 5.8545 | 5.4498 | 5.3404 | 5.2785 | 5.2568 |
| 13B | file size | 25.0G | 5.13G | 5.88G | 6.80G | 8.36G | 9.95G |
| 13B | ms/tok @ 4 threads (M2 Max) | 216 | 103 | 148 | 95 | 132 | 142 |
| 13B | ms/tok @ 8 threads (M2 Max) | 213 | 67 | 77 | 68 | 81 | 95 |
| 13B | ms/tok @ 4 threads (RTX-4080) | - | 25.3 | 29.3 | 26.2 | 28.6 | 30.0 |
| 13B | ms/tok @ 4 threads (Ryzen) | 414 | 109 | 118 | 130 | 156 | 180 |

Q4_K_S seems to be the best speed/perplexity trade-off.
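
As a cross-check on that conclusion, the snippet below recasts the 7B rows of the table (perplexity, file size, and the M2 Max 4-thread latency) as a single trade-off view: Q4_K_S is the fastest entry in that column and adds only about 2% perplexity over F16 at roughly 27% of the file size. The numbers are copied from the table above; nothing new is measured here.

```python
# 7B numbers copied from the K-quants table (ms/tok = M2 Max, 4 threads).
QUANTS_7B = {
    # name:    (perplexity, size_gb, ms_per_token)
    "F16":     (5.9066, 13.00, 116),
    "Q2_K":    (6.7764,  2.67,  56),
    "Q3_K_M":  (6.1503,  3.06,  69),
    "Q4_K_S":  (6.0215,  3.56,  50),
    "Q5_K_S":  (5.9419,  4.33,  70),
    "Q6_K":    (5.9110,  5.15,  75),
}

F16_PPL = QUANTS_7B["F16"][0]

for name, (ppl, size_gb, ms_tok) in QUANTS_7B.items():
    ppl_penalty = 100 * (ppl - F16_PPL) / F16_PPL   # % worse than F16
    tok_per_s = 1000 / ms_tok                       # implied tokens/second
    print(f"{name:8s} {size_gb:5.2f} GB  +{ppl_penalty:4.2f}% ppl  {tok_per_s:4.1f} tok/s")
```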