## Available types

### Standard IQK types
| Type | Notes |
|---|---|
| IQ2_K | 2-bit; aggressive compression with good quality when using an imatrix |
| IQ2_KS | Slightly smaller than IQ2_K at similar quality |
| IQ2_KL | Larger 2-bit variant for better quality retention |
| IQ3_K | 3-bit; a practical floor for usable inference quality |
| IQ4_K | 4-bit; balanced quality and size |
| IQ4_KS | 4-bit variant optimised for size |
| IQ4_KSS | More aggressive 4-bit compression |
| IQ5_K | 5-bit; close to Q8_0 quality at a significantly smaller size |
| IQ5_KS | 5-bit variant optimised for size |
| IQ6_K | 6-bit; near-lossless, close to Q8_0 |
### R4 variants (row-interleaved)

R4 types pack weights in an interleaved layout that improves CPU memory-access patterns, giving better token-generation throughput on AVX2, Zen4, and ARM NEON.

| Type |
|---|
| IQ2_K_R4 |
| IQ3_K_R4 |
| IQ4_K_R4 |
| IQ4_KS_R4 |
| IQ5_K_R4 |
| IQ5_KS_R4 |
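The interleaving idea can be sketched in plain Python. This is a conceptual illustration of row interleaving, not the project's actual memory layout; the 4-row group and 32-element block size are assumptions for the sketch:

```python
def interleave_rows_r4(rows, block=32):
    """Pack 4 rows so same-index blocks from each row sit adjacent in memory.

    A kernel that produces 4 output rows at once can then read one
    contiguous stream instead of striding across 4 distant rows.
    """
    assert len(rows) == 4 and all(len(r) == len(rows[0]) for r in rows)
    out = []
    for b in range(0, len(rows[0]), block):
        for r in rows:  # block b of row 0, then row 1, 2, 3, then block b+1 ...
            out.extend(r[b:b + block])
    return out

w = [[100 * r + c for c in range(64)] for r in range(4)]
packed = interleave_rows_r4(w)
# packed[0:32]  == block 0 of row 0
# packed[32:64] == block 0 of row 1
```

The same reordering applied at the quantized-block level is what lets the CPU kernels stream weights for several output rows from a single contiguous region.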
Alternatively, pass the `-rtr` (`--run-time-repack`) flag. This repacks non-R4 tensors on load when an interleaved variant is available.
### MXFP4
MXFP4, as used in gpt-oss models, is supported on Zen4, AVX2, ARM NEON, Metal, and CUDA.

## Quantizing a model
### Prepare a BF16 GGUF
Start from a BF16 base model. Quantizing from a higher-precision source gives the best results.
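Assuming the repository's `convert_hf_to_gguf.py` script, a conversion sketch (paths are illustrative):

```shell
# Convert a Hugging Face checkpoint to a BF16 GGUF.
python convert_hf_to_gguf.py /path/to/hf-model \
    --outtype bf16 --outfile model-bf16.gguf
```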
### Generate an imatrix (recommended)
See the imatrix guide for the full command. An imatrix is not required, but it is strongly recommended for quants below Q6_0.

### Custom quantization mixes
Real models are not uniform: attention tensors, embedding layers, and FFN experts often benefit from different quantization levels. Use `--custom-q` to apply per-tensor rules via regular expressions:
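As a sketch (model paths, tensor-name regexes, and the chosen types here are illustrative, not a recommendation), a mix that keeps attention tensors at 5 bits while pushing expert FFN weights lower, with IQ4_KS as the default:

```shell
# Illustrative --custom-q mix; regexes and paths are placeholders.
./llama-quantize --imatrix imatrix.dat \
    --custom-q "attn=iq5_k,ffn_down_exps=iq4_k,ffn_up_exps=iq3_k" \
    model-bf16.gguf model-mix.gguf IQ4_KS
```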
The default type (IQ4_KS above) applies to all tensors not matched by any regex. Rules are evaluated in the order they are listed; the first match wins.
### Dry run
Before running a full quantization, use `--dry-run` to preview which type each tensor will be assigned, without writing any output file:
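For example (paths are illustrative):

```shell
# Print the planned per-tensor type assignments; no output file is written.
./llama-quantize --dry-run --imatrix imatrix.dat \
    model-bf16.gguf model-out.gguf IQ4_KS
```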
## Runtime repacking with `-rtr`
If you have a non-R4 model file but want R4 throughput on CPU, pass `-rtr` when starting the server or CLI:
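A sketch of a server launch (the model path is illustrative; `-rtr` is the repack flag described above):

```shell
# Repack eligible tensors into interleaved R4 layouts at load time.
./llama-server -m model-iq4_k.gguf -rtr
```

Repacking happens once during load, so startup takes slightly longer in exchange for faster token generation.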