Quantization reduces the precision of model weights, shrinking model size and speeding up inference with minimal quality loss. This is essential for running large language models on consumer hardware.
What is Quantization?
Quantization converts high-precision model weights (32-bit or 16-bit floats) to lower-precision formats (2-8 bits). For example:
- Original F32: 26 GB for a 7B model
- F16: 14 GB (50% reduction)
- Q4_K_M: ~4.5 GB (83% reduction)
- Q2_K: ~3 GB (88% reduction)
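The sizes above follow roughly from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming a round 7 billion parameters and decimal gigabytes; the function names are illustrative, and real GGUF files also carry metadata and keep some tensors at higher precision, so actual sizes differ:

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model size in decimal GB: params * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9


def reduction_pct(original_gb: float, quantized_gb: float) -> float:
    """Percentage saved relative to the original size."""
    return 100.0 * (1.0 - quantized_gb / original_gb)


if __name__ == "__main__":
    n = 7e9  # assumed parameter count for a "7B" model
    f32_gb = estimate_size_gb(n, 32)
    for name, bits in [("F32", 32), ("F16", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = estimate_size_gb(n, bits)
        print(f"{name}: {size:.1f} GB ({reduction_pct(f32_gb, size):.0f}% smaller than F32)")
```

With exactly 7e9 parameters the F32 estimate is 28 GB rather than the 26 GB quoted above, which suggests the quoted figure uses a slightly smaller parameter count or binary units.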
Quick Start
The llama-quantize tool converts GGUF models from high precision to quantized formats.
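A minimal sketch of a basic invocation, written as a small Python wrapper so the argument order is explicit. It assumes llama-quantize is on PATH and uses hypothetical file names; the positional order (input, output, type, optional thread count) follows the tool's usage text, so verify it with --help on your build.

```python
import shlex


def quantize_command(src, dst, qtype, threads=None):
    """Build the argv for: llama-quantize <input.gguf> <output.gguf> <type> [nthreads]."""
    cmd = ["llama-quantize", src, dst, qtype]
    if threads is not None:
        cmd.append(str(threads))
    return cmd


# Hypothetical file names, for illustration only.
cmd = quantize_command("model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M", threads=8)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

The command is built as a list rather than a string so it can be passed to subprocess.run without shell quoting issues.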
Quantization Types
llama.cpp supports many quantization methods. Here are the most important ones:
Recommended Quantization Levels
Q4_K_M - Best Balance (Recommended)
Quality: +0.18 ppl @ Llama-3-8B
Speed: Fast inference
Best for: Most users, production deployments, good quality-to-size ratio
Q5_K_M - Higher Quality
Quality: +0.06 ppl @ Llama-3-8B
Speed: Slightly slower than Q4
Best for: When quality is more important than size, users with more RAM
Q8_0 - Near Original
Quality: +0.003 ppl @ Llama-3-8B
Speed: Moderate
Best for: When maximum quality is needed, enough RAM available
Q2_K / Q3_K - Smallest Size
Q3_K_M Size: ~3.7 GB for 7B model (+0.7 ppl)
Speed: Very fast
Best for: Very limited RAM, mobile devices, when size is critical
Complete Quantization List
For reference, here’s the complete list of supported quantization types:

| Type | Bits/Weight | Size (7B) | Perplexity | Description |
|---|---|---|---|---|
| IQ1_S | 1.56 | ~1.5 GB | - | Experimental 1-bit |
| IQ1_M | 1.75 | ~1.7 GB | - | 1-bit variant |
| IQ2_XXS | 2.06 | ~2.0 GB | - | Ultra-compressed |
| IQ2_XS | 2.31 | ~2.2 GB | - | 2-bit extra-small |
| IQ2_S | 2.50 | ~2.4 GB | - | 2-bit small |
| IQ2_M | 2.70 | ~2.6 GB | - | 2-bit medium |
| Q2_K | 2.96 | ~2.8 GB | +3.52 | 2-bit k-quant |
| Q2_K_S | 2.96 | ~2.8 GB | +3.18 | 2-bit k-quant small |
| IQ3_XXS | 3.06 | ~2.9 GB | - | 3-bit ultra-small |
| IQ3_XS | 3.30 | ~3.1 GB | - | 3-bit extra-small |
| IQ3_S | 3.44 | ~3.2 GB | - | 3-bit small |
| Q3_K_S | 3.41 | ~3.2 GB | +1.63 | 3-bit k-quant small |
| IQ3_M | 3.66 | ~3.5 GB | - | 3-bit medium mix |
| Q3_K_M | 3.74 | ~3.5 GB | +0.66 | 3-bit balanced |
| Q3_K_L | 4.03 | ~3.8 GB | +0.56 | 3-bit large |
| IQ4_XS | 4.25 | ~4.0 GB | - | 4-bit extra-small |
| Q4_0 | 4.34 | ~4.1 GB | +0.47 | Legacy 4-bit |
| IQ4_NL | 4.50 | ~4.3 GB | - | 4-bit non-linear |
| Q4_1 | 4.78 | ~4.5 GB | +0.45 | Legacy 4-bit variant |
| Q4_K_S | 4.37 | ~4.1 GB | +0.27 | 4-bit k-quant small |
| Q4_K_M | 4.58 | ~4.3 GB | +0.18 | 4-bit balanced ⭐ |
| Q5_0 | 5.21 | ~4.9 GB | +0.13 | Legacy 5-bit |
| Q5_1 | 5.65 | ~5.3 GB | +0.11 | Legacy 5-bit variant |
| Q5_K_S | 5.21 | ~4.9 GB | +0.10 | 5-bit k-quant small |
| Q5_K_M | 5.33 | ~5.0 GB | +0.06 | 5-bit balanced |
| Q6_K | 6.14 | ~5.8 GB | +0.02 | 6-bit k-quant |
| Q8_0 | 8.50 | ~8.0 GB | +0.003 | 8-bit quantization |
| F16 | 16.00 | ~14 GB | +0.002 | Half precision |
| BF16 | 16.00 | ~14 GB | -0.005 | BFloat16 |
| F32 | 32.00 | ~26 GB | baseline | Full precision |
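
One way to read the table is to pick the type with the smallest perplexity increase that still fits a size budget. A minimal sketch using a few rows copied from the table above; the function name and the selection rule are illustrative assumptions, not part of llama.cpp:

```python
# (type, approx. size in GB for a 7B model, perplexity increase), copied from the table
ROWS = [
    ("Q2_K", 2.8, 3.52),
    ("Q3_K_M", 3.5, 0.66),
    ("Q4_K_M", 4.3, 0.18),
    ("Q5_K_M", 5.0, 0.06),
    ("Q6_K", 5.8, 0.02),
    ("Q8_0", 8.0, 0.003),
]


def best_fit(budget_gb):
    """Return the lowest-perplexity type whose 7B size fits the budget, or None."""
    fitting = [row for row in ROWS if row[1] <= budget_gb]
    if not fitting:
        return None
    return min(fitting, key=lambda row: row[2])[0]


print(best_fit(4.5))  # Q4_K_M
print(best_fit(2.0))  # None
```

The sizes only apply to a 7B model; for other models the rows would need to be rescaled by parameter count.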
Advanced Quantization
Using Importance Matrix (imatrix)
Importance matrix quantization uses statistical data from real prompts to minimize quality loss.
Generate Importance Matrix
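A minimal sketch of the two commands involved, assuming llama-imatrix and llama-quantize are on PATH. The file names (calibration.txt, imatrix.dat, and the model files) are hypothetical; the -m, -f, and -o flags on llama-imatrix and the --imatrix flag on llama-quantize are taken from the tools' usage text, so verify them with --help on your build.

```python
import shlex


def imatrix_command(model, calibration_text, out_file):
    """Collect importance statistics by running the model over calibration text."""
    return ["llama-imatrix", "-m", model, "-f", calibration_text, "-o", out_file]


def quantize_with_imatrix(src, dst, qtype, imatrix_file):
    """Quantize using a previously generated importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix_file, src, dst, qtype]


# Hypothetical file names, for illustration only.
steps = [
    imatrix_command("model-f16.gguf", "calibration.txt", "imatrix.dat"),
    quantize_with_imatrix("model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M", "imatrix.dat"),
]
for step in steps:
    print(shlex.join(step))
```

The importance matrix is computed from the high-precision model, then reused for any number of quantization types.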
Advanced Options
Selective Tensor Quantization
Output Tensor Control
Token Embedding Control
Pure Quantization
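By default, llama-quantize applies k-quant mixtures, so not every tensor ends up with the same type. --pure disables this and quantizes all tensors to the same type.
Requantization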
You can requantize an already-quantized model by passing --allow-requantize, though quality loss accumulates.
Complete Workflow Example
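A complete workflow runs from a raw model to an optimized GGUF: convert the model to GGUF, quantize it with llama-quantize (optionally with an importance matrix), and evaluate the result.
Memory and Disk Requirements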
Quantization requires enough memory and disk space for both input and output files:

| Model Size | F16 Input | Q4_K_M Output | RAM Needed | Time (approx) |
|---|---|---|---|---|
| 1B | 2 GB | 0.7 GB | 4 GB | <1 min |
| 7B | 14 GB | 4.5 GB | 16 GB | 2-5 min |
| 13B | 26 GB | 8 GB | 32 GB | 5-10 min |
| 34B | 68 GB | 21 GB | 80 GB | 15-30 min |
| 70B | 140 GB | 43 GB | 160 GB | 30-60 min |
| 405B | 810 GB | 249 GB | 1 TB | 2-4 hours |
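
Before starting a long run it can help to confirm that the output file will fit on disk. A minimal sketch using only the Python standard library; the 10% safety margin and the function name are assumptions, and the output size would come from an estimate such as the table above:

```python
import shutil


def has_disk_room(directory, output_gb, margin=1.10):
    """True if `directory` has free space for the output file plus a safety margin."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= output_gb * 1e9 * margin


# Example: a 7B model quantized to Q4_K_M needs about 4.5 GB for the output file.
print(has_disk_room(".", 4.5))
```

This checks disk only; the RAM column in the table still has to be checked separately.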
Online Quantization
If you don’t have sufficient hardware, use the GGUF-my-repo Hugging Face space:
- Visit https://huggingface.co/spaces/ggml-org/gguf-my-repo
- Enter your model repository
- Select quantization levels (multiple at once)
- The space converts and quantizes automatically
- Results are published to your Hugging Face account
Choosing the Right Quantization
Decision Tree
Determine Your Constraints
- <8 GB: Use Q2_K or Q3_K_M
- 8-16 GB: Use Q4_K_M
- 16-32 GB: Use Q5_K_M or Q6_K
- 32+ GB: Use Q8_0 or F16
Assess Quality Needs
- Maximum quality: Q8_0 or F16
- High quality: Q5_K_M or Q6_K
- Balanced: Q4_K_M ⭐
- Size-constrained: Q3_K_M
- Extreme compression: Q2_K
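The two steps above can be condensed into a small lookup. A minimal sketch, assuming the size thresholds refer to available memory in GB and treating each range's lower bound as inclusive (the lists do not say which side the boundaries fall on); the names are illustrative:

```python
def recommend_by_memory(available_gb):
    """Map available memory (GB) to the quantization types suggested above."""
    if available_gb < 8:
        return ["Q2_K", "Q3_K_M"]
    if available_gb < 16:
        return ["Q4_K_M"]
    if available_gb < 32:
        return ["Q5_K_M", "Q6_K"]
    return ["Q8_0", "F16"]


# Quality tiers, copied from the second list.
QUALITY_TIERS = {
    "maximum": ["Q8_0", "F16"],
    "high": ["Q5_K_M", "Q6_K"],
    "balanced": ["Q4_K_M"],
    "size-constrained": ["Q3_K_M"],
    "extreme": ["Q2_K"],
}

print(recommend_by_memory(12))   # ['Q4_K_M']
print(QUALITY_TIERS["high"])     # ['Q5_K_M', 'Q6_K']
```

When the memory limit and the quality target disagree, the memory limit is the hard constraint.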
Recommendations by Model Size
Small Models (1B-3B)
Medium Models (7B-13B)
Large Models (30B-70B)
Huge Models (100B+)
Evaluating Quality
Measure quantization quality using perplexity.
Troubleshooting
Out of memory during quantization
Quantized model gives nonsensical output
Possible causes:
- Quantization level too aggressive (Q2_K or lower)
- Corrupted quantization process
- Wrong model format
Cannot requantize an already quantized model
error: quantizing already quantized model
Solution: Add --allow-requantize flag, but note this degrades quality. Better to quantize from F16.
Quantization is very slow
Next Steps
After quantization:
- Test the model to ensure quality is acceptable
- Benchmark performance with llama-bench
- Deploy using llama-server or integrate into your application
- Share your quantized model on Hugging Face for others
- Supported Models - Check compatibility
- Converting Models - Get models into GGUF format
- Obtaining Models - Find pre-quantized models

