Trellis quants are based on a novel integer trellis rather than the scalar or block-based schemes used by other quant families. The integer trellis formulation enables reasonable CPU performance even at very low bits per weight — an unusual property at these compression levels.
## Available types
| Type | Bits per weight | Notes |
|---|---|---|
| IQ1_KT | ~1 | Extreme compression; quality highly dependent on model and imatrix |
| IQ2_KT | ~2 | Aggressive compression; practical for very large models |
| IQ3_KT | ~3 | Better quality retention than IQ2_KT at a moderate size increase |
| IQ4_KT | ~4 | Closest to standard 4-bit quality within the trellis family |
| Backend | Supported |
|---|---|
| CUDA | Yes |
| Metal | Yes |
| ARM NEON | Yes |
| CPU (AVX2) | Yes |
ROCm and Vulkan backends are not actively maintained. See the main README for details.
## When to use trellis quants
Trellis quants are the right choice when memory constraints are severe and other options do not fit:
- Very large models (70B+) where even IQ2_K does not fit in available memory
- Situations where you need the smallest possible file at a given quality floor
- Deployments on hardware where 1–2 BPW is the only viable option
For most use cases where memory permits, IQK quants at equivalent BPW will provide better quality. Trellis quants trade some quality headroom for extreme size reduction.
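To gauge whether a given type fits in memory, a rough estimate is parameters × bits per weight / 8 bytes. The sketch below uses assumed round numbers (a 70B-parameter model at ~2 BPW, i.e. IQ2_KT) and ignores GGUF metadata and any tensors kept at higher precision, so treat the result as a lower bound:

```shell
# Rough file-size estimate: parameters * BPW / 8 bytes.
# 70B parameters at ~2 BPW (IQ2_KT); metadata and higher-precision
# tensors are ignored, so the real file will be somewhat larger.
params=70000000000
bpw=2
bytes=$((params * bpw / 8))
echo "$((bytes / 1024 / 1024 / 1024)) GiB"   # roughly 16 GiB
```

The same arithmetic explains why the ~1 BPW types are sometimes the only option: halving BPW roughly halves the file.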
## Tradeoffs vs IQK quants
| | IQK quants | Trellis quants |
|---|---|---|
| Quality at same BPW | Higher | Lower |
| File size at same quality | Larger | Smaller |
| CPU performance | Good | Reasonable (novel integer trellis design) |
| Lowest available BPW | ~2 (IQ2_K) | ~1 (IQ1_KT) |
## Quantizing a model
```
llama-quantize --imatrix model.imatrix model-bf16.gguf output-IQ2_KT.gguf IQ2_KT
```
Always use an imatrix with trellis quants. At 1–2 BPW, calibration data has a significant impact on output quality.
The same --custom-q and --dry-run options available for IQK quants also work with trellis types. See the IQK quants page for usage details.
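A possible end-to-end workflow, sketched with placeholder file names; the `llama-imatrix` tool and its flags are assumed from the llama.cpp tooling, so check `--help` on your build before relying on them:

```shell
# 1. Generate an imatrix from calibration text (tool name and flags
#    assumed from llama.cpp tooling; verify against your build).
llama-imatrix -m model-bf16.gguf -f calibration.txt -o model.imatrix

# 2. Preview the per-tensor type assignment without writing a file.
llama-quantize --dry-run --imatrix model.imatrix model-bf16.gguf output-IQ2_KT.gguf IQ2_KT

# 3. Perform the actual quantization.
llama-quantize --imatrix model.imatrix model-bf16.gguf output-IQ2_KT.gguf IQ2_KT
```

Using representative calibration text matters most here: at 1–2 BPW the imatrix largely determines which weights survive with usable precision.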