Quantization

Quantization reduces model size and improves inference speed by representing weights and activations with lower precision. Cactus supports INT4, INT8, FP16, and FP32 precision types with hardware-optimized kernels.

Precision Types

INT4

4-bit integer
  • 0.5 bytes per parameter
  • 8x smaller than FP32
  • Minimal quality loss
  • Best for mobile

INT8

8-bit integer
  • 1 byte per parameter
  • 4x smaller than FP32
  • Near-lossless quality
  • Balanced option

FP16

16-bit float
  • 2 bytes per parameter
  • 2x smaller than FP32
  • Lossless quality
  • NPU-friendly

Precision Comparison

Precision  Bytes/Param  1B Model Size  Relative Speed  Quality
FP32       4            ~4 GB          1x (baseline)   100%
FP16       2            ~2 GB          1.5-2x          100%
INT8       1            ~1 GB          2-3x            99.5%
INT4       0.5          ~500 MB        3-4x            98-99%
Recommendation: Use INT4 for most mobile applications. The quality difference is negligible (less than 1% on most benchmarks) while providing the best memory efficiency.
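The sizes in the table follow directly from bytes per parameter. As a quick sanity check (`model_size_gb` is just an illustrative helper, not part of the Cactus API):

```python
# Bytes per parameter for each precision Cactus supports
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def model_size_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in decimal GB (group scales excluded)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 1B-parameter model: 4.0 GB at FP32 down to 0.5 GB at INT4
sizes = {p: model_size_gb(1_000_000_000, p) for p in BYTES_PER_PARAM}
```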

How Quantization Works

Group Quantization

Cactus uses group quantization where weights are quantized in groups of 32 elements with per-group FP16 scales:
// FP16 weights → INT8 with group scales
const __fp16* fp16_weights = /* ... */;  // Original weights
int8_t* int8_weights = /* ... */;         // Quantized weights
__fp16* scales = /* ... */;               // Per-group scales

size_t group_size = 32;
size_t num_groups = (weight_count + group_size - 1) / group_size;

for (size_t g = 0; g < num_groups; g++) {
    size_t begin = g * group_size;
    size_t end = std::min(begin + group_size, weight_count);  // last group may be partial

    // Find max absolute value in group
    float max_abs = 0.0f;
    for (size_t i = begin; i < end; i++) {
        max_abs = std::max(max_abs, std::abs((float)fp16_weights[i]));
    }

    // Compute scale: max_abs / 127 (for INT8); guard against all-zero groups
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    scales[g] = (__fp16)scale;

    // Quantize each element, rounding to nearest
    for (size_t i = begin; i < end; i++) {
        int8_weights[i] = (int8_t)std::lround((float)fp16_weights[i] / scale);
    }
}
Why Group Quantization?
  • Better accuracy than per-tensor quantization
  • Captures local weight distribution
  • Group size of 32 optimizes for ARM SIMD (NEON)
  • Minimal overhead (one FP16 scale per 32 weights)
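As a reference (not the production kernel), the same grouping scheme can be sketched in NumPy; `quantize_int8_grouped` and `dequantize_int8_grouped` are illustrative names:

```python
import numpy as np

def quantize_int8_grouped(weights: np.ndarray, group_size: int = 32):
    """Reference group quantization: one FP16 scale per group of weights."""
    flat = weights.astype(np.float32).ravel()
    pad = (-len(flat)) % group_size                  # zero-pad a partial last group
    groups = np.pad(flat, (0, pad)).reshape(-1, group_size)
    max_abs = np.abs(groups).max(axis=1)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0)  # avoid divide-by-zero
    q = np.clip(np.round(groups / scales[:, None]), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_int8_grouped(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct FP32 values; caller trims any padding."""
    return q.astype(np.float32) * scales.astype(np.float32)[:, None]
```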

INT4 Packing

INT4 weights are packed 2 values per byte:
// Pack two INT4 values into one byte
uint8_t packed = (val1 & 0x0F) | ((val2 & 0x0F) << 4);

// Unpack INT4 values
int8_t val1 = packed & 0x0F;           // Lower 4 bits
int8_t val2 = (packed >> 4) & 0x0F;    // Upper 4 bits

// Sign-extend from 4-bit to 8-bit
val1 = (val1 ^ 0x08) - 0x08;  // -8 to +7 range
val2 = (val2 ^ 0x08) - 0x08;
Storage Savings:
  • 2 values per byte → 2x memory reduction vs INT8
  • Group size still 32 for scale alignment
  • Unpacking happens in SIMD kernels
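The same packing scheme in Python, handy for tooling or tests (`pack_int4`/`unpack_int4` are illustrative names, using the same sign-extension trick as the C snippet above):

```python
def pack_int4(v1: int, v2: int) -> int:
    """Pack two signed 4-bit values (-8..7) into one byte."""
    assert -8 <= v1 <= 7 and -8 <= v2 <= 7
    return (v1 & 0x0F) | ((v2 & 0x0F) << 4)

def unpack_int4(packed: int) -> tuple:
    """Unpack and sign-extend two 4-bit values from one byte."""
    v1 = packed & 0x0F
    v2 = (packed >> 4) & 0x0F
    # Sign-extend 4-bit -> full int via (x ^ 0x08) - 0x08
    return (v1 ^ 0x08) - 0x08, (v2 ^ 0x08) - 0x08
```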

Quantization in Practice

Quantization at Inference Time

Cactus handles mixed precision automatically.
Weights are quantized once during model conversion:
// Weights stored as INT4/INT8 in files
size_t weight = graph.mmap_weights("layer.0.weight");

// Scales stored alongside quantized weights
graph.set_grouped_scales(
    weight,
    /*group_size=*/32,
    num_groups,
    scales_ptr
);
Dequantization happens inside SIMD kernels:
// INT8 matmul with group scales
cactus_matmul_int8(
    activations_int8, act_scales,
    weights_int8, weight_scales,
    output_fp16,
    M, K, N, /*group_size=*/32
);
Activations are quantized dynamically:
// Activations start as FP16
size_t hidden = graph.matmul(input, weight);

// Quantize for next layer
size_t quantized = graph.quantize_activations(hidden);
Per-tensor quantization:
  • Find max absolute value across tensor
  • Scale = max_abs / 127.0
  • Quantize: int8_val = fp16_val / scale
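The per-tensor steps above can be sketched as follows (`quantize_activations` here is an illustrative Python mirror, not the graph API):

```python
import numpy as np

def quantize_activations(x: np.ndarray):
    """Dynamic per-tensor INT8 quantization of FP16 activations."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # guard all-zero tensors
    q = np.clip(np.round(x.astype(np.float32) / scale), -127, 127).astype(np.int8)
    return q, scale   # dequantize later as q * scale
```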
Automatic in Graph:
// Graph handles precision automatically
size_t a = graph.input({M, K}, Precision::INT8);
size_t b = graph.mmap_weights("weight");  // INT4/INT8
size_t c = graph.matmul(a, b);  // Output: FP16

// Precision cast if needed
size_t c_int8 = graph.precision_cast(c, Precision::INT8);
Key-Value cache is quantized to INT8 to save memory:
// Quantize KV cache after each decode step
void KVCache::update_from_graph(
    CactusGraph* gb,
    const std::vector<size_t>& k_nodes,
    const std::vector<size_t>& v_nodes,
    size_t seq_len,
    size_t num_layers,
    size_t kv_heads,
    size_t head_dim
) {
    for (size_t layer = 0; layer < num_layers; layer++) {
        const __fp16* k_fp16 = (const __fp16*)gb->get_output(k_nodes[layer]);
        const __fp16* v_fp16 = (const __fp16*)gb->get_output(v_nodes[layer]);
        
        // Quantize with group size 32
        cactus_quantize_kv_fp16_to_int8(
            k_fp16, layer_caches[layer].keys.data(),
            layer_caches[layer].key_scales.data(),
            seq_len, kv_heads, head_dim,
            /*group_size=*/32
        );
        
        cactus_quantize_kv_fp16_to_int8(
            v_fp16, layer_caches[layer].values.data(),
            layer_caches[layer].value_scales.data(),
            seq_len, kv_heads, head_dim,
            /*group_size=*/32
        );
    }
}
Memory Savings:
Context Length  FP16 Cache  INT8 Cache  Savings
512 tokens      64 MB       32 MB       50%
1024 tokens     128 MB      64 MB       50%
2048 tokens     256 MB      128 MB      50%
Based on 28-layer model with 8 KV heads and 128 head dim
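For reference, the cache size follows from keys plus values, one element per layer, KV head, head dimension, and token. The sketch below counts raw element storage only (per-group scales and any allocator padding excluded), so it comes out somewhat below the table's rounded figures; the 50% INT8 saving holds either way.

```python
def kv_cache_bytes(seq_len, num_layers, kv_heads, head_dim, bytes_per_elem):
    """Raw KV cache storage: keys + values, scales excluded."""
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem

# 28 layers, 8 KV heads, head_dim 128, 512 tokens:
fp16_cache = kv_cache_bytes(512, 28, 8, 128, 2)   # FP16: 2 bytes/element
int8_cache = kv_cache_bytes(512, 28, 8, 128, 1)   # INT8: 1 byte/element
# INT8 halves the cache at every context length
```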

Optimized Kernels

Cactus provides SIMD-optimized kernels for quantized operations:

INT4 Matrix Multiplication

void cactus_matmul_int4(
    const int8_t* A,              // INT8 activations
    const float* A_scales,        // Activation scales
    const int8_t* B_packed,       // INT4 weights (packed)
    const __fp16* B_scales,       // Weight scales per group
    __fp16* C,                    // FP16 output
    size_t M, size_t K, size_t N,
    size_t group_size             // 32
);
NEON Implementation:
  • Unpack INT4 → INT8 in SIMD registers
  • INT8×INT8 → INT32 accumulation (16 elements at a time)
  • Apply scales and convert to FP16
  • ~3-4x faster than FP32 matmul

INT8 Matrix Multiplication

void cactus_matmul_int8(
    const int8_t* A,              // INT8 activations
    const float* A_scales,        // Activation scales
    const int8_t* B,              // INT8 weights
    const __fp16* B_scales,       // Weight scales per group
    __fp16* C,                    // FP16 output
    size_t M, size_t K, size_t N,
    size_t group_size             // 32
);
NEON Implementation:
  • INT8×INT8 → INT32 dot products
  • 16-way SIMD parallelism
  • Fused dequantization
  • ~2-3x faster than FP32 matmul
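Ignoring SIMD layout, the semantics these kernels implement can be expressed in NumPy: accumulate INT8 products in INT32 one group at a time, then apply the activation and per-group weight scales. This is a reference sketch with an illustrative name, not the NEON path:

```python
import numpy as np

def matmul_int8_grouped(A, a_scale, B, b_scales, group_size=32):
    """Reference group-scaled INT8 matmul.

    A: (M, K) int8 activations with one per-tensor scale a_scale.
    B: (K, N) int8 weights with one scale per group of group_size rows.
    """
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N), dtype=np.float32)
    for g in range(K // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        # INT8 x INT8 accumulated in INT32, then dequantized per group
        acc = A[:, sl].astype(np.int32) @ B[sl, :].astype(np.int32)
        C += acc.astype(np.float32) * (a_scale * float(b_scales[g]))
    return C.astype(np.float16)   # FP16 output, as in the kernel signature
```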

Hybrid Attention (INT8 Cache + FP16 Queries)

void cactus_attention_hybrid_int8_fp16(
    const __fp16* queries,         // FP16 queries
    const int8_t* keys_cached,     // INT8 cached keys
    const int8_t* values_cached,   // INT8 cached values
    const float* k_scales,         // Key scales
    const float* v_scales,         // Value scales
    const __fp16* keys_new,        // FP16 new keys
    const __fp16* values_new,      // FP16 new values
    __fp16* output,
    size_t batch_size, size_t seq_len,
    size_t cache_len, size_t new_len,
    size_t num_q_heads, size_t num_kv_heads,
    size_t head_dim, float scale,
    size_t position_offset,
    bool is_causal,
    size_t window_size,
    size_t group_size = 32
);
Features:
  • Fused attention over INT8 cache + FP16 new tokens
  • Dequantization happens in-kernel
  • Sliding window support
  • No quality loss with group quantization

Code Examples

Quantizing Weights (Python)

import cactus
import numpy as np

# Load FP32 weights
weights = np.load("weights.npy")  # shape: (4096, 4096)

# Quantize to INT4
quantizer = cactus.Quantizer(precision="INT4", group_size=32)
quantized_weights, scales = quantizer.quantize(weights)

print(f"Original size: {weights.nbytes / 1024 / 1024:.2f} MB")
print(f"Quantized size: {quantized_weights.nbytes / 1024 / 1024:.2f} MB")
print(f"Compression ratio: {weights.nbytes / quantized_weights.nbytes:.1f}x")

# Dequantize for validation
reconstructed = quantizer.dequantize(quantized_weights, scales)
error = np.abs(weights - reconstructed).mean()
print(f"Mean absolute error: {error:.6f}")

Using Graph API with Quantization (C++)

#include "cactus/graph/graph.h"

CactusGraph graph;

// Input activations (INT8)
size_t input = graph.input({1, 512}, Precision::INT8);

// Load quantized weights (INT4)
size_t weight = graph.mmap_weights("layer.0.weight");
graph.set_grouped_scales(weight, 32, num_groups, scales_ptr);

// Matrix multiply (INT8 × INT4 → FP16)
size_t output = graph.matmul(input, weight);

// Set input data
int8_t input_data[512];
float input_scale = 0.05f;
graph.set_input(input, input_data, Precision::INT8);

// Execute
graph.execute();

// Get output (FP16)
__fp16* result = (__fp16*)graph.get_output(output);

Model Conversion Script

import cactus

# Load model from HuggingFace
model = cactus.convert_from_huggingface(
    "google/gemma-3-270m-it",
    precision="INT4",
    group_size=32,
    output_dir="./weights/gemma-3-270m-int4"
)

print("Model converted successfully")
print(f"Size: {model.size_mb:.1f} MB")
print(f"Precision: {model.precision}")

Quantization Performance

Memory Usage

1.2B Parameter Model (LFM2-1.2B):
Component          FP32    FP16    INT8    INT4
Weights            4.8 GB  2.4 GB  1.2 GB  600 MB
Activations        128 MB  64 MB   32 MB   32 MB
KV Cache (1K ctx)  256 MB  128 MB  64 MB   64 MB
Total              5.2 GB  2.6 GB  1.3 GB  700 MB

Inference Speed

iPhone 17 Pro, LFM2-1.2B:
Precision  Prefill  Decode  Time to First Token
FP32       85 t/s   22 t/s  380 ms
FP16       245 t/s  35 t/s  165 ms
INT8       310 t/s  42 t/s  130 ms
INT4       327 t/s  48 t/s  120 ms
INT4 is fastest because:
  1. Smaller memory footprint → better cache utilization
  2. Fewer memory transfers → less bandwidth pressure
  3. SIMD-optimized INT4 kernels
  4. Lower precision = higher throughput

Quality Impact

Benchmark Results

LFM2-1.2B on MMLU (0-shot):
Precision  Accuracy  Δ vs FP32
FP32       52.3%     -
FP16       52.3%     0.0%
INT8       52.1%     -0.2%
INT4       51.8%     -0.5%
Gemma-3-270m on HellaSwag:
Precision  Accuracy  Δ vs FP32
FP32       45.2%     -
FP16       45.2%     0.0%
INT8       45.0%     -0.2%
INT4       44.7%     -0.5%
Whisper-Small on Librispeech (WER):
Precision  WER   Δ vs FP32
FP32       3.2%  -
FP16       3.2%  0.0%
INT8       3.2%  0.0%
INT4       3.3%  +0.1%
Key Takeaway: INT4 quantization with group size 32 maintains 98-99% of original model quality while reducing memory by 8x.

Best Practices

Precision selection:
  • INT4 for models > 500M parameters on mobile
  • INT8 for models < 500M parameters or quality-sensitive tasks
  • FP16 for NPU execution or maximum quality
  • Use --reconvert if changing precision

Group size:
  • Always use group size 32 (the default)
  • Smaller groups (16) = higher quality, more scale-storage overhead
  • Larger groups (64) = more memory efficient, lower quality
  • Group size 32 is optimal for ARM NEON

KV cache:
  • Enable INT8 KV cache for contexts > 512 tokens
  • Saves 50% memory with no quality loss
  • Automatic in Cactus (no code changes needed)
kv_cache.set_window_size(1024, /*sink_size=*/4);
// KV cache automatically quantized to INT8

Validation:
  • Test INT4 quality on your specific use case
  • Use perplexity or task-specific metrics
  • Compare INT4 vs INT8 vs FP16 speed
cactus test --model LiquidAI/LFM2-1.2B --precision INT4 --benchmark
cactus test --model LiquidAI/LFM2-1.2B --precision INT8 --benchmark

Fine-tuning:
  • Quantization-aware training (QAT) gives the best quality
  • Fine-tune FP32 → quantize → fine-tune INT4
  • Use a larger learning rate for quantized fine-tuning
See the Fine-Tuning Guide for details.
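The group-size tradeoff above can be checked numerically. A sketch using symmetric INT8 group quantization on random Gaussian weights (`groupwise_int8_error` is an illustrative helper):

```python
import numpy as np

def groupwise_int8_error(w, group_size):
    """Mean absolute reconstruction error of symmetric INT8 group quantization."""
    g = w.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                       # guard all-zero groups
    q = np.clip(np.round(g / scales), -127, 127)
    return float(np.abs(g - q * scales).mean())

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 16).astype(np.float32)
errs = {gs: groupwise_int8_error(w, gs) for gs in (16, 32, 64)}
# Larger groups have larger per-group maxima, hence coarser steps and more error
```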

Architecture

How quantization fits into Cactus’s design

Models

RAM usage for different model sizes

Graph API

Using quantization in computation graphs

Optimization Guide

Advanced performance tuning
