Performance Tuning
Cactus provides several configuration options to optimize inference performance for your specific use case. This guide covers KV cache management, chunked prefill, TPS throttling, memory optimization, and benchmarking.
KV Cache Configuration
The Key-Value (KV) cache stores attention states from previous tokens, enabling fast autoregressive generation. Proper cache configuration balances memory usage and context length.
Sliding Window Cache
By default, Cactus uses a sliding window cache that keeps the most recent tokens plus a small “sink” of initial tokens.
#include "cactus/engine/engine.h"
cactus::Engine* engine = /* ... load model ... */;
// Set sliding window: 2048 recent tokens + 4 sink tokens
engine->set_cache_window(2048, 4);
Parameters:
window_size — Maximum recent tokens to cache (e.g., 1024, 2048, 4096)
sink_size — Number of initial tokens to always keep (default: 4)
Trade-offs:
| Window Size | Memory Usage | Context Length | Use Case |
|---|---|---|---|
| 512 | Low (~50MB) | Short | Chatbots, quick Q&A |
| 1024 | Medium (~100MB) | Medium | Most applications |
| 2048 | High (~200MB) | Long | Document analysis |
| 4096 | Very High (~400MB) | Very Long | RAG, long-form generation |
Memory scales with model size: The above estimates are for ~1B param models. Larger models use proportionally more cache memory.
Resetting Cache
// Clear KV cache (e.g., between conversations)
engine->reset_cache();
Call reset_cache() when:
- Starting a new conversation
- Switching contexts
- Memory pressure is high
Cache Quantization
Cactus automatically quantizes the KV cache to INT8, providing a 2x memory reduction with minimal quality loss.
Automatic since v1.7 (Oct 2025) — No configuration needed!
Chunked Prefill
Chunked prefill processes long prompts in chunks rather than all at once, reducing memory spikes and improving time-to-first-token on long contexts.
Configuring Chunk Size
// Process prefill in chunks of 256 tokens
std::vector<uint32_t> tokens = tokenizer->encode("Long prompt...");
engine->prefill(tokens, 256); // chunk_size = 256
Default: 256 tokens (optimal for most models)
Chunk Size Guidelines:
| Chunk Size | Memory | Latency | Throughput | Best For |
|---|---|---|---|---|
| 128 | Lowest | Higher | Lower | Budget devices, limited RAM |
| 256 | Low | Medium | High | Default — most cases |
| 512 | Medium | Low | Higher | High-end devices, speed priority |
| 1024+ | High | Lowest | Highest | Desktop/Mac only |
NPU prefill uses fixed chunk size: If NPU acceleration is enabled, the chunk size is determined by the NPU model architecture (typically 256) and cannot be changed at runtime. The chunk_size parameter is ignored.
From the Cactus README benchmarks:
LFM 1.2B, 1k-token prefill / 100-token decode:
| Device | Prefill TPS | Decode TPS |
|---|---|---|
| Mac M4 Pro | 582 tok/s | 100 tok/s |
| iPad/Mac M3 | 350 tok/s | 60 tok/s |
| iPhone 17 Pro | 327 tok/s | 48 tok/s |
| iPhone 13 Mini | 148 tok/s | 34 tok/s |
| Galaxy S25 Ultra | 255 tok/s | 37 tok/s |
| Pixel 6a | 70 tok/s | 15 tok/s |
| Raspberry Pi 5 | 69 tok/s | 11 tok/s |
Tips:
- Prefill is the bottleneck for long prompts — use chunked prefill
- NPU acceleration — 5-11x faster prefill on Apple devices (v1.15+)
- Cactus Attention — Makes long prefill as fast as short (v1.9+)
TPS Throttling
Limit maximum tokens-per-second to reduce power consumption and thermal throttling.
Setting Max TPS
// Limit to 30 tokens/second
const char* options = R"({
"max_tps": 30.0
})";
cactus_complete(model, messages, response, sizeof(response),
options, nullptr, nullptr, nullptr);
Use cases:
- Streaming UI — Match typing speed (~20-30 TPS feels natural)
- Power saving — Reduce battery drain on mobile
- Thermal management — Prevent device overheating during long sessions
Trade-offs:
| Max TPS | Power Usage | User Experience |
|---|---|---|
| Unlimited | Highest | Fastest response |
| 50 | High | Still feels instant |
| 30 | Medium | Smooth typing speed |
| 20 | Low | Comfortable reading pace |
| 10 | Lowest | Noticeably slow |
Set max_tps to -1.0 (the default) for unlimited speed. The engine will generate as fast as the hardware allows.
Memory Optimization
1. Use INT4 quantization
cactus convert Qwen/Qwen3-0.6B ./model --precision INT4
- 50% smaller weights vs INT8
- Minimal quality loss
2. Reduce the KV cache window
engine->set_cache_window(512, 4); // 512 tokens instead of 2048
- 4x less cache memory
- Shorter effective context
3. Use smaller models
- Gemma3-270m: ~120MB RAM
- Qwen3-0.6B: ~200MB RAM
- LFM2.5-1.2B: ~400MB RAM
4. Free memory between sessions
engine->reset_cache(); // Clear KV cache
// Or destroy and recreate the model
cactus_destroy(model);
model = cactus_init(model_path, nullptr, false);
Memory Benchmarks
From the Cactus README benchmarks:
INT4 Models:
| Model | iPhone 17 Pro | Galaxy S25 Ultra | Raspberry Pi 5 |
|---|---|---|---|
| LFM 1.2B | 108MB | 1.5GB | 869MB |
| LFMVL 1.6B | 108MB | 1.5GB | 869MB |
| Parakeet 1.1B | 108MB | 1.5GB | 869MB |
Android uses more RAM: Android’s memory overhead is higher than iOS for the same model. This is due to JNI/JVM overhead and different memory allocators. Use smaller models or INT4 quantization on Android.
Monitoring Memory Usage
// Memory usage included in completion response
const char* response = cactus_complete(/* ... */);
// Parse JSON response
// {
// "ram_usage_mb": 245.67,
// "prefill_tokens": 28,
// "decode_tokens": 50,
// ...
// }
Benchmarking
Use the --benchmark flag to measure performance on your specific hardware.
Command-Line Benchmarks
# Benchmark with default model
cactus test --benchmark
# Benchmark specific model
cactus test --model ./my-model --benchmark
# Benchmark on connected iPhone
cactus test --model ./my-model --benchmark --ios
# Benchmark on connected Android
cactus test --model ./my-model --benchmark --android
Metrics Reported
{
"success": true,
"response": "Generated text...",
"confidence": 0.8193,
"time_to_first_token_ms": 45.23,
"total_time_ms": 163.67,
"prefill_tps": 1621.89,
"decode_tps": 168.42,
"ram_usage_mb": 245.67,
"prefill_tokens": 28,
"decode_tokens": 50,
"total_tokens": 78
}
Key metrics:
time_to_first_token_ms — How long until first token appears (latency)
prefill_tps — Prompt processing speed (tokens/sec)
decode_tps — Generation speed (tokens/sec)
ram_usage_mb — Peak memory usage
Profiling with Trace Files
// Profile prefill with trace file
std::vector<uint32_t> tokens = tokenizer->encode("Prompt");
engine->prefill(tokens, 256, "prefill_trace.json");
// Analyze trace file to identify bottlenecks
The trace file contains:
- Per-layer timings
- Memory allocations
- NPU vs CPU split
- Cache hit rates
Optimization Recipes
For Speed (High-End Devices)
// Large cache window for long context
engine->set_cache_window(4096, 4);
// Large prefill chunks
engine->prefill(tokens, 512);
// No TPS limit
const char* options = R"({"max_tps": -1.0})";
// Use INT8 for quality
// cactus convert <model> --precision INT8
For Memory (Budget Devices)
// Small cache window
engine->set_cache_window(512, 4);
// Small prefill chunks
engine->prefill(tokens, 128);
// Use INT4 quantization
// cactus convert <model> --precision INT4
// Consider smaller base model (Gemma3-270m, Qwen3-0.6B)
For Battery Life (Mobile)
// Moderate cache
engine->set_cache_window(1024, 4);
// Throttle TPS to reduce power
const char* options = R"({"max_tps": 30.0})";
// NPU acceleration (iOS/macOS)
engine->load_npu_prefill("./model/npu_prefill.mlmodelc");
// INT4 quantization for less memory bandwidth
For Quality (Critical Applications)
// Large cache for full context
engine->set_cache_window(4096, 4);
// INT8 or FP16 quantization
// cactus convert <model> --precision INT8
// No TPS throttling
const char* options = R"({"max_tps": -1.0})";
// Larger base model (LFM2.5-1.2B, Qwen3-1.7B)
Advanced Techniques
Cactus Attention (v1.9+)
Automatic optimization that keeps decode speed constant regardless of how long the prefill was.
Enabled by default — No configuration needed!
Impact: decode speed after a 1k-token prefill matches decode speed after a 10-token prefill.
Hybrid Inference (v1.15+)
Automatic blend of NPU (prefill) and CPU (decode) for optimal performance.
Enabled automatically on devices with compatible NPUs (Apple Neural Engine, coming to Qualcomm/MediaTek).
Impact: 5-11x faster prefill on iOS/macOS.
Lossless Quantization (v1.15+)
Advanced quantization techniques that maintain quality while improving speed.
Enabled by default with INT4/INT8 quantization.
Impact: 1.5x speedup vs naive quantization.
See Also