# Architecture

Cactus is designed as a three-layer architecture that separates high-level AI workflows from low-level hardware optimizations. This design enables efficient on-device inference across diverse mobile hardware while keeping the codebase clean and maintainable.

## Three-Layer Design

Cactus is built on a modular architecture that cleanly separates concerns.

### Engine Layer (High-Level API)
The Engine Layer provides developer-friendly APIs for common AI tasks:
- Text Completion: Chat, instruction following, conversation
- Vision: Image understanding, multi-modal inputs
- Transcription: Speech-to-text (Whisper, Parakeet, Moonshine)
- Embeddings: Text and image embeddings for RAG
- Tool Calling: Function calling with JSON schema validation
- Cloud Handoff: Automatic fallback to cloud models based on confidence
The engine also provides:

- OpenAI-compatible message format
- Streaming token generation
- RAG with automatic vector indexing
- Multi-language SDKs (Python, Swift, Kotlin, Rust, Flutter)
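The OpenAI-compatible message format means prompts are expressed as role/content pairs. A minimal sketch of that structure and how it might be flattened into a prompt string; the `render_prompt` helper is illustrative, not part of the Cactus SDK:

```python
# Messages follow the OpenAI chat convention: a list of role/content dicts.
messages = [
    {"role": "system", "content": "You are a helpful on-device assistant."},
    {"role": "user", "content": "Summarize this note in one sentence."},
]

def render_prompt(messages):
    """Flatten OpenAI-style messages into a single prompt string.

    Hypothetical helper for illustration; real engines apply a
    model-specific chat template instead of this naive join.
    """
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

prompt = render_prompt(messages)
```

Because the format matches the OpenAI convention, existing client code that builds chat histories can be reused unchanged.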
### Graph Layer (Computation Graph)

The Graph Layer is a zero-copy computation graph framework inspired by PyTorch. It handles precision conversions, broadcasting, and operator fusion automatically; see the Graph API Reference for complete documentation. Key features:
- Node-Based Operations: Tensor operations as graph nodes
- Lazy Execution: Build graph first, execute later
- Memory Efficiency: Zero-copy operations, smart buffer pooling
- Precision Support: INT4, INT8, FP16, FP32
- Mixed Precision: Automatic precision casting where needed
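The node-based, lazy-execution idea can be sketched in a few lines: operations are recorded as graph nodes when expressions are built, and nothing is computed until the graph is explicitly evaluated. Class and method names here are illustrative assumptions, not the actual Cactus Graph API:

```python
class Node:
    """A graph node: either a leaf value or an operation over inputs."""

    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

    def __add__(self, other):
        return Node("add", (self, other))   # records the op, computes nothing

    def __mul__(self, other):
        return Node("mul", (self, other))

def evaluate(node):
    """Lazy execution: arithmetic only happens on this call."""
    if node.op == "leaf":
        return node.value
    a, b = (evaluate(i) for i in node.inputs)
    return a + b if node.op == "add" else a * b

x = Node("leaf", value=3.0)
y = Node("leaf", value=4.0)
z = x * y + x          # builds the graph only
result = evaluate(z)   # 3*4 + 3 = 15.0
```

Deferring execution like this is what lets a real graph layer fuse operators and insert precision casts before any kernel runs.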
### Kernel Layer (Hardware Optimizations)
The Kernel Layer contains highly optimized ARM SIMD implementations:
- INT4/INT8 Quantized Operations: Group quantization with FP16 scales
- Custom Attention: Optimized for mobile memory hierarchies
- KV Cache Quantization: INT8 cache compression
- NEON/SME2 SIMD: ARM vector intrinsics for maximum throughput
- Cache-Friendly Access: Optimized memory access patterns
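To make the INT4 storage concrete: a 4-bit quantized kernel consumes weights packed two values per byte. The packing scheme below (low nibble first, values offset by 8 into the 0..15 range) is a common convention assumed here for illustration, not necessarily the exact Cactus layout:

```python
def pack_int4(values):
    """Pack signed 4-bit values (-8..7) into bytes, two per byte."""
    assert all(-8 <= v <= 7 for v in values) and len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo + 8) | ((hi + 8) << 4))  # offset to unsigned nibbles
    return bytes(out)

def unpack_int4(packed):
    """Recover the signed values from the packed bytes."""
    vals = []
    for b in packed:
        vals.append((b & 0x0F) - 8)  # low nibble
        vals.append((b >> 4) - 8)    # high nibble
    return vals

weights = [-8, 7, 0, -1]
packed = pack_int4(weights)   # 4 values fit in 2 bytes
```

Halving the weight footprint this way is what makes INT4 matrix multiplication attractive on memory-bandwidth-limited mobile hardware.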
Key kernels include:

- `cactus_matmul_int4()`: INT4 matrix multiplication
- `cactus_matmul_int8()`: INT8 matrix multiplication with group scales
- `cactus_attention_f16()`: FP16 attention with causal masking
- `cactus_attention_hybrid_int8_fp16()`: Hybrid KV cache attention
- `cactus_quantize_kv_fp16_to_int8()`: KV cache quantization

These kernels live in `cactus/kernel/` and are optimized for Apple, Qualcomm, and Samsung processors.

## Hybrid NPU/CPU Execution
Cactus intelligently distributes computation between the NPU (Neural Processing Unit) and the CPU for optimal performance. Execution targets include the Apple NPU, the Qualcomm NPU, and a CPU fallback.

Supported Operations:

- Matrix multiplication (up to 11x faster than on CPU)
- Convolution layers
- Attention mechanisms
- Layer normalization

Apple NPU support:

- Available on A14+ (iPhone 12+) and M1+ Macs
- FP16 precision
- Automatic chunked prefill for long sequences
- Zero-copy integration with CPU execution

| Device | CPU Only | With NPU | Speedup |
|---|---|---|---|
| iPhone 17 Pro | 48 t/s | 327 t/s | 6.8x |
| Mac M4 Pro | 100 t/s | 582 t/s | 5.8x |
| iPad M3 | 60 t/s | 350 t/s | 5.8x |

Models with NPU Support:

- LFM2-VL (vision models)
- Whisper (all sizes)
- Parakeet transcription models
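The dispatch logic behind hybrid execution can be sketched simply: route an operation to the NPU when the device has one and the op is supported there, otherwise fall back to the CPU. The capability table and function name below are illustrative assumptions, not the actual Cactus scheduler:

```python
# Ops the NPU path can run, per the supported-operations list above.
NPU_SUPPORTED_OPS = {"matmul", "conv", "attention", "layernorm"}

def choose_backend(op, npu_available):
    """Return 'npu' when the op can run there, else fall back to 'cpu'."""
    if npu_available and op in NPU_SUPPORTED_OPS:
        return "npu"
    return "cpu"

backend = choose_backend("matmul", npu_available=True)   # -> "npu"
fallback = choose_backend("softmax", npu_available=True)  # -> "cpu"
```

In a real system the decision would also weigh tensor shapes and transfer costs, but the always-available CPU fallback is what keeps the same model runnable on every device.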
## Key Architectural Features
### Chunked Prefill

Cactus processes long input sequences in chunks to maintain consistent decode speeds:

- Long prefill (1000 tokens) yields the same decode speed as short prefill (10 tokens)
- Prevents memory pressure during prefill
- Enables streaming responses sooner
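The idea can be sketched as a loop that feeds the prompt through the model in fixed-size pieces, so peak memory stays bounded regardless of prompt length. The chunk size and forward function here are illustrative assumptions:

```python
CHUNK_SIZE = 128  # assumed chunk length; real values are tuned per device

def chunked_prefill(tokens, forward_chunk):
    """Process `tokens` in CHUNK_SIZE pieces, growing the KV cache as we go."""
    kv_cache = []
    for start in range(0, len(tokens), CHUNK_SIZE):
        chunk = tokens[start:start + CHUNK_SIZE]
        # Each chunk attends to everything cached so far, then extends it.
        kv_cache.extend(forward_chunk(chunk, kv_cache))
    return kv_cache

# Toy forward pass that produces one cache entry per token.
cache = chunked_prefill(list(range(1000)), lambda chunk, kv: chunk)
```

Because each forward pass touches only `CHUNK_SIZE` new tokens, the first chunk finishes quickly, which is also why streaming responses can start sooner.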
### KV Cache Quantization

Cactus compresses the key-value cache using INT8 quantization with group scales:

- 2x reduction in cache size (FP16 → INT8)
- Negligible accuracy loss thanks to group quantization
- Critical for long context on mobile devices
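A minimal sketch of group quantization as applied to the cache: each group of values shares one scale, so outliers in one group cannot degrade precision elsewhere. The group size of 32 is an assumption for illustration:

```python
GROUP_SIZE = 32  # assumed group size; per-group scales bound the error

def quantize_group(values):
    """Quantize one group of floats to INT8 with a shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # guard all-zero groups
    q = [round(v / scale) for v in values]            # each q fits in -127..127
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

group = [0.5, -1.0, 0.25, 1.0] * 8   # one 32-value group
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
```

The INT8 codes plus one scale per group take roughly half the space of the original FP16 values, which is where the 2x cache reduction comes from.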
### Cactus Attention

Custom attention implementation optimized for mobile.

#### Standard Attention
- Fused softmax and matrix multiply
- Cache-friendly memory access
- Multi-head attention with GQA (Grouped Query Attention)
#### Windowed Attention

Sliding window attention for long sequences: maintains the first 4 tokens (the attention sink) plus a sliding window of 1024 tokens.
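The visibility rule above can be expressed as a small predicate; the constants match the text, while the function itself is an illustrative sketch rather than the kernel's actual masking code:

```python
SINK_TOKENS = 4     # the first tokens stay visible forever
WINDOW_SIZE = 1024  # how far back a query can otherwise see

def is_visible(query_pos, key_pos):
    """Can the query at `query_pos` attend to the key at `key_pos`?"""
    if key_pos > query_pos:            # causal mask: no attending to the future
        return False
    if key_pos < SINK_TOKENS:          # sink tokens are always visible
        return True
    return query_pos - key_pos < WINDOW_SIZE  # inside the sliding window

visible_sink = is_visible(5000, 0)      # sink token, still visible
visible_recent = is_visible(5000, 4500) # inside the 1024-token window
evicted = is_visible(5000, 100)         # fell out of the window
```

Keeping the sink tokens around is what preserves generation quality when the window slides far past the start of the prompt.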
### Memory-Mapped Weights

Weights are memory-mapped for efficient loading:

- Fast model loading (no copying into RAM)
- Only active weights in memory
- OS handles paging automatically
- Supports models larger than RAM
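The mechanism is standard OS memory mapping; a sketch using Python's stdlib `mmap` shows the idea. The flat float32 file layout is illustrative, not the Cactus weight format:

```python
import mmap
import os
import struct
import tempfile

# Create a stand-in "weights file" of little-endian float32 values.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))

with open(path, "rb") as f:
    # Map the file instead of reading it: no copy into RAM up front.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages actually touched get faulted into memory by the OS.
    third_weight = struct.unpack_from("<f", mm, 2 * 4)[0]
    mm.close()
```

Since the OS pages weights in and out on demand, a model file larger than physical RAM can still be served, at the cost of page faults on cold accesses.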
## Related Resources

- Models: supported models and their features
- Quantization: INT4/INT8/FP16 precision options
- Engine API: high-level API reference
- Graph API: computation graph reference