NPU Acceleration
Cactus leverages Neural Processing Units (NPUs) on mobile devices to accelerate inference, delivering 5-11x faster prefill. This guide covers NPU support, performance gains, and how to enable or disable NPU backends.
Supported NPUs
Apple Neural Engine (iOS/macOS)
Status: ✅ Shipped (Jan 2026, v1.15+)
Devices:
- iPhone 12 and newer (A14 Bionic+)
- iPad Air (4th gen) and newer (M1+)
- iPad Pro (3rd gen) and newer (A12X+)
- Mac with Apple Silicon (M1, M2, M3, M4)
- Apple Watch Series 6+ (S6+)
Supported Models:
- Vision models: LFM2-VL-450M, LFM2.5-VL-1.6B
- Speech models: whisper-tiny, whisper-base, whisper-small, whisper-medium, parakeet-ctc-0.6b, parakeet-ctc-1.1b, parakeet-tdt-0.6b-v3
Performance Gains:
- 5-11x faster prefill on iOS/macOS
- Vision: 0.2-0.3s first token latency (vs 2-5s CPU)
- Speech: 0.1-0.7s transcription latency (vs 1-10s CPU)
Qualcomm Hexagon DSP (Android)
Status: 🚧 Coming Soon (Mar 2026, v1.16+)
Devices:
- Snapdragon 8 Gen 1 and newer
- Snapdragon 7 Gen 1 and newer (select models)
Expected Performance: 5-11x faster prefill on Android flagships
MediaTek APU (Android)
Status: 📅 Planned (Apr 2026)
Devices:
- Dimensity 9000 series
- Dimensity 8000 series
Google Tensor (Android)
Status: 📅 Planned (Mar 2026, v1.16+)
Devices:
- Pixel 6, 7, 8, 9 series (TPU)
From the Cactus README benchmarks table:
With Apple NPU
LFM2.5-VL-1.6B (Vision model, 256px input):
| Device | First Token (NPU) | Decode TPS | RAM |
|---|---|---|---|
| Mac M4 Pro | 0.2s | 98 tok/s | 76MB |
| iPad/Mac M3 | 0.3s | 69 tok/s | 70MB |
| iPhone 17 Pro | 0.3s | 48 tok/s | 108MB |
| iPhone 13 Mini | 0.3s | 35 tok/s | 1GB |
Parakeet-1.1B (Speech model, 30s audio):
| Device | Transcription (NPU) | Decode TPS | RAM |
|---|---|---|---|
| Mac M4 Pro | 0.1s | 900k+ tok/s | 76MB |
| iPad/Mac M3 | 0.3s | 800k+ tok/s | 70MB |
| iPhone 17 Pro | 0.3s | 300k+ tok/s | 108MB |
| iPhone 13 Mini | 0.7s | 90k+ tok/s | 1GB |
LFM 1.2B (Text model, 1k prefill / 100 decode):
| Device | Prefill TPS | Decode TPS | RAM |
|---|---|---|---|
| Mac M4 Pro | 582 tok/s | 100 tok/s | 76MB |
| iPad/Mac M3 | 350 tok/s | 60 tok/s | 70MB |
| iPhone 17 Pro | 327 tok/s | 48 tok/s | 108MB |
| iPhone 13 Mini | 148 tok/s | 34 tok/s | 1GB |
Without NPU (Android)
Note: NPU support coming Mar 2026 for Qualcomm/Google, Apr 2026 for MediaTek.
| Device | LFM 1.2B (prefill/decode TPS) | RAM |
|---|---|---|
| Galaxy S25 Ultra | 255/37 tok/s | 1.5GB |
| Pixel 6a | 70/15 tok/s | 1GB |
| Galaxy A17 5G | 32/10 tok/s | 727MB |
Expected 5-11x prefill speedup once NPU support ships.
How NPU Acceleration Works
Cactus uses a hybrid approach:
- NPU for prefill: encoder and prefill transformer layers run on the NPU
- CPU/SIMD for decode: token-by-token generation uses ARM SIMD kernels
- Zero-copy handoff: seamless transitions between NPU and CPU
Why Hybrid?
- NPU excels at batch processing: prefill processes many tokens at once
- CPU excels at autoregressive decode: single-token generation is memory-bound
- Best of both worlds: 5-11x faster prefill with no decode slowdown
Enabling NPU Backend
Automatic (Default)
NPU is automatically enabled if:
- Device has compatible NPU hardware
- Model supports NPU acceleration
- Cactus runtime includes NPU support
No configuration needed; it just works.
Checking NPU Availability
```cpp
#include "cactus/npu/npu.h"

if (cactus::npu::is_npu_available()) {
    printf("NPU acceleration available\n");
} else {
    printf("NPU not available, using CPU fallback\n");
}
```
Loading NPU Prefill Model
```cpp
#include "cactus/engine/engine.h"

cactus::Engine* engine = /* ... load model ... */;

// Load NPU-optimized prefill weights
if (engine->load_npu_prefill("./model/npu_prefill.mlmodelc")) {
    printf("NPU prefill loaded, chunk size: %zu\n",
           engine->get_prefill_chunk_size());
} else {
    printf("NPU prefill not available, using CPU\n");
}
```
Disabling NPU Backend
Currently, NPU is automatically used when available. Manual disable is not exposed in the public API.
Why? NPU provides significant speedups with no quality loss. Disabling would only hurt performance.
Future: Advanced users may get a config option to force CPU-only mode for debugging.
Configuration Options
Prefill Chunk Size
NPU prefill processes input in chunks (default: 256 tokens).
```cpp
// Get current chunk size
size_t chunk_size = engine->get_prefill_chunk_size();
printf("Prefill chunk size: %zu\n", chunk_size);

// Chunk size is determined by the NPU model architecture
// and is not configurable at runtime
```
Chunk size affects:
- Memory usage: larger chunks use more NPU memory
- Latency: smaller chunks add per-chunk overhead
- Throughput: 256 is optimal for most models
The chunk size is baked into the NPU model file (.mlmodelc) and cannot be changed at runtime. It's optimized for each model during conversion.
KV Cache Window
NPU prefill respects KV cache window settings:
```cpp
// Set a sliding-window cache (e.g., 2048 tokens)
engine->set_cache_window(2048, 4);  // window_size, sink_size

// NPU prefill will process chunks up to the window limit
```
See Performance Tuning for KV cache configuration.
iOS Requirements
- iOS 14.0+ for Neural Engine support
- iOS 15.0+ recommended (better ANE APIs)
- A12 Bionic or newer (iPhone XS, XR, and later)
macOS Requirements
- macOS 11.0+ (Big Sur) for Apple Silicon
- M1 or newer (M1, M2, M3, M4)
- Intel Macs not supported (no Neural Engine)
Android Requirements (Coming Soon)
- Android API 29+ (Android 10) for Hexagon DSP
- Snapdragon 8 Gen 1+ or Dimensity 9000+
- NNAPI or QNN runtime installed
Model Compatibility
NPU-Accelerated Models
From the Supported Models table:
Vision Models:
LiquidAI/LFM2-VL-450M: vision, text & image embed, Apple NPU
LiquidAI/LFM2.5-VL-1.6B: vision, text & image embed, Apple NPU
Speech Models:
openai/whisper-tiny: transcription, speech embed, Apple NPU
openai/whisper-base: transcription, speech embed, Apple NPU
openai/whisper-small: transcription, speech embed, Apple NPU
openai/whisper-medium: transcription, speech embed, Apple NPU
nvidia/parakeet-ctc-0.6b: transcription, speech embed, Apple NPU
nvidia/parakeet-ctc-1.1b: transcription, speech embed, Apple NPU
nvidia/parakeet-tdt-0.6b-v3: transcription, speech embed, Apple NPU
Text Models:
- Most LLMs use NPU for prefill (when available)
- Decode always uses CPU SIMD kernels
Models Without NPU Support
google/gemma-3-*: CPU-only (for now)
Qwen/Qwen3-*: CPU-only (for now)
- Embedding-only models: CPU-only
- VAD models: CPU-only
Custom fine-tuned models inherit NPU support from their base model. If the base model supports NPU, your fine-tune will too β no extra work required!
Troubleshooting
NPU Not Available
Check device compatibility:
```cpp
if (!cactus::npu::is_npu_available()) {
    // Device doesn't have a compatible NPU, or
    // NPU support was not compiled into the runtime
}
```
Common reasons:
- Older device (pre-A12 iPhone, Intel Mac)
- Android device (NPU coming Mar 2026)
- Simulator build (NPU only on physical devices)
Slow Despite NPU
- Check model supports NPU: not all models have NPU variants
- Verify NPU prefill loaded: call load_npu_prefill() explicitly
- Measure properly: use the --benchmark flag for accurate timing
- Thermal throttling: the device may throttle under sustained load
Memory Issues
NPU models use additional memory:
- Vision models: +100-200MB for NPU weights
- Speech models: +50-100MB for NPU weights
- Text models: +20-50MB for NPU prefill cache
If running out of memory:
- Reduce KV cache window size
- Use smaller model variant
- Disable NPU prefill (not recommended)
Roadmap
Q1 2026 (Shipped)
- ✅ Apple Neural Engine support (Jan 2026)
- ✅ Vision model NPU acceleration
- ✅ Speech model NPU acceleration
- ✅ 5-11x prefill speedup on iOS/macOS
Q1-Q2 2026 (Coming)
- 🚧 Qualcomm Hexagon DSP (Mar 2026)
- 🚧 Google Tensor TPU (Mar 2026)
- 📅 MediaTek APU (Apr 2026)
- 📅 Samsung Exynos NPU (Apr 2026)
Q2-Q3 2026 (Planned)
- 📅 Mac GPU acceleration (May 2026)
- 📅 Wearables optimizations (Jul 2026)
- 📅 Additional model NPU support
See Also