LLM providers cache the attention states (KV cache) computed for your prompt prefix, and re-using that cache can cut your per-request cost by 50–90%. If the prefix is byte-identical across requests, the provider skips recomputation and charges you at a steep discount — up to 90% off on Anthropic. But a single dynamic token anywhere in the prefix (a timestamp, a session UUID, a JWT) invalidates the entire cache entry and resets your bill to full price. Headroom’s CacheAligner detects those volatile tokens and surfaces them as warnings so you know when your cache prefix is unstable. When combined with the provider-specific CacheOptimizer classes, Headroom can also insert explicit cache breakpoints and manage CachedContent lifecycles on your behalf.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/headroomlabs-ai/headroom/llms.txt
Use this file to discover all available pages before exploring further.
Why Prefix Stability Matters
Anthropic
90% discount on cached input tokens. Cache write costs 25% extra on first write; subsequent reads are 10× cheaper than regular input.
OpenAI
50% discount via automatic prefix caching. No API markers needed — just keep the prefix byte-identical across requests (min 1,024 tokens).
75% discount via the explicit CachedContent API. Minimum cache size: 32,768 tokens.
"Current Date: 2025-06-15" in your system prompt changes every day. Every day is a full-price request for every user — even if the rest of your 50,000-token system prompt is identical.
CacheAligner: Dynamic Content Detection
CacheAligner is a detector-only transform. It scans system messages for volatile tokens and emits warnings — it never mutates the prompt (mutating the cached prefix would immediately bust the cache it’s trying to protect).
DynamicContentDetector patterns
Whenuse_dynamic_detector=True (default), the detector uses 15+ structural patterns organized into detection tiers:
- regex (default, ~0ms)
- ner (optional, ~5–10ms)
- semantic (optional, ~20–50ms)
Fast structural / universal patterns that catch the most common sources of cache instability:
| Pattern | Example |
|---|---|
| UUIDs (RFC 4122) | 550e8400-e29b-41d4-a716-446655440000 |
| API keys / tokens | sk-abc123..., api_key_xyz... |
| JWT tokens | eyJhbGciOiJIUzI1NiIs... (3 dot-separated base64url segments) |
| Unix timestamps | 1705312847 |
| Request / trace IDs | req_abc123, trace_xyz789 |
| Hex hashes | MD5 (32 chars), SHA1 (40 chars), SHA256 (64 chars) |
| ISO 8601 dates | 2025-06-15, 2025-06-15T14:30:00Z |
| Version numbers | v1.2.3, v2.0.0-beta |
| Structural labels | "Session: abc123", "User: john@example.com" |
| High-entropy strings | Random-looking alphanumeric sequences above entropy_threshold |
CacheAlignerConfig
Full configuration reference fromheadroom/config.py:
Provider-Specific Cache Optimizers
CacheOptimizerConfig complements CacheAligner. While CacheAligner detects volatile tokens, the CacheOptimizer classes apply the provider’s native caching mechanism to your stable prefix.
- Anthropic
- OpenAI
- Google
AnthropicCacheOptimizer inserts explicit
No API changes needed on your side — the optimizer injects the breakpoints automatically before forwarding the request.
cache_control breakpoints at the right positions in your messages so stable prefixes (system prompt, early conversation turns) are cached across requests.| Metric | Value |
|---|---|
| Cache read discount | 90% off input price |
| Cache write cost | 25% premium on first write |
| Cache TTL | 5 minutes (extended on each hit) |
--mode cache Proxy Flag
The proxy’s --mode cache flag maximizes prefix hit rates for long-running agentic sessions by freezing prior turns. Once a turn’s messages have been sent to the provider and cached, they are treated as immutable — the compression pipeline skips them on subsequent requests.
PrefixFreezeConfig internally:
When Cache Optimization Pays Off
Cache optimization has the largest impact when:- The same system prompt is reused across many requests — long system prompts with tool definitions, instructions, or RAG content that doesn’t change between turns.
- Conversations are long — the prefix grows with each turn, and re-caching it every request becomes increasingly expensive.
- You use expensive models — Anthropic’s 90% read discount on Opus-class models translates directly to dollar savings.