Headroom sits between your application and the LLM provider. It intercepts outgoing messages, compresses them through a deterministic pipeline, and forwards the optimized request. The provider’s response comes back unchanged. Every entry point — SDK wrapper, proxy, and framework integrations — feeds into the same canonical pipeline with the same lifecycle guarantees.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/headroomlabs-ai/headroom/llms.txt
Use this file to discover all available pages before exploring further.
Full Pipeline Diagram
Three Entry Points
All three entry points share the same pipeline and lifecycle. Choose based on how much control you want and how many code changes you’re willing to make.SDK Mode
Wrap your LLM client with
HeadroomClient. Minimal code changes — swap the client constructor.Proxy Mode
Run
headroom proxy --port 8787 and point your existing client at localhost:8787. Zero code changes — just update the base URL.Integrations
Drop-in adapters for LangChain, Vercel AI SDK, Agno, LiteLLM, and ASGI middleware. Framework-specific setup, same pipeline underneath.
The Transform Pipeline
Headroom exposes one stable request lifecycle acrosscompress(), the SDK, and the proxy. Every request transitions through these stages in order:
PipelineStage enum values in headroom/pipeline.py:
| Stage | What happens |
|---|---|
SETUP | Pipeline initialization, provider detection |
PRE_START | Pre-flight checks, rate limiting |
POST_START | Session context loaded |
INPUT_RECEIVED | Raw messages ingested; extension hooks may rewrite them |
INPUT_CACHED | Provider KV cache state checked; frozen prefix identified |
INPUT_ROUTED | ContentRouter assigns a compressor to each message block |
INPUT_COMPRESSED | Compressors run; CCR originals cached; metrics captured |
INPUT_REMEMBERED | Cross-agent memory consulted; SharedContext updated |
PRE_SEND | Final message array assembled; provider headers added |
POST_SEND | Request sent to provider; streaming begins |
RESPONSE_RECEIVED | Response intercepted; CCR tool calls handled |
Stage 1: CacheAligner (prefix stabilization)
CacheAligner runs first, before any compression, so it can measure the stable prefix before it’s potentially altered. It detects volatile tokens (UUIDs, ISO 8601 timestamps, JWTs, hex hashes) in system messages and emits warnings. The prompt itself is never modified.
Stage 2: SmartCrusher / ContentRouter (content compression)
ContentRouter dispatches each message block to the optimal compressor (see How Compression Works). This is where the bulk of token savings happen:
- SmartCrusher for JSON arrays: field-level statistical analysis, Kneedle algorithm, anomaly/error preservation.
- CodeAwareCompressor for source code: AST parsing via tree-sitter, syntax-valid output.
- LogCompressor for build/test logs: pattern clustering, error line preservation.
- Kompress-v2-base for plain text: ONNX INT8 ModernBERT inference.
Provider-Specific Behavior
Provider and tool-specific behavior is isolated underheadroom/providers/ so the core orchestration stays focused on lifecycle and policy.
client.py, cli/proxy.py, proxy/server.py) delegate entirely to provider slices.
Pipeline Extensions
Headroom exposes two extension seams for customizing the lifecycle without forking core code.on_pipeline_event() — PipelineExtension protocol
Register an extension that receives every PipelineEvent as it fires:
headroom.pipeline_extension) or passed directly:
CompressionHooks — additional extension seam
CompressionHooks sits alongside the canonical lifecycle as a complementary seam. It provides pre_compress, post_compress, and compute_biases callbacks for finer-grained control over the compression step itself:
Proxy Architecture
The proxy is a FastAPI + Uvicorn server that acts as a drop-in OpenAI/Anthropic-compatible endpoint. Upstream requests are forwarded via an httpx async client.What Headroom Does Not Touch
- User messages — never compressed by default (
compress_user_messages=False) - System prompt content — preserved exactly; CacheAligner only emits warnings
- Code — passes through unchanged unless
CodeAwareCompressoris explicitly enabled (pip install "headroom-ai[code]") - Short content — tool outputs under
min_tokens_to_compress(default: 250 tokens) pass through unchanged - Model responses — returned to your client exactly as received from the provider
- Excluded tools —
Read,Glob,Grep,Write, andEdittool outputs are excluded from compression by default (they contain exact file content needed for edit workflows)