The Headroom proxy is a standalone local HTTP server that compresses every LLM request passing through it. Point any existing client at the proxy and get automatic context optimization without touching your application code. It speaks both the Anthropic and OpenAI wire formats and proxies transparently to the upstream provider.

Getting started

1. Install Headroom

pip install "headroom-ai[all]"
Requires Python 3.10+. For the proxy only (lighter install): pip install "headroom-ai[proxy]".

2. Start the proxy

headroom proxy                    # binds to 127.0.0.1:8787
headroom proxy --port 8080        # custom port
headroom proxy --host 0.0.0.0     # listen on all interfaces (add HEADROOM_PROXY_TOKEN for non-loopback)

3. Point your client at the proxy

# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 your-app
No other changes required — the proxy forwards your API key to the upstream provider unchanged.

4. Verify savings

headroom doctor          # health check — confirms routing is working
headroom dashboard       # live savings dashboard in your browser
curl http://localhost:8787/stats

CLI options

Core options

| Option | Default | Env var | Description |
| --- | --- | --- | --- |
| --host | 127.0.0.1 | HEADROOM_HOST | Host to bind to |
| --port | 8787 | HEADROOM_PORT | Port to bind to |
| --workers | 1 | HEADROOM_WORKERS | Number of Uvicorn worker processes |
| --limit-concurrency | 1000 | HEADROOM_LIMIT_CONCURRENCY | Max concurrent connections before Uvicorn returns 503 |
| --max-connections | 500 | HEADROOM_MAX_CONNECTIONS | Maximum upstream HTTP connections |
| --max-keepalive | 100 | HEADROOM_MAX_KEEPALIVE | Maximum upstream keep-alive connections |
| --no-optimize | false | | Disable compression (passthrough mode) |
| --no-cache | false | | Disable semantic caching |
| --no-rate-limit | false | | Disable rate limiting |
| --log-file | None | HEADROOM_LOG_FILE | Path to write request/response logs as JSONL |
| --budget | None | HEADROOM_BUDGET | Budget limit in USD per --budget-period |
| --budget-period | daily | HEADROOM_BUDGET_PERIOD | Period: hourly, daily, or monthly |
| --backend | anthropic | HEADROOM_BACKEND | Backend: anthropic, bedrock, openrouter, anyllm, or litellm-<provider> |
| --stateless | false | HEADROOM_STATELESS | Disable all filesystem writes (for containerized deployments) |
| --telemetry | false | HEADROOM_TELEMETRY | Opt in to anonymous usage telemetry (off by default) |
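
When the proxy is started from a script rather than a shell, the core options map directly onto an argument list. The following is a minimal Python sketch and not part of Headroom: `build_proxy_argv` is a hypothetical helper, and it only emits flags that appear in the table above.

```python
def build_proxy_argv(port=8787, host="127.0.0.1", workers=1,
                     budget=None, budget_period="daily",
                     log_file=None, no_cache=False):
    """Return an argv list for `headroom proxy` (hypothetical helper)."""
    argv = ["headroom", "proxy",
            "--host", host,
            "--port", str(port),
            "--workers", str(workers)]
    if budget is not None:
        # --budget is a USD limit per --budget-period (hourly, daily, or monthly)
        argv += ["--budget", str(budget), "--budget-period", budget_period]
    if log_file is not None:
        # request/response logs are written as JSONL to this path
        argv += ["--log-file", log_file]
    if no_cache:
        argv.append("--no-cache")
    return argv


example_argv = build_proxy_argv(port=8080, budget=100.0, log_file="proxy.jsonl")
```

Passing `example_argv` to `subprocess.run(example_argv, check=True)` would start the proxy on port 8080 with a 100 USD daily budget, assuming the `headroom` executable is on `PATH`.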

Context management

| Option | Default | Env var | Description |
| --- | --- | --- | --- |
| --mode token | default | HEADROOM_MODE | Prioritize token compression; prior turns may be rewritten for maximum savings |
| --mode cache | | HEADROOM_MODE | Freeze prior turns to maximize provider prefix-cache hit rate |
| --target-ratio | None | HEADROOM_TARGET_RATIO | Override Kompress keep-ratio (e.g. 0.4 keeps ~40% of text tokens) |
| --intercept-tool-results | false | | Opt into tool-result interceptors such as ast-grep Read outlining |
| --no-read-lifecycle | false | | Disable stale/superseded Read-output compression |
| --code-aware / --no-code-aware | disabled | HEADROOM_CODE_AWARE_ENABLED | Enable AST-based code compression (requires headroom-ai[code]) |
| --lossless | false | HEADROOM_LOSSLESS | Format-native lossless compaction without CCR markers |
| --no-ccr-inject-tool | false | HEADROOM_NO_CCR_INJECT_TOOL | Don’t inject the headroom_retrieve CCR tool |
| --protect-tool-results | | HEADROOM_PROTECT_TOOL_RESULTS | Comma-separated tool names whose results are never lossy-compressed |
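
To get a feel for what a given --target-ratio means, the keep-ratio can be applied to a token count by hand. This is a back-of-the-envelope sketch only: `estimate_kept_tokens` is a hypothetical helper, the accepted range of the ratio is an assumption, and real compression results will vary with content.

```python
def estimate_kept_tokens(text_tokens: int, target_ratio: float) -> int:
    """Approximate how many text tokens survive at a given keep-ratio."""
    # Assumption: the ratio is a fraction in (0, 1]; the docs only show 0.4.
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    return round(text_tokens * target_ratio)


text_tokens = 12_000
kept = estimate_kept_tokens(text_tokens, 0.4)
removed = text_tokens - kept
print(kept, removed)  # 4800 7200
```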

Custom API targets

| Option | Env var | Description |
| --- | --- | --- |
| --anthropic-api-url | ANTHROPIC_TARGET_API_URL | Custom Anthropic API URL for passthrough endpoints |
| --openai-api-url | OPENAI_TARGET_API_URL | Custom OpenAI API URL for passthrough endpoints |
| --gemini-api-url | GEMINI_TARGET_API_URL | Custom Gemini API URL |
| --bedrock-api-url | BEDROCK_TARGET_API_URL | Bedrock InvokeModel upstream for /model/{id}/invoke passthrough routes |
| --region | HEADROOM_REGION | Cloud region for Bedrock/Vertex (default: us-west-2) |

Memory and learning

| Option | Default | Description |
| --- | --- | --- |
| --memory | false | Enable persistent memory with provider-appropriate tool injection |
| --memory-storage | project | Partitioning: project (per-workspace DB), user, or global |
| --memory-top-k | 10 | Number of semantically relevant memories to inject as context |
| --no-memory-tools | false | Disable automatic memory tool injection |
| --no-memory-context | false | Disable automatic memory context injection |
| --learn | false | Enable live traffic learning (implies --memory) |
| --min-evidence | 5 | Minimum observations before a learned pattern is persisted |
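
The idea behind --memory-top-k can be illustrated with a generic top-k similarity search. This sketch is an assumption about the general technique, not a description of Headroom's internals: the embedding vectors, the cosine scoring, and the `top_k_memories` helper are all hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_memories(query_vec, memories, k=10):
    """memories is a list of (text, embedding) pairs; return the k closest texts."""
    ranked = heapq.nlargest(k, memories, key=lambda m: cosine(query_vec, m[1]))
    return [text for text, _ in ranked]


memories = [
    ("uses pytest", [1.0, 0.0]),
    ("prefers tabs", [0.0, 1.0]),
    ("runs tests in CI", [1.0, 1.0]),
]
print(top_k_memories([1.0, 0.0], memories, k=2))
# ['uses pytest', 'runs tests in CI']
```

A larger k injects more context per request, so it trades recall against the token savings the proxy is trying to achieve.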

Proxy modes: token vs cache

# token mode (default): maximize compression
# Prior turns may be rewritten for the best savings.
headroom proxy --mode token

# cache mode: preserve provider prefix-cache stability
# Prior turns are frozen; only new content is compressed.
headroom proxy --mode cache
Use token when you want maximum token savings. Use cache when you are working with long multi-turn sessions and want to keep Anthropic or OpenAI prefix-cache hit rates high to reduce latency and cost.
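
Why rewriting prior turns hurts prefix caching can be shown with a toy model. The sketch below is illustrative only: it treats each message as one unit, whereas providers cache on token prefixes, and the message contents and `shared_prefix_len` helper are made up for the example.

```python
def shared_prefix_len(previous, current):
    """Count leading messages that are identical in both requests."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


turn_1 = [
    "system",
    "user: read big_file.py",
    "assistant: <3,000-line dump>",
    "user: fix the bug",
]
new_messages = ["assistant: patched", "user: run tests"]

# cache mode: earlier messages are left untouched, new ones are appended
cache_turn_2 = turn_1 + new_messages

# token mode: an earlier message may be rewritten to save tokens
token_turn_2 = turn_1[:2] + ["assistant: <compressed summary>"] + turn_1[3:] + new_messages

print(shared_prefix_len(turn_1, cache_turn_2))  # 4
print(shared_prefix_len(turn_1, token_turn_2))  # 2
```

In the cache-mode request the whole previous conversation is still a valid prefix, while the token-mode request diverges at the rewritten message, so everything after it would need to be reprocessed by the provider.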

Output token reduction

Everything above reduces what you send to the LLM. The proxy can also trim what the model writes back, which matters because output tokens cost up to 5× as much as input tokens on Opus-class models.
export HEADROOM_OUTPUT_SHAPER=1   # off by default
headroom proxy --port 8787
When HEADROOM_OUTPUT_SHAPER=1 is set the proxy:
  • Appends a short verbosity note to the tail of the system prompt (without disturbing your cache prefix).
  • Routes low-effort turns (tool-result continuations, file reads) to reduced thinking effort; new questions and errors keep full effort.
To measure output savings against a held-out control group:
export HEADROOM_OUTPUT_HOLDOUT=0.1   # 10 % of turns are unshaped
headroom dashboard                   # shows "Output Tokens Saved" with confidence band
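
The holdout comparison boils down to comparing average output length between shaped and unshaped turns. The following sketch shows that arithmetic under stated assumptions: the sample token counts are invented, `is_holdout` and `estimate_output_savings` are hypothetical helpers, and how Headroom actually assigns turns or computes its confidence band is not covered here.

```python
import random
from statistics import mean


def is_holdout(rng, holdout_fraction=0.1):
    """Randomly mark a turn as unshaped with the given probability."""
    return rng.random() < holdout_fraction


def estimate_output_savings(shaped_tokens, holdout_tokens):
    """Return (tokens saved per turn, fraction saved) relative to the holdout baseline."""
    baseline = mean(holdout_tokens)
    saved = baseline - mean(shaped_tokens)
    return saved, saved / baseline


rng = random.Random(0)  # seeded so the assignment is reproducible
assignments = [is_holdout(rng) for _ in range(1000)]

saved, fraction = estimate_output_savings(
    shaped_tokens=[300, 500],   # made-up output token counts for shaped turns
    holdout_tokens=[500, 700],  # made-up counts for unshaped control turns
)
print(f"{saved:.0f} tokens/turn ({fraction:.0%})")  # 200 tokens/turn (33%)
```

With only 10% of turns in the control group, the estimate needs a reasonable number of turns before it stabilizes, which is presumably why the dashboard reports a confidence band.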

Pointing clients at the proxy

# Claude Code (Anthropic format)
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# OpenAI SDK or any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 python my_app.py

# Codex
OPENAI_BASE_URL=http://localhost:8787/v1 codex
Your existing API keys are forwarded to the upstream provider unchanged. The proxy never stores or modifies credentials.
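
The same environment variables can be set from Python when launching a client as a subprocess. This is a minimal sketch: `proxy_env` is a hypothetical helper, and it only sets the two base-URL variables shown above while leaving existing API keys in the environment untouched.

```python
import os


def proxy_env(host="localhost", port=8787, base_env=None):
    """Return a copy of the environment with both base URLs pointing at the proxy."""
    env = dict(os.environ if base_env is None else base_env)
    root = f"http://{host}:{port}"
    env["ANTHROPIC_BASE_URL"] = root          # Anthropic-format clients
    env["OPENAI_BASE_URL"] = f"{root}/v1"     # OpenAI-compatible clients
    return env
```

For example, `subprocess.run(["python", "my_app.py"], env=proxy_env(), check=True)` would run the app against a proxy on the default port.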

tokensave binary trust model

headroom wrap registers tokensave (a local code-graph MCP server) as the default coding-task compressor. tokensave ships as a prebuilt Rust binary, and every supported release asset is pinned to a SHA-256 digest in Headroom (headroom/graph/tokensave_installer.py). The downloaded bytes are verified against that digest before extraction. On a mismatch the install aborts and Headroom falls back to the Serena backup rather than running unverified code.
# Never reach the network — use an already-installed tokensave or fall back to Serena
HEADROOM_BINARIES_OFFLINE=1 headroom wrap claude

# Override the pinned release tag (requires HEADROOM_TOKENSAVE_ALLOW_UNVERIFIED=1)
HEADROOM_TOKENSAVE_ALLOW_UNVERIFIED=1 HEADROOM_TOKENSAVE_VERSION=v1.2.3 headroom wrap claude

# Skip the primary compressor entirely
headroom wrap claude --no-tokensave

# Force the Serena backup
headroom wrap claude --serena
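
The verify-before-extract pattern described in this section looks roughly like the following. This is a generic sketch of digest pinning, not the contents of headroom/graph/tokensave_installer.py: `verify_sha256` and the sample payload are hypothetical.

```python
import hashlib
import hmac


def verify_sha256(data: bytes, pinned_hex: str) -> None:
    """Raise ValueError unless the SHA-256 of data matches the pinned digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking how many leading characters matched
    if not hmac.compare_digest(actual, pinned_hex.lower()):
        raise ValueError(f"SHA-256 mismatch: expected {pinned_hex}, got {actual}")


payload = b"example release asset"               # stand-in for downloaded bytes
pinned = hashlib.sha256(payload).hexdigest()     # stand-in for a pinned digest
verify_sha256(payload, pinned)                   # passes silently; safe to extract
```

The important property is ordering: the check runs on the raw downloaded bytes, so nothing from an unverified archive is unpacked or executed.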

Stats and dashboard

API endpoints

| Endpoint | Description |
| --- | --- |
| GET /health | Aggregate health and current session savings |
| GET /livez | Process liveness check |
| GET /readyz | Traffic readiness check |
| GET /stats | Live session stats plus durable persistent_savings totals |
| GET /stats-history | Hourly, daily, weekly, and monthly savings rollups |
| GET /metrics | Prometheus-format metrics |
# Live stats
curl http://localhost:8787/stats

# Historical rollups (also supports ?format=csv&series=weekly)
curl "http://localhost:8787/stats-history"

# Prometheus scrape
curl http://localhost:8787/metrics
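
The same endpoints can be queried from Python with the standard library. This sketch assumes that /stats returns a JSON body, which the endpoint table does not state explicitly; `endpoint_url` and `fetch_stats` are hypothetical helpers.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def endpoint_url(path, host="localhost", port=8787, **query):
    """Build a proxy URL, optionally with query parameters."""
    url = f"http://{host}:{port}/{path.lstrip('/')}"
    return f"{url}?{urlencode(query)}" if query else url


def fetch_stats(host="localhost", port=8787, timeout=5.0, opener=urlopen):
    """GET /stats and decode the body (assumed to be JSON)."""
    with opener(endpoint_url("/stats", host, port), timeout=timeout) as resp:
        return json.load(resp)


# URL for the weekly CSV rollup mentioned in the curl comment above
weekly_csv_url = endpoint_url("stats-history", format="csv", series="weekly")
```

Calling `fetch_stats()` requires a running proxy; the `opener` parameter exists so the network call can be replaced in tests.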

Dashboard

headroom dashboard               # opens http://localhost:8787/dashboard in your browser
headroom dashboard --no-open     # print the URL instead
headroom dashboard --port 8080   # if the proxy is on a non-default port
The dashboard shows input token savings, output token savings (if HEADROOM_OUTPUT_SHAPER=1), model usage breakdown, and dollar savings (requires Python 3.13 + LiteLLM).

Production deployment

pip install gunicorn

gunicorn headroom.proxy.server:app \
  --workers 4 \
  --bind 0.0.0.0:8787 \
  --worker-class uvicorn.workers.UvicornWorker
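
The same settings can live in a Gunicorn configuration file, which is plain Python. This is an equivalent of the flags above and adds nothing new; the file name gunicorn.conf.py is Gunicorn's conventional default.

```python
# gunicorn.conf.py: mirrors the command-line flags shown above
bind = "0.0.0.0:8787"
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"
```

With this file in the working directory, the server can be started with `gunicorn -c gunicorn.conf.py headroom.proxy.server:app`.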

Environment variable reference

export HEADROOM_HOST=0.0.0.0
export HEADROOM_PORT=8787
export HEADROOM_MODE=token
export HEADROOM_BUDGET=100.0
export HEADROOM_BUDGET_PERIOD=daily
export HEADROOM_OUTPUT_SHAPER=1

# Custom upstream targets
export ANTHROPIC_TARGET_API_URL=https://litellm.company.internal
export OPENAI_TARGET_API_URL=https://custom.openai.endpoint.com

headroom proxy
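
A deployment script may want to check which values the proxy will see before launching it. The sketch below reads a few of the variables above and falls back to the defaults listed in the CLI options tables. `ProxySettings` and `load_settings` are hypothetical names, and the sketch makes no claim about how Headroom itself resolves precedence between flags and environment variables.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProxySettings:
    host: str
    port: int
    mode: str
    budget: Optional[float]
    budget_period: str


def load_settings(env=None):
    """Read HEADROOM_* variables, using the documented defaults when unset."""
    env = os.environ if env is None else env
    budget = env.get("HEADROOM_BUDGET")
    return ProxySettings(
        host=env.get("HEADROOM_HOST", "127.0.0.1"),
        port=int(env.get("HEADROOM_PORT", "8787")),
        mode=env.get("HEADROOM_MODE", "token"),
        budget=float(budget) if budget else None,  # None means no budget limit
        budget_period=env.get("HEADROOM_BUDGET_PERIOD", "daily"),
    )
```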
