headroom proxy starts the Headroom optimization proxy — an HTTP server that sits between your AI agent and the upstream LLM provider. Every request that passes through it has its message context compressed before being forwarded, reducing token spend and latency with no changes to your agent code.
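As a rough illustration, the proxy can be started from a script by setting the environment variables documented on this page before invoking headroom proxy. The sketch below is an assumption-laden example, not part of Headroom: the helper names build_proxy_env and start_proxy are hypothetical, and it assumes the headroom executable is on PATH.

```python
import os
import subprocess


def build_proxy_env(host, port, mode="token", base_env=None):
    """Return an environment dict for `headroom proxy` (hypothetical helper).

    Only env vars named on this page are set: HEADROOM_HOST, HEADROOM_PORT,
    HEADROOM_MODE.
    """
    # The page documents the valid port range as 1-65535.
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    # The proxy also accepts legacy aliases; this sketch allows only the
    # two current mode names.
    if mode not in ("token", "cache"):
        raise ValueError(f"unknown mode: {mode}")
    env = dict(os.environ if base_env is None else base_env)
    env["HEADROOM_HOST"] = host
    env["HEADROOM_PORT"] = str(port)
    env["HEADROOM_MODE"] = mode
    return env


def start_proxy(env):
    """Launch the proxy as a child process (assumes `headroom` is on PATH)."""
    return subprocess.Popen(["headroom", "proxy"], env=env)
```

For example, start_proxy(build_proxy_env("0.0.0.0", 9000, mode="cache")) would bind on all interfaces in cache mode; the port 9000 is an arbitrary value chosen for illustration.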
For one-command setup that starts the proxy and launches an agent, use headroom wrap <tool> instead.
Core options
Host to bind the server to. Set to 0.0.0.0 for Docker/remote access. Env: HEADROOM_HOST.
Port to bind to (1–65535). Env: HEADROOM_PORT.
Number of Uvicorn worker processes. Increase for high-concurrency deployments. Env: HEADROOM_WORKERS.
Optimization mode. token (default) prioritizes maximum token reduction; prior turns may be rewritten. cache freezes prior turns to maximize provider prefix-cache hit rates. Legacy aliases token_mode, cache_mode, token_savings, cost_savings, and token_headroom are still accepted. Env: HEADROOM_MODE.
Disable all compression and run as a passthrough proxy only. Useful for debugging or measuring baseline usage.
Disable the semantic response cache.
Override the Kompress keep-ratio for prose/code compression. Lower is more aggressive (e.g. 0.4 keeps ~40% of tokens). Unset by default, in which case Kompress decides via its own importance threshold. Env: HEADROOM_TARGET_RATIO.
Logging
Path to write request/response logs as JSONL. Each line is a JSON object with fields: timestamp, request_id, model, tokens_before, tokens_after, latency_ms, etc. Disabled in --stateless mode. Env: HEADROOM_LOG_FILE.
Enable full message logging: request/response content is stored in the log file. Warning: may log sensitive data. Env: HEADROOM_LOG_MESSAGES.
Budget & rate limiting
Spend cap in USD per --budget-period. Requests are rejected with HTTP 429 once the limit is reached. Env: HEADROOM_BUDGET.
Period the --budget limit applies to. hourly resets on a rolling hour; daily at local midnight; monthly on the 1st. Env: HEADROOM_BUDGET_PERIOD.
Disable per-minute rate limiting entirely.
Max requests per minute. Default: 60. Has no effect with --no-rate-limit. Env: HEADROOM_RPM.
Max tokens per minute. Default: 100,000. Has no effect with --no-rate-limit. Env: HEADROOM_TPM.
Upstream provider routing
Custom OpenAI API URL for passthrough endpoints. Env: OPENAI_TARGET_API_URL.
Custom Anthropic API URL for passthrough endpoints. Env: ANTHROPIC_TARGET_API_URL.
Custom Gemini API URL for passthrough endpoints. Env: GEMINI_TARGET_API_URL.
API backend: anthropic (direct), bedrock (AWS), openrouter, anyllm, or litellm-<provider> (e.g. litellm-vertex). Env: HEADROOM_BACKEND.
Custom Bedrock InvokeModel upstream. Point at a re-signing gateway, not raw AWS. Env: BEDROCK_TARGET_API_URL.
Cloud region for Bedrock/Vertex/etc. Env: HEADROOM_REGION.
CCR (Compress-Cache-Retrieve)
Don’t inject the headroom_retrieve tool into requests. Use for streaming or non-MCP clients that can’t resolve the retrieve tool. Env: HEADROOM_NO_CCR_INJECT_TOOL.
Don’t add CCR retrieval markers to compressed content. Env: HEADROOM_NO_CCR_MARKER.
No-CCR lossless mode: compress tool outputs with format-native lossless compaction without emitting any CCR retrieval marker or needing the MCP retrieve tool. Env: HEADROOM_LOSSLESS=1.
Opt in to tool-result interceptors (AST-grep Read outliner, etc.). Off by default. Requires headroom-ai[tools] extras.
Kompress & tool protection
Disable Kompress ML compression while keeping structural compression (ToolCrusher, SmartCrusher, CacheAligner) active. Env: HEADROOM_DISABLE_KOMPRESS=1.
Comma-separated tool names whose results are never lossy-compressed. Merged with built-in defaults (e.g. Bash,WebFetch). Env: HEADROOM_PROTECT_TOOL_RESULTS.
Disable proactive expansion of previously compressed content. Env: HEADROOM_NO_CCR_PROACTIVE_EXPANSION.
Read lifecycle
Disable Read lifecycle management. By default the proxy compresses stale and superseded Read tool outputs to reclaim context.
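To make "superseded" concrete, the sketch below shows one plausible reading: an earlier Read of a file is superseded once the same path is read again later in the conversation. This is an illustrative assumption only, not Headroom's actual logic; the ReadOutput type, its fields, and split_superseded are hypothetical names, and staleness is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadOutput:
    """One Read tool result (hypothetical shape for illustration)."""
    turn: int   # position of the tool call in the conversation
    path: str   # file that was read
    text: str   # content returned by the tool


def split_superseded(reads):
    """Split reads into (current, superseded) lists.

    Assumption: a read is superseded when a later turn reads the same path.
    """
    latest_turn = {}
    for r in reads:
        latest_turn[r.path] = max(r.turn, latest_turn.get(r.path, r.turn))
    current = [r for r in reads if r.turn == latest_turn[r.path]]
    superseded = [r for r in reads if r.turn < latest_turn[r.path]]
    return current, superseded
```

Under this reading, only the superseded outputs would be candidates for compression, while the most recent read of each file stays intact.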
Code-aware compression
Enable or disable AST-based code compression. Requires pip install headroom-ai[code]. Default: disabled. Env: HEADROOM_CODE_AWARE_ENABLED=1.
Index the current working directory and watch for file changes via codebase-memory-mcp. Useful when the proxy is started from a project root.
Memory
Enable persistent memory. Auto-detects the provider (Anthropic, OpenAI, Gemini) and uses the appropriate tool format. By default each workspace gets its own SQLite database.
Memory partitioning strategy. project (default): one DB per resolved workspace. user: one DB per x-headroom-user-id. global: a single shared DB (pre-existing behavior).
Number of semantically relevant memories to inject as context (1–100). Env: HEADROOM_MEMORY_TOP_K.
Disable automatic injection of memory_save/memory_search tools. Env: HEADROOM_NO_MEMORY_TOOLS.
Disable automatic injection of relevant past memories into the system prompt. Env: HEADROOM_NO_MEMORY_CONTEXT.
Traffic learning
Enable live traffic learning: extract error→recovery patterns, environment facts, and user preferences from proxy traffic. Implies --memory. Learned patterns are saved to agent-native memory files (CLAUDE.md, .cursor/rules, AGENTS.md).
Explicitly disable traffic learning even when --memory is set.
Minimum number of times a pattern must be observed before it is persisted. Default: 5. Higher values reduce one-shot noise. Env: HEADROOM_MIN_EVIDENCE.
Connection tuning
Maximum concurrent connections before Uvicorn returns 503. Env: HEADROOM_LIMIT_CONCURRENCY.
Maximum upstream HTTP connections. Env: HEADROOM_MAX_CONNECTIONS.
Maximum upstream keep-alive connections. Env: HEADROOM_MAX_KEEPALIVE.
Request timeout in seconds. Useful for slow local providers. Env: HEADROOM_REQUEST_TIMEOUT.
Maximum upstream retry attempts for connect/read/5xx failures (1–10). Env: HEADROOM_RETRY_MAX_ATTEMPTS.
Deployment & security
Opt in to anonymous usage telemetry (off by default). Env: HEADROOM_TELEMETRY=on.
Explicitly disable telemetry. Env: HEADROOM_TELEMETRY=off.
Disable all filesystem writes and run purely in-memory. For containerized, read-only, or load-balanced deployments. Memory, TOIN, and log file persistence are all disabled. Env: HEADROOM_STATELESS=true.
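When the proxy is not stateless and a JSONL log file is configured (see Logging), the per-request tokens_before and tokens_after fields can be aggregated to estimate overall savings. The following is a minimal sketch under the assumption that both fields are integers on every line; summarize_log is a hypothetical helper, not a Headroom API.

```python
import json


def summarize_log(lines):
    """Aggregate token counts from an iterable of JSONL log lines.

    Assumes each non-empty line is a JSON object carrying integer
    tokens_before and tokens_after fields, as listed under Logging.
    """
    requests = 0
    before = 0
    after = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        requests += 1
        before += record["tokens_before"]
        after += record["tokens_after"]
    saved = before - after
    return {
        "requests": requests,
        "tokens_before": before,
        "tokens_after": after,
        "tokens_saved": saved,
        # Fraction of input tokens removed; 0.0 when the log is empty.
        "savings_ratio": saved / before if before else 0.0,
    }
```

Because the function accepts any iterable of strings, an open file handle for the path given in HEADROOM_LOG_FILE can be passed directly.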