headroom proxy — Start the Compression Proxy Server

headroom proxy starts the Headroom optimization proxy, an HTTP server that sits between your AI agent and the upstream LLM provider. It compresses the message context of every request before forwarding it, which reduces token spend and latency. Your agent code stays unchanged; you only point its base URL at the proxy.

# Start on the default port 8787
headroom proxy

# OpenAI-compatible clients
OPENAI_BASE_URL=http://localhost:8787/v1 your-app

# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

For a one-command setup that starts the proxy and launches an agent, use headroom wrap <tool> instead.

Core options

--host

string

default:"127.0.0.1"

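Host to bind the server to. Set to 0.0.0.0 for Docker/remote access, and set HEADROOM_PROXY_TOKEN when you do, because the /v1/* endpoints are otherwise unauthenticated. Env: HEADROOM_HOST.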

--port / -p

integer

default:"8787"

Port to bind to (1–65535). Env: HEADROOM_PORT.
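Clients have to target whatever address the proxy is bound to, so the base URLs from the quick start change along with --host and --port. The following is a minimal sketch of deriving them in POSIX shell. The variable names proxy_host and proxy_port are illustrative only, and the URL shapes are copied from the quick-start lines at the top of this page.

```shell
# Illustrative values; match them to the --host/--port the proxy was started with.
proxy_host=127.0.0.1
proxy_port=9000

# OpenAI-compatible clients use the /v1 prefix; Claude Code uses the bare origin.
OPENAI_BASE_URL="http://${proxy_host}:${proxy_port}/v1"
ANTHROPIC_BASE_URL="http://${proxy_host}:${proxy_port}"
export OPENAI_BASE_URL ANTHROPIC_BASE_URL

echo "$OPENAI_BASE_URL"
echo "$ANTHROPIC_BASE_URL"
```

Any client started from the same shell afterwards inherits the exported variables.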

--workers

integer

default:"1"

Number of Uvicorn worker processes. Increase for high-concurrency deployments. Env: HEADROOM_WORKERS.

--mode

token | cache

Optimization mode. token (default) prioritizes maximum token reduction; prior turns may be rewritten. cache freezes prior turns to maximize provider prefix-cache hit rates. Legacy aliases token_mode, cache_mode, token_savings, cost_savings, and token_headroom are still accepted. Env: HEADROOM_MODE.

--no-optimize

flag

Disable all compression — run as a passthrough proxy only. Useful for debugging or measuring baseline usage.

--no-cache

flag

Disable the semantic response cache.

--target-ratio

float

Override the Kompress keep-ratio for prose/code compression. Lower values compress more aggressively (e.g. 0.4 keeps about 40% of tokens). Unset by default, in which case Kompress decides via its own importance threshold. Env: HEADROOM_TARGET_RATIO.
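To get a feel for what a given keep-ratio means, the arithmetic can be worked through by hand. The sketch below assumes a hypothetical 10,000-token input and treats the ratio as exact; in practice the figure is approximate and applies only to the prose/code content that Kompress handles.

```shell
# Hypothetical input size. The ratio is written as a percentage so that
# plain integer shell arithmetic is enough.
tokens_before=10000
keep_percent=40   # corresponds to --target-ratio 0.4

tokens_kept=$(( tokens_before * keep_percent / 100 ))
tokens_removed=$(( tokens_before - tokens_kept ))

# prints: kept=4000 removed=6000
echo "kept=$tokens_kept removed=$tokens_removed"
```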

Logging

--log-file

string

Path to write request/response logs as JSONL. Each line is a JSON object with fields: timestamp, request_id, model, tokens_before, tokens_after, latency_ms, etc. Disabled in --stateless mode. Env: HEADROOM_LOG_FILE.

--log-messages

flag

Enable full message logging — request/response content is stored in the log file. Warning: may log sensitive data. Env: HEADROOM_LOG_MESSAGES.
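Because the log is JSONL, it can be summarized with ordinary line-oriented tools. The sketch below totals tokens_before and tokens_after with awk. The two sample records are invented for illustration, and the exact value formats (numeric token counts, a space after each colon) are assumptions about the log layout; only the field names come from the description above. A real JSON parser is the more robust choice if one is available.

```shell
# Write two made-up log records to a temporary file (sample data, not real output).
log=$(mktemp)
cat > "$log" <<'EOF'
{"timestamp": "2025-01-01T00:00:00Z", "request_id": "req-1", "model": "example-model", "tokens_before": 1000, "tokens_after": 400, "latency_ms": 120}
{"timestamp": "2025-01-01T00:00:05Z", "request_id": "req-2", "model": "example-model", "tokens_before": 3000, "tokens_after": 1600, "latency_ms": 180}
EOF

summary=$(awk '
# Extract the integer value of a named field from the current line (0 if absent).
function field(name,    pat, s) {
    pat = "\"" name "\": *[0-9]+"
    if (match($0, pat) == 0) return 0
    s = substr($0, RSTART, RLENGTH)
    sub(/^.*: */, "", s)
    return s + 0
}
{ before += field("tokens_before"); after += field("tokens_after") }
END { printf "before=%d after=%d saved=%d", before, after, before - after }
' "$log")

# prints: before=4000 after=2000 saved=2000
echo "$summary"
rm -f "$log"
```

To run this against a real log, point the awk command at the file given to --log-file instead of the temporary file.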

Budget & rate limiting

--budget

float

Spend cap in USD per --budget-period. Requests are rejected with HTTP 429 once the limit is reached. Env: HEADROOM_BUDGET.

--budget-period

hourly | daily | monthly

default:"daily"

Period the --budget limit applies to. hourly resets on a rolling hour; daily at local midnight; monthly on the 1st. Env: HEADROOM_BUDGET_PERIOD.

--no-rate-limit

flag

Disable per-minute rate limiting entirely.

--rpm

integer

Max requests per minute. Default: 60. Has no effect with --no-rate-limit. Env: HEADROOM_RPM.

--tpm

integer

Max tokens per minute. Default: 100,000. Has no effect with --no-rate-limit. Env: HEADROOM_TPM.
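A client that wants to stay under these limits can pace itself. The sketch below works out the spacing and average request size implied by the default values of 60 requests and 100,000 tokens per minute. It is client-side arithmetic only; how the proxy windows or enforces the limits is not described here and is not assumed.

```shell
rpm=60        # default --rpm
tpm=100000    # default --tpm

# Evenly spaced requests stay within rpm if they are at least this far apart.
min_interval_ms=$(( 60000 / rpm ))

# Average tokens each request can use if the client sends the full rpm (rounded down).
avg_tokens_per_request=$(( tpm / rpm ))

# prints: interval_ms=1000 avg_tokens=1666
echo "interval_ms=$min_interval_ms avg_tokens=$avg_tokens_per_request"
```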

Upstream provider routing

--openai-api-url

string

Custom OpenAI API URL for passthrough endpoints. Env: OPENAI_TARGET_API_URL.

--anthropic-api-url

string

Custom Anthropic API URL for passthrough endpoints. Env: ANTHROPIC_TARGET_API_URL.

--gemini-api-url

string

Custom Gemini API URL for passthrough endpoints. Env: GEMINI_TARGET_API_URL.

--backend

string

default:"anthropic"

API backend: anthropic (direct), bedrock (AWS), openrouter, anyllm, or litellm-<provider> (e.g. litellm-vertex). Env: HEADROOM_BACKEND.

--bedrock-api-url

string

Custom Bedrock InvokeModel upstream. Point at a re-signing gateway, not raw AWS. Env: BEDROCK_TARGET_API_URL.

--region

string

default:"us-west-2"

Cloud region for Bedrock/Vertex/etc. Env: HEADROOM_REGION.

CCR (Compress-Cache-Retrieve)

--no-ccr-inject-tool

flag

Don’t inject the headroom_retrieve tool into requests. Use for streaming or non-MCP clients that can’t resolve the retrieve tool. Env: HEADROOM_NO_CCR_INJECT_TOOL.

--no-ccr-marker

flag

Don’t add CCR retrieval markers to compressed content. Env: HEADROOM_NO_CCR_MARKER.

--lossless

flag

No-CCR lossless mode: compress tool outputs with format-native lossless compaction without emitting any CCR retrieval marker or needing the MCP retrieve tool. Env: HEADROOM_LOSSLESS=1.

--intercept-tool-results

flag

Opt in to tool-result interceptors (AST-grep Read outliner, etc.). Off by default. Requires headroom-ai[tools] extras.

Kompress & tool protection

--disable-kompress

flag

Disable Kompress ML compression while keeping structural compression (ToolCrusher, SmartCrusher, CacheAligner) active. Env: HEADROOM_DISABLE_KOMPRESS=1.

--protect-tool-results

string

Comma-separated tool names whose results are never lossy-compressed. Merged with built-in defaults (e.g. Bash,WebFetch). Env: HEADROOM_PROTECT_TOOL_RESULTS.

--no-ccr-proactive-expansion

flag

Disable proactive expansion of previously compressed content. Env: HEADROOM_NO_CCR_PROACTIVE_EXPANSION.

Read lifecycle

--no-read-lifecycle

flag

Disable Read lifecycle management. By default the proxy compresses stale and superseded Read tool outputs to reclaim context.

Code-aware compression

--code-aware / --no-code-aware

flag

Enable or disable AST-based code compression. Requires pip install headroom-ai[code]. Default: disabled. Env: HEADROOM_CODE_AWARE_ENABLED=1.

--code-graph

flag

Index the current working directory and watch for file changes via codebase-memory-mcp. Useful when the proxy is started from a project root.

Memory

--memory

flag

Enable persistent memory. Auto-detects the provider (Anthropic, OpenAI, Gemini) and uses the appropriate tool format. By default each workspace gets its own SQLite database.

--memory-storage

project | user | global

default:"project"

Memory partitioning strategy. project (default): one DB per resolved workspace. user: one DB per x-headroom-user-id. global: a single shared DB (pre-existing behavior).

--memory-top-k

integer

default:"10"

Number of semantically relevant memories to inject as context (1–100). Env: HEADROOM_MEMORY_TOP_K.

--no-memory-tools

flag

Disable automatic injection of memory_save/memory_search tools. Env: HEADROOM_NO_MEMORY_TOOLS.

--no-memory-context

flag

Disable automatic injection of relevant past memories into the system prompt. Env: HEADROOM_NO_MEMORY_CONTEXT.

Traffic learning

--learn

flag

Enable live traffic learning: extract error→recovery patterns, environment facts, and user preferences from proxy traffic. Implies --memory. Learned patterns are saved to agent-native memory files (CLAUDE.md, .cursor/rules, AGENTS.md).

--no-learn

flag

Explicitly disable traffic learning even when --memory is set.

--min-evidence

integer

Minimum number of times a pattern must be observed before it is persisted. Default: 5. Higher values reduce one-shot noise. Env: HEADROOM_MIN_EVIDENCE.

Connection tuning

--limit-concurrency

integer

default:"1000"

Maximum concurrent connections before Uvicorn returns 503. Env: HEADROOM_LIMIT_CONCURRENCY.

--max-connections

integer

default:"500"

Maximum upstream HTTP connections. Env: HEADROOM_MAX_CONNECTIONS.

--max-keepalive

integer

default:"100"

Maximum upstream keep-alive connections. Env: HEADROOM_MAX_KEEPALIVE.

--request-timeout-seconds

integer

default:"300"

Request timeout in seconds. Useful for slow local providers. Env: HEADROOM_REQUEST_TIMEOUT.

--retry-max-attempts

integer

default:"3"

Maximum upstream retry attempts for connect/read/5xx failures (1–10). Env: HEADROOM_RETRY_MAX_ATTEMPTS.

Deployment & security

--telemetry

flag

Opt in to anonymous usage telemetry (off by default). Env: HEADROOM_TELEMETRY=on.

--no-telemetry

flag

Explicitly disable telemetry. Env: HEADROOM_TELEMETRY=off.

--stateless

flag

Disable all filesystem writes — run purely in-memory. For containerized, read-only, or load-balanced deployments. Memory, TOIN, and log file persistence are all disabled. Env: HEADROOM_STATELESS=true.

When binding to a non-loopback address (e.g. --host 0.0.0.0) without setting HEADROOM_PROXY_TOKEN, all /v1/* endpoints are unauthenticated. Always set an inbound token in network-exposed deployments.
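The rule above can be enforced with a pre-flight check in a launch script. The sketch below is one possible guard, not part of Headroom itself: check_bind_safety is a hypothetical helper name, and its loopback test is deliberately simplified (127.x.x.x, localhost, and ::1 only). How the proxy validates the token and how clients present it are outside the scope of this sketch.

```shell
# Hypothetical helper: succeeds (0) when starting is safe, fails (1) when a
# non-loopback bind is requested without HEADROOM_PROXY_TOKEN being set.
check_bind_safety() {
    host=$1
    case "$host" in
        127.*|localhost|::1) return 0 ;;   # loopback: no token required
    esac
    [ -n "${HEADROOM_PROXY_TOKEN:-}" ]
}

if check_bind_safety "${HEADROOM_HOST:-127.0.0.1}"; then
    echo "ok to start"
else
    echo "refusing: set HEADROOM_PROXY_TOKEN before binding to a non-loopback address" >&2
fi
```

In a real launch script, the "ok to start" branch would run headroom proxy with the chosen host.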

Examples

# Start with memory and traffic learning on port 9000
headroom proxy --port 9000 --memory --learn

# AWS Bedrock backend, us-east-1
headroom proxy --backend bedrock --region us-east-1

# Stateless container deployment (set HEADROOM_PROXY_TOKEN first: 0.0.0.0 is a non-loopback bind)
headroom proxy --stateless --host 0.0.0.0 --workers 4

# Cache mode (freezes prior turns for prefix-cache hits) with a $20 daily budget
headroom proxy --mode cache --budget 20 --budget-period daily

# OpenRouter backend
headroom proxy --backend openrouter

# Disable Kompress ML compression (structural compression only)
headroom proxy --disable-kompress

# Code-aware AST compression plus code-graph indexing of the working directory
headroom proxy --code-aware --code-graph
