This guide takes you from a fresh install to a compressed LLM call with measured savings. You will install the package, compress a realistic message thread containing a large tool output, send the result to your LLM, and inspect how many tokens were removed. If you prefer zero code changes, the final section covers proxy mode: point any existing client at http://localhost:8787 and compression happens automatically.

Step 1: Install

pip install "headroom-ai[all]"
This installs the headroom CLI, the compress() library function, the local proxy, the Kompress-v2-base model, and all compressors. Requires Python 3.10+.
Prefer pipx or uv? Use an explicit Python 3.13 interpreter to unlock the full savings dashboard (the Proxy $ Saved tile requires LiteLLM, which does not yet support Python 3.14+):
pipx install --python python3.13 "headroom-ai[all]"
# or
uv tool install --python 3.13 "headroom-ai[all]"

Step 2: Compress Messages

Pass your message list to compress(). Headroom returns the same list in the same format, with tool outputs, logs, and repeated content stripped down to their essential information.
from headroom import compress
import json

messages = [
    {"role": "system", "content": "You analyze search results."},
    {"role": "user", "content": "Search for Python tutorials."},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "search", "arguments": '{"q": "python"}'},
        }],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": json.dumps({
            "results": [
                {"title": f"Result {i}", "snippet": f"Description {i}", "score": 100 - i}
                for i in range(500)
            ]
        }),
    },
    {"role": "user", "content": "What are the top 3 results?"},
]

result = compress(messages, model="gpt-4o")
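Before compressing, it can help to see which message dominates the thread. The sketch below ranks messages by character count using only the standard library. The helper name content_sizes is hypothetical and not part of Headroom, and character count is only a rough stand-in for tokens.

```python
import json

def content_sizes(messages):
    """Return (index, role, character_count) for each message, largest first."""
    sizes = []
    for index, message in enumerate(messages):
        # Assistant turns that only carry tool calls have content=None.
        content = message.get("content") or ""
        sizes.append((index, message["role"], len(content)))
    return sorted(sizes, key=lambda item: item[2], reverse=True)

# A shortened version of the thread above, with the same 500-item tool output.
sample = [
    {"role": "user", "content": "Search for Python tutorials."},
    {"role": "assistant", "content": None},
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": json.dumps({
            "results": [
                {"title": f"Result {i}", "snippet": f"Description {i}", "score": 100 - i}
                for i in range(500)
            ]
        }),
    },
]

for index, role, chars in content_sizes(sample):
    print(index, role, chars)
```

In a thread like this one, the tool message comes first in the ranking by a wide margin, which is why tool outputs are the natural target for compression.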

Step 3: Send to Your LLM

Use result.messages exactly as you would the originals. The compressed messages keep the same format, so no other part of your call needs to change.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=result.messages,   # drop-in replacement
)

print(response.choices[0].message.content)
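The same pattern works with the Anthropic SDK. Here messages must be in the format the Anthropic Messages API expects (user and assistant roles only, with the system prompt passed separately), not the OpenAI-style list from Step 2: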
from anthropic import Anthropic
from headroom import compress

client = Anthropic()
compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=compressed.messages,
    max_tokens=1024,
)

Step 4: Check Your Savings

The CompressResult object carries a full accounting of what was removed.
print(f"Tokens before: {result.tokens_before}")
print(f"Tokens after:  {result.tokens_after}")
print(f"Tokens saved:  {result.tokens_saved}")
print(f"Compression:   {result.compression_ratio:.0%}")
print(f"Transforms:    {result.transforms_applied}")
Example output for a 500-item JSON search result:
Tokens before: 45000
Tokens after:  4500
Tokens saved:  40500
Compression:   90%
Transforms:    ['smart_crusher', 'cache_aligner']
The compression_ratio field expresses the fraction of tokens removed, not the fraction kept. A value of 0.9 means 90% of tokens were eliminated and 10% remain. A value of 0.35 means 35% were removed and 65% (1 - 0.35) were kept.
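As a worked check of the arithmetic, the following sketch recomputes the ratio from the token counts in the example output. The function is illustrative only and is not Headroom's implementation; it assumes the ratio is (tokens_before - tokens_after) / tokens_before, which matches the 90% figure shown above.

```python
def compression_ratio(tokens_before, tokens_after):
    """Fraction of tokens removed: (before - after) / before."""
    if tokens_before <= 0:
        return 0.0  # nothing to compress, so nothing was removed
    return (tokens_before - tokens_after) / tokens_before

ratio = compression_ratio(45000, 4500)
print(f"Tokens saved: {45000 - 4500}")
print(f"Compression:  {ratio:.0%}")
```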

Alternative: Proxy Mode (Zero Code Changes)

If you do not want to modify any existing code, run Headroom as a local HTTP proxy and point your client at it. Every request flows through the compression pipeline automatically.
# Start the proxy
headroom proxy --port 8787

# Point Claude Code at it
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 your-app
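If you launch clients from Python rather than a shell, the same two variables can be set programmatically. This is a minimal sketch using only the standard library; proxy_env is a hypothetical helper, not a Headroom function, and the port and variable names are taken from the commands above.

```python
import os

def proxy_env(port=8787, host="localhost"):
    """Copy the current environment and add the proxy base URLs.

    Hypothetical helper for illustration; not part of Headroom.
    """
    base = f"http://{host}:{port}"
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = base       # Anthropic clients: no /v1 suffix
    env["OPENAI_BASE_URL"] = f"{base}/v1"  # OpenAI-compatible clients
    return env

env = proxy_env()
print(env["ANTHROPIC_BASE_URL"])
print(env["OPENAI_BASE_URL"])
# To launch a client with these settings:
#   subprocess.run(["your-app"], env=env)
```

Copying os.environ first keeps API keys and PATH intact for the child process; only the two base URLs are overridden.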
Check cumulative savings at any time:
curl http://localhost:8787/stats
# {"requests_total": 42, "tokens_saved_total": 125000, ...}

headroom perf          # pretty-printed savings summary
headroom dashboard     # open live dashboard in browser
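The /stats response can also be read from a script. The sketch below parses a payload containing only the two fields shown in the curl example and derives an average per request; summarize_stats is a hypothetical helper, and any other fields the endpoint returns are ignored here.

```python
import json

def summarize_stats(payload):
    """Return (requests, tokens_saved, average_saved_per_request)."""
    stats = json.loads(payload)
    requests = stats.get("requests_total", 0)
    saved = stats.get("tokens_saved_total", 0)
    average = saved / requests if requests else 0.0  # avoid dividing by zero
    return requests, saved, average

# Sample payload using the values from the curl example above.
# With the proxy running, the live payload could be fetched with
# urllib.request.urlopen("http://localhost:8787/stats").read().
sample = '{"requests_total": 42, "tokens_saved_total": 125000}'

requests, saved, average = summarize_stats(sample)
print(f"{requests} requests, {saved} tokens saved, {average:.0f} per request")
```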
To wrap a coding agent in one command (starts the proxy and injects the correct environment):
headroom wrap claude     # Claude Code
headroom wrap codex      # OpenAI Codex
headroom wrap aider      # Aider
headroom wrap cursor     # Cursor (prints base URLs for manual setup)

What Gets Compressed

Headroom auto-detects content type and routes each block to the best compressor. No configuration is needed. The biggest savings typically come from tool outputs, which tend to be verbose JSON or log files.
| Content type | Compressor | Typical savings |
| --- | --- | --- |
| JSON arrays | SmartCrusher | 70–90% |
| Source code | CodeCompressor | 40–70% |
| Build / test logs | LogCompressor | 80–95% |
| Search results | SearchCompressor | 60–80% |
| Plain text | Kompress | 30–50% |
Messages shorter than 250 tokens are left unchanged by default. This threshold is configurable via CompressConfig(min_tokens_to_compress=...); lower it for workloads with short turns, such as voice agents.
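The effect of the threshold can be illustrated without calling the library. In this sketch, is_eligible is a hypothetical function that only mirrors the rule stated above (messages under the threshold are skipped). It is not Headroom code, and treating a message of exactly 250 tokens as eligible is an assumption.

```python
DEFAULT_MIN_TOKENS = 250  # default threshold stated above

def is_eligible(token_count, min_tokens_to_compress=DEFAULT_MIN_TOKENS):
    """True if a message is long enough to be considered for compression."""
    return token_count >= min_tokens_to_compress

turn_sizes = [40, 180, 250, 3200]  # token counts for four example messages

# Default threshold: the two short turns are left unchanged.
print([is_eligible(n) for n in turn_sizes])

# Lowered threshold, as you might choose for short voice-agent turns.
print([is_eligible(n, min_tokens_to_compress=50) for n in turn_sizes])
```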

Next Steps

Installation

Docker tags, pipx, uv, Windows setup, and environment variables.

Proxy Server

Configure the proxy, run it as a persistent service, and view the dashboard.

How Compression Works

ContentRouter, SmartCrusher, CodeCompressor, and Kompress-v2-base in depth.

Configuration

CompressConfig fields, target ratios, protecting recent messages, and more.
