Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/headroomlabs-ai/headroom/llms.txt

Use this file to discover all available pages before exploring further.

Headroom sits between your agent and the LLM provider, intercepting every prompt before it is sent and compressing its contents — tool outputs, build logs, RAG chunks, file reads, and conversation history — using content-aware algorithms that preserve the information the model actually needs. This page explains what Headroom does, how it is deployed, the architecture that makes it work, and how it compares to alternative approaches.

What Headroom Does

Every AI agent accumulates context waste: search results with hundreds of near-duplicate entries, log files where one FATAL line matters, JSON arrays where the first and last entries tell the whole story. Headroom removes that waste automatically. The LLM sees a compressed prompt containing the same signal in a fraction of the tokens, and it answers identically — the benchmarks prove it.

Quickstart

Install and compress your first messages in under 5 minutes.

Proxy Guide

Zero code changes: run headroom proxy and point your client at it.

Agent Wrap

One command wraps Claude Code, Codex, Cursor, Aider, and more.

How Compression Works

ContentRouter, SmartCrusher, CodeCompressor, and Kompress-v2-base explained.

Proven Savings and Accuracy

Headroom delivers 60–95% token reduction on real agent workloads without degrading model quality: Savings on real workloads:
WorkloadBeforeAfterSavings
Code search (100 results)17,7651,40892%
SRE incident debugging65,6945,11892%
GitHub issue triage54,17414,76173%
Codebase exploration78,50241,25447%
Accuracy preserved on standard benchmarks:
BenchmarkCategoryNBaselineHeadroomDelta
GSM8KMath1000.8700.870±0.000
TruthfulQAFactual1000.5300.560+0.030
SQuAD v2QA10097%19% compression
BFCLTools10097%32% compression
Reproduce the benchmarks yourself: python -m headroom.evals suite --tier 1

Three Deployment Modes

Headroom meets you wherever you are. You do not need to restructure your code to start saving tokens.
Import compress() directly in any Python or TypeScript application. One call, no proxy, no infrastructure:
from headroom import compress

result = compress(messages, model="claude-sonnet-4-5-20250929")
# result.messages  — compressed, same format
# result.tokens_saved, result.compression_ratio
import { compress } from 'headroom-ai';

const result = await compress(messages, {
  model: 'gpt-4o',
  baseUrl: 'http://localhost:8787',
});

Architecture

Your agent / app
  (Claude Code, Cursor, Codex, LangChain, Agno, Strands, your own code…)
       │   prompts · tool outputs · logs · RAG results · files

   ┌────────────────────────────────────────────────────┐
   │  Headroom   (runs locally — your data stays here)  │
   │  ────────────────────────────────────────────────  │
   │  CacheAligner  →  ContentRouter  →  CCR            │
   │                    ├─ SmartCrusher   (JSON)        │
   │                    ├─ CodeCompressor (AST)         │
   │                    └─ Kompress-v2-base (text, HF)  │
   │                                                    │
   │  Cross-agent memory  ·  headroom learn  ·  MCP     │
   └────────────────────────────────────────────────────┘
       │   compressed prompt  +  retrieval tool

LLM provider  (Anthropic · OpenAI · Bedrock · …)

ContentRouter

The ContentRouter inspects every message block, detects its content type (JSON, source code, log file, search results, plain prose), and dispatches it to the right compressor. Detection uses Magika’s ML-based content-type classifier — no regex fragility, no configuration required.

SmartCrusher (JSON)

Handles JSON arrays of objects — the format most tool outputs use. It deduplicates near-identical array entries, preserves anomalies and high-relevance items, and applies BM25 scoring against the user’s query. Typical savings: 70–90%.

CodeCompressor (AST)

Uses tree-sitter to parse source code into an AST, then strips comments, docstrings, and body blocks that are structurally irrelevant to the current query. Supports Python, JavaScript/TypeScript, Go, Rust, Java, C/C++, and Perl. Typical savings: 40–70%.

Kompress-v2-base

A HuggingFace model fine-tuned on agentic traces. It handles prose — plain text, documentation snippets, conversation history — that doesn’t fit a structured compressor. Runs locally via ONNX INT8 quantization (no GPU needed). Typical savings: 30–50%.

CacheAligner

Stabilizes message prefixes so provider KV caches (Anthropic prompt caching, OpenAI context caching) actually hit. Without stable prefixes, even repeated system prompts miss the cache and cost full price on every request.

CCR: Reversible Compression

Every compressed block is stored locally under a content-addressed hash (CCR — Cached Compressed Reference). If the LLM needs the original during a reasoning step, it calls headroom_retrieve with the hash and receives the full original text.
Originals are cached locally within the configured TTL. CCR makes Headroom’s compression reversible — the model is never cut off from information it legitimately needs.

Local-First: Your Data Never Leaves Your Machine

Headroom runs the entire compression pipeline on your hardware. No content is sent to a third-party compression service. The headroom proxy process, the Kompress model weights, and the CCR cache all live on your machine under ~/.headroom/.
Running Headroom across a whole engineering org? The open-source project is designed for individual developers. For shared, always-on deployments with centralized config, org-wide dashboards, and SSO, contact hello@headroomlabs.ai.

How Headroom Compares

ScopeDeployLocalReversible
HeadroomAll context — tools, RAG, logs, files, historyProxy · library · middleware · MCP
RTKCLI command outputsCLI wrapper
lean-ctxCLI commands, MCP tools, editor rulesCLI wrapper · MCP
Compresr / Token Co.Text sent to their APIHosted API call
OpenAI CompactionConversation historyProvider-native
Headroom ships with the RTK binary for shell-output rewriting and can also use lean-ctx as the selected CLI context tool (HEADROOM_CONTEXT_TOOL=lean-ctx). Both are first-class parts of the Headroom stack — Headroom compresses everything downstream of them.

Integrations

Drop Headroom into any existing stack:
SetupIntegration
Any Python appcompress(messages, model=…)
Any TypeScript appawait compress(messages, { model })
Anthropic / OpenAI SDKwithHeadroom(new Anthropic()) · withHeadroom(new OpenAI())
Vercel AI SDKwrapLanguageModel({ model, middleware: headroomMiddleware() })
LiteLLMlitellm.callbacks = [HeadroomCallback()]
LangChainHeadroomChatModel(your_llm)
AgnoHeadroomAgnoModel(your_model)
ASGI appsapp.add_middleware(CompressionMiddleware)
Multi-agentSharedContext().put / .get
MCP clientsheadroom mcp install

Build docs developers (and LLMs) love