Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/headroomlabs-ai/headroom/llms.txt

Use this file to discover all available pages before exploring further.

Headroom sits between your application and the LLM provider. It intercepts outgoing messages, compresses them through a deterministic pipeline, and forwards the optimized request. The provider’s response comes back unchanged. Every entry point — SDK wrapper, proxy, and framework integrations — feeds into the same canonical pipeline with the same lifecycle guarantees.

Full Pipeline Diagram

+-----------------------------------------------------------------------+
|                          YOUR APPLICATION                             |
|  Claude Code · Cursor · Codex · LangChain · Agno · your own code…    |
+-----------------------------------------------------------------------+
         │  prompts · tool outputs · logs · RAG results · files

+-----------------------------------------------------------------------+
|                    HEADROOM  (runs locally — data never leaves)       |
|                                                                       |
|  ┌──────────────┐   ┌─────────────────────────────────────────────┐  |
|  │  CacheAligner│──▶│  ContentRouter                              │  |
|  │              │   │    ├─ SmartCrusher   (JSON arrays)          │  |
|  │  Detects     │   │    ├─ CodeAwareCompressor (AST, tree-sitter) │  |
|  │  volatile    │   │    ├─ LogCompressor  (build / test logs)    │  |
|  │  tokens;     │   │    └─ Kompress-v2-base (prose, HF model)   │  |
|  │  warns if    │   └─────────────────────────────────────────────┘  |
|  │  prefix is   │                      │                             |
|  │  unstable    │                      ▼                             |
|  └──────────────┘   ┌─────────────────────────────────────────────┐  |
|                     │  CCR — Compress-Cache-Retrieve               │  |
|                     │  Originals cached locally; LLM retrieves     │  |
|                     │  via headroom_retrieve tool call             │  |
|                     └─────────────────────────────────────────────┘  |
|                                                                       |
|  Cross-agent memory  ·  headroom learn  ·  MCP server                |
+-----------------------------------------------------------------------+
         │  compressed prompt  +  retrieval tool

+-----------------------------------------------------------------------+
|              LLM PROVIDER  (Anthropic · OpenAI · Bedrock · …)        |
+-----------------------------------------------------------------------+

Three Entry Points

All three entry points share the same pipeline and lifecycle. Choose based on how much control you want and how many code changes you’re willing to make.

SDK Mode

Wrap your LLM client with HeadroomClient. Minimal code changes — swap the client constructor.
from headroom import compress

result = compress(messages, model="gpt-4o")
# result.messages, result.tokens_saved

Proxy Mode

Run headroom proxy --port 8787 and point your existing client at localhost:8787. Zero code changes — just update the base URL.
headroom proxy --port 8787
# or wrap a coding agent:
headroom wrap claude

Integrations

Drop-in adapters for LangChain, Vercel AI SDK, Agno, LiteLLM, and ASGI middleware. Framework-specific setup, same pipeline underneath.
# LangChain
from headroom.integrations import HeadroomChatModel
llm = HeadroomChatModel(your_llm)

# Vercel AI SDK
wrapLanguageModel({ model, middleware: headroomMiddleware() })

The Transform Pipeline

Headroom exposes one stable request lifecycle across compress(), the SDK, and the proxy. Every request transitions through these stages in order:
Setup → Pre-Start → Post-Start → Input Received → Input Cached
      → Input Routed → Input Compressed → Input Remembered
      → Pre-Send → Post-Send → Response Received
These map to PipelineStage enum values in headroom/pipeline.py:
StageWhat happens
SETUPPipeline initialization, provider detection
PRE_STARTPre-flight checks, rate limiting
POST_STARTSession context loaded
INPUT_RECEIVEDRaw messages ingested; extension hooks may rewrite them
INPUT_CACHEDProvider KV cache state checked; frozen prefix identified
INPUT_ROUTEDContentRouter assigns a compressor to each message block
INPUT_COMPRESSEDCompressors run; CCR originals cached; metrics captured
INPUT_REMEMBEREDCross-agent memory consulted; SharedContext updated
PRE_SENDFinal message array assembled; provider headers added
POST_SENDRequest sent to provider; streaming begins
RESPONSE_RECEIVEDResponse intercepted; CCR tool calls handled

Stage 1: CacheAligner (prefix stabilization)

CacheAligner runs first, before any compression, so it can measure the stable prefix before it’s potentially altered. It detects volatile tokens (UUIDs, ISO 8601 timestamps, JWTs, hex hashes) in system messages and emits warnings. The prompt itself is never modified.
Before: "You are helpful. Current Date: 2025-06-15"
         ─────────────────────────────────────────
         Changes daily → cache miss every request

Warning: [CacheAligner] system prompt contains 1 volatile token (iso8601: "2025-06-15")
         Prefix will not cache. Move dynamic content to the message tail.

Stage 2: SmartCrusher / ContentRouter (content compression)

ContentRouter dispatches each message block to the optimal compressor (see How Compression Works). This is where the bulk of token savings happen:
  • SmartCrusher for JSON arrays: field-level statistical analysis, Kneedle algorithm, anomaly/error preservation.
  • CodeAwareCompressor for source code: AST parsing via tree-sitter, syntax-valid output.
  • LogCompressor for build/test logs: pattern clustering, error line preservation.
  • Kompress-v2-base for plain text: ONNX INT8 ModernBERT inference.

Provider-Specific Behavior

Provider and tool-specific behavior is isolated under headroom/providers/ so the core orchestration stays focused on lifecycle and policy.
headroom/providers/
  ├── anthropic/     — cache_control injection, streaming handler, batch API
  ├── openai/        — prefix caching alignment, tool normalization
  ├── gemini/        — CachedContent API lifecycle management
  ├── claude/        — CLI/tool slices for headroom wrap claude
  ├── copilot/       — GitHub Copilot CLI subscription routing
  ├── codex/         — Codex-specific memory sharing
  ├── openclaw/      — ContextEngine plugin integration
  └── registry.py    — Backend selection and runtime dispatch
Each provider slice handles: API target normalization, backend selection, transport dispatch, and provider-specific response parsing. Core files (client.py, cli/proxy.py, proxy/server.py) delegate entirely to provider slices.

Pipeline Extensions

Headroom exposes two extension seams for customizing the lifecycle without forking core code.

on_pipeline_event() — PipelineExtension protocol

Register an extension that receives every PipelineEvent as it fires:
from headroom.pipeline import PipelineExtension, PipelineEvent, PipelineStage

class MyExtension:
    def on_pipeline_event(self, event: PipelineEvent) -> PipelineEvent | None:
        if event.stage == PipelineStage.INPUT_COMPRESSED:
            tokens_before = event.metadata.get("tokens_before", 0)
            tokens_after = event.metadata.get("tokens_after", 0)
            print(f"Compressed: {tokens_before}{tokens_after} tokens")
        return event  # return None to leave event unchanged
Extensions are discovered automatically via Python entry points (group headroom.pipeline_extension) or passed directly:
from headroom import compress
from headroom.config import HeadroomConfig

result = compress(
    messages,
    model="gpt-4o",
    hooks=MyExtension(),  # passed as hooks; must implement on_pipeline_event
)

CompressionHooks — additional extension seam

CompressionHooks sits alongside the canonical lifecycle as a complementary seam. It provides pre_compress, post_compress, and compute_biases callbacks for finer-grained control over the compression step itself:
from headroom.hooks import CompressionHooks, CompressContext, CompressEvent

class MyHooks(CompressionHooks):
    def pre_compress(self, messages, ctx: CompressContext):
        # Inspect or rewrite messages before compression
        return messages

    def post_compress(self, event: CompressEvent):
        # Record metrics after compression
        print(f"Saved {event.tokens_saved} tokens via {event.transforms_applied}")

    def compute_biases(self, messages, ctx: CompressContext):
        # Return per-message compression bias hints
        return None

result = compress(messages, model="gpt-4o", hooks=MyHooks())

Proxy Architecture

The proxy is a FastAPI + Uvicorn server that acts as a drop-in OpenAI/Anthropic-compatible endpoint. Upstream requests are forwarded via an httpx async client.
Client request
  ──► FastAPI route handler (headroom/proxy/server.py)
  ──► PipelineExtensionManager.emit(INPUT_RECEIVED)
  ──► CacheAligner.apply()
  ──► ContentRouter.apply()  →  SmartCrusher / CodeAwareCompressor / Kompress
  ──► CCR: cache originals, inject headroom_retrieve tool
  ──► PipelineExtensionManager.emit(PRE_SEND)
  ──► httpx upstream client → LLM provider
  ──► CCRResponseHandler intercepts streaming / non-streaming response
  ──► PipelineExtensionManager.emit(RESPONSE_RECEIVED)
  ──► Response forwarded to client
Proxy extensions remain the server/app integration seam for ASGI middleware, custom routes, and startup policy — separate from the canonical pipeline lifecycle extensions above:
from headroom.middleware import CompressionMiddleware

# ASGI middleware (FastAPI, Starlette, any ASGI app)
app.add_middleware(CompressionMiddleware)

What Headroom Does Not Touch

  • User messages — never compressed by default (compress_user_messages=False)
  • System prompt content — preserved exactly; CacheAligner only emits warnings
  • Code — passes through unchanged unless CodeAwareCompressor is explicitly enabled (pip install "headroom-ai[code]")
  • Short content — tool outputs under min_tokens_to_compress (default: 250 tokens) pass through unchanged
  • Model responses — returned to your client exactly as received from the provider
  • Excluded toolsRead, Glob, Grep, Write, and Edit tool outputs are excluded from compression by default (they contain exact file content needed for edit workflows)

Build docs developers (and LLMs) love