Headroom: Context Compression for AI Agents

Headroom is the context compression layer for AI agents. It sits between your application and the LLM provider and compresses everything the model reads (tool outputs, logs, RAG results, files, and conversation history) before it reaches the LLM. On the workloads measured below, this cuts token counts by 47% to 92% while preserving accuracy on standard benchmarks.

Quickstart

Get from zero to compressed LLM calls in under 5 minutes.

Installation

Install via pip, npm, or Docker with the right extras for your stack.

How Compression Works

SmartCrusher, CodeCompressor, Kompress, and ContentRouter explained.

API Reference

Full Python SDK, TypeScript SDK, CLI, and Proxy HTTP API reference.

Pick your integration path

Library

Call compress(messages) in Python or TypeScript. Drop into any LLM app, no infra required.

Proxy

Run headroom proxy and point any existing client at it. Zero code changes.

Agent Wrap

One command wraps Claude Code, Codex, Cursor, Aider, Cline, and more.

MCP Server

Install as an MCP tool for Claude Code, Cursor, or any MCP-compatible host.

Real savings on real workloads

Workload	Tokens before	Tokens after	Savings
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub issue triage	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%
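
The savings column can be reproduced from the before and after counts. The short sketch below does that arithmetic with the numbers copied from the table; the function name `savings_percent` is only an illustrative helper, not part of Headroom.

```python
# Token counts copied from the table above: (workload, before, after).
WORKLOADS = [
    ("Code search (100 results)", 17_765, 1_408),
    ("SRE incident debugging", 65_694, 5_118),
    ("GitHub issue triage", 54_174, 14_761),
    ("Codebase exploration", 78_502, 41_254),
]


def savings_percent(before: int, after: int) -> int:
    """Share of tokens removed, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)


for name, before, after in WORKLOADS:
    print(f"{name}: {savings_percent(before, after)}% fewer tokens")
```

The same helper can be pointed at your own before and after counts to compare a workload against the ones listed here.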

Accuracy is preserved. GSM8K math benchmark: ±0.000 delta. SQuAD v2 QA: 97% at 19% compression. BFCL tool-use: 97% at 32% compression.

Key features

Reversible Compression (CCR)

Originals are cached locally. The LLM calls headroom_retrieve when it needs the full data — nothing is permanently lost.
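
To make the idea concrete, here is a minimal sketch of reversible compression in general, assuming a local SQLite table keyed by a content hash. It is not Headroom's implementation: the table layout, the truncation strategy, and the names `compress_reversibly` and `retrieve` are all hypothetical stand-ins for whatever the real pipeline and headroom_retrieve do.

```python
import hashlib
import sqlite3
from typing import Optional

# Hypothetical local store; an in-memory database stands in for a file on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS originals (key TEXT PRIMARY KEY, body TEXT)")


def compress_reversibly(text: str, keep_chars: int = 80) -> str:
    """Cache the original and return a shortened stand-in that carries its key."""
    if len(text) <= keep_chars:
        return text  # already short enough, nothing to cache
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    db.execute("INSERT OR REPLACE INTO originals VALUES (?, ?)", (key, text))
    omitted = len(text) - keep_chars
    return f"{text[:keep_chars]} [... {omitted} chars omitted, retrieve with key={key}]"


def retrieve(key: str) -> Optional[str]:
    """Return the cached original for a key, or None if the key is unknown."""
    row = db.execute("SELECT body FROM originals WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

The model only sees the stand-in; if it decides it needs the omitted part, a retrieval tool can look the original up by key, so the compression loses nothing permanently.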

ContentRouter

Auto-detects JSON, code, logs, plain text, and images — routes each to the best compressor automatically.
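
As an illustration of content-type routing, the sketch below classifies a text payload with a few simple heuristics. The rules, regular expressions, and the function name `detect_kind` are assumptions made for this example; the actual ContentRouter may detect types quite differently, and image detection is left out entirely.

```python
import json
import re

# Heuristic patterns (assumptions for this sketch, not Headroom's rules).
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")
CODE_HINT = re.compile(
    r"^\s*(def |class |import |from \S+ import |function |const |#include )",
    re.MULTILINE,
)


def detect_kind(text: str) -> str:
    """Return one of 'json', 'logs', 'code', or 'text' for a payload."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass  # looked like JSON but is not; fall through to other checks
    lines = [ln for ln in stripped.splitlines() if ln]
    # Treat the payload as logs when at least half the lines start with a timestamp.
    if lines and sum(bool(LOG_LINE.match(ln)) for ln in lines) >= len(lines) / 2:
        return "logs"
    if CODE_HINT.search(stripped):
        return "code"
    return "text"
```

A router built on this would dispatch on the returned label, sending each payload to the compressor registered for that kind.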

Persistent Memory

Hierarchical, temporal memory across conversations and agents. Zero extra latency — extraction happens inline.

Failure Learning

headroom learn mines past sessions and writes corrections to CLAUDE.md, AGENTS.md, or GEMINI.md.

Cache Optimization

CacheAligner stabilizes prefixes so Anthropic and OpenAI KV caches actually hit on repeated calls.
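
Provider prompt caches match on an exact prefix, so anything that changes early in the prompt invalidates everything after it. The sketch below shows one general way to keep a prefix stable: serialize structured parts deterministically and push volatile values to the end. It is an illustration of the idea only; `build_prompt` and `shared_prefix_len` are hypothetical helpers, not the CacheAligner API.

```python
import json
import os


def build_prompt(system_rules: str, tools: dict, volatile: dict) -> str:
    """Put stable content first, with sorted keys, and volatile content last."""
    stable = system_rules + "\n" + json.dumps(tools, sort_keys=True)
    return stable + "\n" + json.dumps(volatile, sort_keys=True)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


# Two calls with differently ordered tool definitions and different timestamps.
first = build_prompt("Be terse.", {"search": 1, "read": 2}, {"now": "2024-01-01T00:00:00"})
second = build_prompt("Be terse.", {"read": 2, "search": 1}, {"now": "2024-01-02T09:30:00"})
print(shared_prefix_len(first, second), "of", len(first), "characters shared")
```

Because the tool definitions serialize identically regardless of insertion order and the timestamp sits at the tail, the two prompts diverge only in the final characters.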

Output Token Reduction

Verbosity steering and effort routing also cut what the model writes back, which matters because output tokens cost 5× as much as input tokens on Opus-class models.
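
A quick worked example shows why output tokens are worth trimming. It reads the 5× figure as the ratio of output-token price to input-token price, which is an assumption, and uses arbitrary price units and made-up token counts instead of real provider pricing.

```python
# Placeholder prices in arbitrary units per token; only the 5x ratio matters.
INPUT_PRICE = 1
OUTPUT_PRICE = 5 * INPUT_PRICE


def call_cost(input_tokens: int, output_tokens: int) -> int:
    """Total cost of one call in the placeholder price units."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


baseline = call_cost(10_000, 1_000)
fewer_output = call_cost(10_000, 800)  # trim 200 output tokens
fewer_input = call_cost(9_000, 1_000)  # trim 1,000 input tokens

print(baseline - fewer_output, baseline - fewer_input)
```

Under these assumptions, removing 200 output tokens saves as much as removing 1,000 input tokens.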

Integrations

Headroom integrates with the major Python and TypeScript LLM SDKs and frameworks:

OpenAI SDK

withHeadroom(new OpenAI())

Anthropic SDK

withHeadroom(new Anthropic())

LangChain

HeadroomChatModel

Vercel AI SDK

headroomMiddleware()

Agno

HeadroomAgnoModel

LiteLLM

HeadroomCallback()

Headroom is local-first. Your data never leaves your machine. The compression pipeline runs entirely on your hardware using local models and a local SQLite store.
