The compress() function is the simplest entry point into Headroom. Pass a list of messages, get a compressed list back — no proxy, no configuration, no client wrapping needed. It works identically whether you’re using the Anthropic SDK, OpenAI SDK, LiteLLM, or a raw HTTP client.
from headroom import compress

result = compress(messages, model="claude-sonnet-4-5-20250929")
result.messages           # Compressed messages — same format as input
result.tokens_saved       # How many tokens were removed
result.compression_ratio  # e.g., 0.65 means 65% of tokens saved

Function Signature

def compress(
    messages: list[dict[str, Any]],
    model: str = "claude-sonnet-4-5-20250929",
    model_limit: int = 200000,
    optimize: bool = True,
    hooks: Any = None,
    config: CompressConfig | None = None,
    **kwargs: Any,
) -> CompressResult: ...

Parameters

messages
list[dict[str, Any]]
required
List of messages in Anthropic or OpenAI format. Each message should have at minimum a role key and a content key.
model
str
default:"claude-sonnet-4-5-20250929"
Model name used for token counting and determining the context window limit. Pass the exact model string you will use when calling the LLM (e.g. "gpt-4o", "claude-opus-4-20250514").
model_limit
int
default:"200000"
Context window size in tokens for the target model. Headroom uses this to decide how aggressively to compress when the context is nearly full.
optimize
bool
default:"true"
Set to False to disable compression entirely and return the original messages unchanged. Useful for A/B testing or gradual rollout.
hooks
CompressionHooks | None
default:"None"
Optional CompressionHooks instance. Hooks let you inject pre-processing, per-message bias overrides, and post-compression observability callbacks.
config
CompressConfig | None
default:"None"
Full compression configuration object. Individual **kwargs override fields inside this object; you can mix both.
**kwargs
Any
Shorthand for any CompressConfig field — passed directly as keyword arguments. Valid keys: compress_user_messages, compress_system_messages, protect_recent, protect_analysis_context, target_ratio, min_tokens_to_compress, kompress_model, savings_profile.
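
The precedence rule above (keyword arguments override fields of config) can be pictured with a small stand-in. This sketch does not use Headroom's own classes: SketchConfig and merge_overrides are hypothetical names, only three of the documented fields are modelled (with the defaults listed on this page), and rejecting unknown keys is an assumption. It illustrates the documented behaviour, not the library's internals.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for CompressConfig (not the real class)."""
    compress_user_messages: bool = False
    protect_recent: int = 4
    target_ratio: Optional[float] = None


def merge_overrides(config: Optional[SketchConfig], **kwargs: Any) -> SketchConfig:
    """Return a config in which keyword arguments win over fields of `config`."""
    base = config if config is not None else SketchConfig()
    valid = {f.name for f in fields(SketchConfig)}
    unknown = set(kwargs) - valid
    if unknown:
        # Assumption: unknown keys are rejected; the real behaviour is not documented here.
        raise TypeError(f"unknown config keys: {sorted(unknown)}")
    return replace(base, **kwargs)


cfg = SketchConfig(target_ratio=0.5, protect_recent=0)
merged = merge_overrides(cfg, protect_recent=2)  # kwarg overrides the config field
print(merged)
```

Fields that are not overridden (here target_ratio) keep the value from the config object, which is what makes mixing both forms practical.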

CompressConfig

CompressConfig controls what gets compressed, how aggressively, and with which model variant. Pass it as config= or use the shorthand **kwargs form.
from headroom import compress, CompressConfig

cfg = CompressConfig(
    compress_user_messages=True,
    target_ratio=0.5,
    protect_recent=0,
)
result = compress(messages, model="claude-opus-4-20250514", config=cfg)
compress_user_messages
bool
default:"false"
Compress user messages as well as tool outputs. Default is False because coding agents need to see exact user instructions. Set True for document pipelines, RAG, or when user messages contain large tool outputs.
compress_system_messages
bool
default:"true"
Compress system messages. Set False to preserve system prompts exactly, for example in voice agents where tool definitions must not be altered.
protect_recent
int
default:"4"
Number of trailing messages to leave uncompressed. These are the active conversation turns. Set 0 to compress the entire context.
protect_analysis_context
bool
default:"true"
Detect analyze / review intent in the conversation and protect code blocks from compression when found.
target_ratio
float | None
default:"None"
Keep ratio for the text (Kompress) compressor. None lets the model decide (~15% kept, aggressive). 0.5 keeps 50% (safe for documents). Only affects text compression — SmartCrusher (JSON) uses its own statistical logic.
min_tokens_to_compress
int
default:"250"
Minimum token count for a message to be eligible for compression. Messages shorter than this threshold are left untouched.
kompress_model
str | None
default:"None"
HuggingFace model ID for the Kompress text compressor. None uses the default (chopratejas/kompress-v2-base). Set to "disabled" to skip ML text compression entirely — only SmartCrusher and CacheAligner will run.
savings_profile
str | None
default:"None"
Named high-savings preset. For example, "agent-90" applies settings tuned for Codex/Claude/Cursor coding agents targeting 90% token reduction.
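
To make the interaction of protect_recent, min_tokens_to_compress, and compress_user_messages concrete, here is a rough sketch of which messages would be eligible for compression. It illustrates the documented semantics only: eligible_indices is a hypothetical helper, the whitespace token counter is a crude stand-in for a real tokenizer, and compress_system_messages is ignored for brevity.

```python
from typing import Any, Callable


def eligible_indices(
    messages: list[dict[str, Any]],
    protect_recent: int = 4,
    min_tokens_to_compress: int = 250,
    compress_user_messages: bool = False,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[int]:
    """Return the indices of messages that would be candidates for compression."""
    # The last `protect_recent` messages are treated as active turns.
    cutoff = max(len(messages) - protect_recent, 0)
    picked = []
    for i, msg in enumerate(messages[:cutoff]):
        if msg["role"] == "user" and not compress_user_messages:
            continue  # user instructions stay verbatim by default
        if count_tokens(str(msg["content"])) < min_tokens_to_compress:
            continue  # below the threshold, left untouched
        picked.append(i)
    return picked


history = [
    {"role": "user", "content": "word " * 300},
    {"role": "tool", "content": "row " * 400},
    {"role": "assistant", "content": "short reply"},
    {"role": "tool", "content": "row " * 400},
]
print(eligible_indices(history, protect_recent=1))  # [1]
print(eligible_indices(history, protect_recent=0, compress_user_messages=True))  # [0, 1, 3]
```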

CompressResult

compress() always returns a CompressResult — it never raises on failure, instead reverting to the original messages and logging a warning.
messages
list[dict[str, Any]]
The compressed messages in the same format as the input. Drop-in replacement for the original messages list.
tokens_before
int
Token count before compression. 0 if compression was skipped (empty input, optimize=False, or a failure that triggered the safety guard).
tokens_after
int
Token count after compression.
tokens_saved
int
Tokens removed: tokens_before - tokens_after.
compression_ratio
float
Fraction of tokens saved. 0.0 means nothing was saved; 0.65 means 65% of tokens were removed.
transforms_applied
list[str]
Internal names of every transform that ran, e.g. ["router:tool_result:json", "smart_crusher", "cache_aligner"]. Useful for debugging. When the inflation guard fires, the list contains ["inflation_guard:reverted"].
If compression would inflate the token count (a rare edge case), Headroom automatically reverts to the original messages. The returned CompressResult will have tokens_saved=0 and transforms_applied=["inflation_guard:reverted"].
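
The relationship between the result fields and the inflation guard can be sketched as follows. This is not Headroom's implementation: SketchResult and guard_result are hypothetical names, the formula compression_ratio = tokens_saved / tokens_before is an assumption inferred from the field descriptions, and setting tokens_after equal to tokens_before on revert is likewise assumed.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchResult:
    """Hypothetical stand-in mirroring the documented CompressResult fields."""
    messages: list[dict[str, Any]]
    tokens_before: int
    tokens_after: int
    transforms_applied: list[str] = field(default_factory=list)

    @property
    def tokens_saved(self) -> int:
        return self.tokens_before - self.tokens_after

    @property
    def compression_ratio(self) -> float:
        # Assumed formula; returns 0.0 when nothing was counted (skipped runs).
        return self.tokens_saved / self.tokens_before if self.tokens_before else 0.0


def guard_result(original, candidate, tokens_before, tokens_after, transforms):
    """Revert to the original messages if the candidate uses more tokens."""
    if tokens_after > tokens_before:
        return SketchResult(original, tokens_before, tokens_before,
                            ["inflation_guard:reverted"])
    return SketchResult(candidate, tokens_before, tokens_after, list(transforms))


ok = guard_result([{"role": "user", "content": "long"}],
                  [{"role": "user", "content": "short"}],
                  1000, 350, ["smart_crusher"])
print(ok.tokens_saved, f"{ok.compression_ratio:.0%}")  # 650 65%

bad = guard_result([{"role": "user", "content": "a"}],
                   [{"role": "user", "content": "a plus markers"}],
                   100, 120, ["smart_crusher"])
print(bad.tokens_saved, bad.transforms_applied)  # 0 ['inflation_guard:reverted']
```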

Usage Examples

from anthropic import Anthropic
from headroom import compress

client = Anthropic()
# huge_tool_output is assumed to be a large string you already have (e.g. raw tool output)
messages = [{"role": "user", "content": huge_tool_output}]

compressed = compress(messages, model="claude-sonnet-4-5-20250929")

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=compressed.messages,   # <-- drop in compressed messages
)

print(f"Saved {compressed.tokens_saved} tokens ({compressed.compression_ratio:.0%})")

Document and Financial Pipelines

For documents, RAG pipelines, or contexts where user messages contain large data payloads, enable user-message compression and adjust target_ratio:
from headroom import compress

result = compress(
    messages,
    model="claude-opus-4-20250514",
    compress_user_messages=True,   # user messages contain the document
    target_ratio=0.5,              # keep 50% of text (conservative)
    protect_recent=0,              # compress everything, no active turns
)

Using Hooks for Observability

Pass a CompressionHooks subclass to instrument compression events:
from headroom import compress
from headroom.hooks import CompressionHooks, CompressEvent, CompressContext

class LoggingHooks(CompressionHooks):
    def post_compress(self, event: CompressEvent) -> None:
        print(
            f"[{event.model}] {event.tokens_before} -> {event.tokens_after} "
            f"({event.compression_ratio:.0%} saved)"
        )

result = compress(messages, model="gpt-4o", hooks=LoggingHooks())
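
Beyond printing, a hook can accumulate totals across many calls. The sketch below is one possible shape for that, under stated assumptions: SavingsTracker is a hypothetical name, it is written as a plain class so the snippet runs without Headroom installed (in real use it would subclass CompressionHooks), and the events are simulated with SimpleNamespace using the attribute names from the LoggingHooks example.

```python
from collections import defaultdict
from types import SimpleNamespace


class SavingsTracker:
    """Accumulates per-model token totals from post_compress events.

    In real use this would subclass headroom.hooks.CompressionHooks;
    it is a plain class here so the sketch is self-contained.
    """

    def __init__(self) -> None:
        self.before = defaultdict(int)
        self.after = defaultdict(int)

    def post_compress(self, event) -> None:
        self.before[event.model] += event.tokens_before
        self.after[event.model] += event.tokens_after

    def saved_fraction(self, model: str) -> float:
        before = self.before[model]
        return (before - self.after[model]) / before if before else 0.0


tracker = SavingsTracker()
# Simulated events standing in for what compress() would pass to the hook.
tracker.post_compress(SimpleNamespace(model="gpt-4o", tokens_before=1000, tokens_after=400))
tracker.post_compress(SimpleNamespace(model="gpt-4o", tokens_before=3000, tokens_after=1600))
print(f"{tracker.saved_fraction('gpt-4o'):.0%} saved overall")  # 50% saved overall
```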

compress_spreadsheet()

compress_spreadsheet() compresses .xlsx and .xls files by rendering each sheet as CSV text and running the full compression pipeline per sheet. Requires the spreadsheet extra: pip install "headroom-ai[spreadsheet]" (quoted so that shells such as zsh do not interpret the square brackets).
def compress_spreadsheet(
    path: str,
    model: str = "claude-sonnet-4-5-20250929",
    model_limit: int = 200000,
    **kwargs: Any,
) -> CompressResult: ...

Parameters

path
str
required
Filesystem path to an .xlsx or .xls file.
model
str
default:"claude-sonnet-4-5-20250929"
Model name for token counting and context limit determination.
model_limit
int
default:"200000"
Model context window size in tokens.
**kwargs
Any
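Forwarded to compress(). For example, pass target_ratio=0.3 to ask the text compressor to keep roughly 30% of each sheet's text. As noted under CompressConfig, target_ratio is a keep ratio and does not govern SmartCrusher.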

Example

from headroom import compress_spreadsheet

result = compress_spreadsheet(
    "quarterly_report.xlsx",
    model="gpt-4o",
    target_ratio=0.4,
)

# result.messages is a list of {"role": "user", "content": <sheet CSV>}
# Send to your LLM as-is
print(f"Sheets compressed: {len(result.messages)}")
print(f"Tokens saved: {result.tokens_saved}")
Each sheet becomes its own user message. The tabular compressor (CSV → SmartCrusher) runs per sheet, applying lossless column/row compaction first, falling back to lossy row-drop with CCR markers when lossless savings are insufficient.
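
The "one user message per sheet" shape can be pictured with the standard library alone. This sketch covers only the rendering step before compression and makes several assumptions: sheet_to_csv and sheets_to_messages are hypothetical helpers, the workbook is an in-memory dict rather than a parsed .xlsx file, and the sheet name is not included in the message because the page does not say whether it is.

```python
import csv
import io
from typing import Any


def sheet_to_csv(rows: list[list[Any]]) -> str:
    """Render one sheet's rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def sheets_to_messages(sheets: dict[str, list[list[Any]]]) -> list[dict[str, str]]:
    """Build one user message per sheet, in workbook order."""
    return [{"role": "user", "content": sheet_to_csv(rows)} for rows in sheets.values()]


# Hypothetical workbook contents, standing in for a parsed spreadsheet.
workbook = {
    "Revenue": [["quarter", "usd"], ["Q1", 1200], ["Q2", 1350]],
    "Costs": [["quarter", "usd"], ["Q1", 800]],
}
messages = sheets_to_messages(workbook)
print(len(messages))  # 2
print(messages[0]["content"])
```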

Handling the Result

compress() is safe to call unconditionally. It never raises on compression failure:
from openai import OpenAI
from headroom import compress

client = OpenAI()

result = compress(messages, model="gpt-4o")

# Always use result.messages — safe even on failure
response = client.chat.completions.create(
    model="gpt-4o",
    messages=result.messages,
)

# Optional: log savings
if result.tokens_saved > 0:
    print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
    print(f"Transforms: {result.transforms_applied}")
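
Because the result is always usable, the optimize flag is a convenient switch for a gradual rollout. The sketch below shows one possible way to toggle it per user with a deterministic hash bucket. in_rollout and compress_for_user are hypothetical helpers, and fake_compress is a stand-in for headroom.compress so the snippet runs on its own.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def compress_for_user(compress_fn, messages, user_id: str, percent: int, **kwargs):
    """Call the given compress function with optimize toggled per user."""
    return compress_fn(messages, optimize=in_rollout(user_id, percent), **kwargs)


# Stand-in for headroom.compress; it only echoes the flag it received.
def fake_compress(messages, optimize=True, **kwargs):
    return {"messages": messages, "optimized": optimize}


print(compress_for_user(fake_compress, [], "user-42", percent=100))  # optimized: True
print(compress_for_user(fake_compress, [], "user-42", percent=0))    # optimized: False
```

Hashing the user ID keeps each user in the same bucket across requests, so raising percent only ever adds users to the compressed group.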
