HeadroomClient is the full-featured integration path for production applications. It wraps your existing LLM client and intercepts every request, applying compression, cache alignment, and provider-specific cache optimization transparently. The wrapped client is API-compatible with the original — no call-site changes needed.
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

# Use exactly like the original OpenAI client
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

stats = client.get_stats()
print(f"Tokens saved this session: {stats['session']['tokens_saved_total']}")

Constructor

HeadroomClient(
    original_client: Any,
    provider: Provider,
    store_url: str | None = None,
    default_mode: str = "audit",
    model_context_limits: dict[str, int] | None = None,
    cache_optimizer: BaseCacheOptimizer | None = None,
    enable_cache_optimizer: bool = True,
    enable_semantic_cache: bool = False,
    config: HeadroomConfig | None = None,
)
original_client
Any
required
The underlying LLM client to wrap. Accepts any OpenAI-compatible client (e.g. openai.OpenAI(), openai.AsyncOpenAI()) or Anthropic client (anthropic.Anthropic()). The wrapped client is available as client._original.
provider
Provider
required
Provider instance for token counting, context limits, and cost estimation. Use OpenAIProvider() for OpenAI-compatible APIs, AnthropicProvider() for Anthropic. See Providers below.
store_url
str | None
default:"None"
Storage URL for the metrics database. Supports sqlite:///path/to/headroom.db and jsonl:///path/to/metrics.jsonl. Defaults to a SQLite file in the system temp directory.
default_mode
str
default:"audit"
Default operating mode for all requests. Either "audit" (observe only, no modifications) or "optimize" (apply the full compression pipeline). Can be overridden per request via headroom_mode=.
model_context_limits
dict[str, int] | None
default:"None"
Override the context window size for specific models. Keys are model name strings; values are token counts. The provider’s built-in limits are used for any model not listed here.
cache_optimizer
BaseCacheOptimizer | None
default:"None"
Supply a custom cache optimizer instance. When None and enable_cache_optimizer=True, the optimizer is auto-detected from the provider (e.g. AnthropicCacheOptimizer for Anthropic, OpenAICacheOptimizer for OpenAI).
enable_cache_optimizer
bool
default:"True"
Enable provider-specific cache optimization. When enabled, the client inserts cache_control breakpoints (Anthropic) or stabilizes the prefix (OpenAI) to maximize provider-side KV-cache hit rates.
enable_semantic_cache
bool
default:"False"
Enable query-level semantic caching. When a new request is semantically similar to a recent one (above the configured similarity threshold), the cached response is returned immediately without calling the API.
config
HeadroomConfig | None
default:"None"
Full HeadroomConfig object. When provided, takes precedence over store_url, default_mode, and other individual parameters. Use this for fine-grained control over all subsystems.
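
The two storage schemes listed under store_url can be checked with the standard library before the client is constructed. The following is a minimal sketch: store_scheme is a hypothetical helper, not part of the headroom package, and whether Headroom performs a similar check internally is not stated here.

```python
from urllib.parse import urlparse

# Schemes named in the store_url description above.
SUPPORTED_SCHEMES = {"sqlite", "jsonl"}


def store_scheme(store_url: str) -> str:
    """Return the scheme of a store URL, or raise ValueError if unsupported.

    Hypothetical helper for illustration; not part of the headroom package.
    """
    scheme = urlparse(store_url).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported store_url scheme: {scheme!r}")
    return scheme


print(store_scheme("sqlite:///path/to/headroom.db"))
print(store_scheme("jsonl:///path/to/metrics.jsonl"))
```

Running a check like this at startup surfaces a mistyped URL before any request is sent.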

chat.completions.create()

Sends a chat completion request with optional Headroom optimization. Accepts all standard OpenAI parameters plus Headroom-specific overrides prefixed with headroom_.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    # Headroom overrides (all optional)
    headroom_mode="optimize",
    headroom_cache_prefix_tokens=8192,
    headroom_output_buffer_tokens=2000,
    headroom_keep_turns=5,
    headroom_tool_profiles={"web_search": {"skip_compression": True}},
    # Standard OpenAI params pass through unchanged
    temperature=0.7,
    stream=False,
)
model
str
required
Model name for the completion.
messages
list[dict[str, Any]]
required
Message list in OpenAI format.
stream
bool
default:"False"
Stream the response. Headroom compresses before sending; streaming behavior is identical to the underlying client.
headroom_mode
str | None
default:"None"
Override the client’s default_mode for this single request. Values: "audit", "optimize".
headroom_cache_prefix_tokens
int | None
default:"None"
Target number of tokens to keep in the stable, cache-aligned prefix. Passed to the cache optimizer.
headroom_output_buffer_tokens
int | None
default:"None"
Number of tokens to reserve for the model’s output when sizing the input context. Overrides HeadroomConfig.output_buffer_tokens (default 4000).
headroom_keep_turns
int | None
default:"None"
Never compress the last N turns of the conversation. Overrides the pipeline’s default protect_recent setting.
headroom_tool_profiles
dict[str, dict[str, Any]] | None
default:"None"
Per-tool compression overrides. Keys are tool names; values are profile dicts. For example: {"important_tool": {"skip_compression": True}}.
**kwargs
Any
All remaining keyword arguments are forwarded verbatim to the underlying client (e.g. temperature, max_tokens, tools, tool_choice).
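
To make the split between Headroom overrides and pass-through arguments concrete, here is a minimal sketch of how keyword arguments could be partitioned on the headroom_ prefix. The function split_headroom_kwargs is hypothetical and only illustrates the convention; it is not claimed to be how HeadroomClient does this internally.

```python
from typing import Any

PREFIX = "headroom_"


def split_headroom_kwargs(
    kwargs: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Partition kwargs into (overrides, passthrough).

    Override keys have the "headroom_" prefix stripped; every other key is
    returned unchanged, ready to forward to the underlying client.
    Hypothetical helper for illustration only.
    """
    overrides = {
        key[len(PREFIX):]: value
        for key, value in kwargs.items()
        if key.startswith(PREFIX)
    }
    passthrough = {
        key: value for key, value in kwargs.items() if not key.startswith(PREFIX)
    }
    return overrides, passthrough


overrides, passthrough = split_headroom_kwargs(
    {"headroom_mode": "optimize", "headroom_keep_turns": 5,
     "temperature": 0.7, "stream": False}
)
print(overrides)
print(passthrough)
```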

messages.create() (Anthropic style)

When wrapping an Anthropic client, use client.messages.create() instead:
from headroom import HeadroomClient, AnthropicProvider
from anthropic import Anthropic

client = HeadroomClient(
    original_client=Anthropic(),
    provider=AnthropicProvider(),
    default_mode="optimize",
)

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this report: ..."}],
    headroom_mode="optimize",
)
The messages.create() and messages.stream() methods accept the same headroom_* parameters as chat.completions.create(), plus max_tokens.

chat.completions.simulate()

Run the full compression pipeline and return a SimulationResult without making an API call. Use this to preview token savings before enabling optimize mode.
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=large_messages,
)

print(f"Tokens before: {plan.tokens_before}")
print(f"Tokens after:  {plan.tokens_after}")
print(f"Tokens saved:  {plan.tokens_saved}")
print(f"Est. savings:  {plan.estimated_savings}")
print(f"Transforms:    {plan.transforms}")
model
str
required
Model to simulate compression for.
messages
list[dict[str, Any]]
required
Messages to simulate compressing.
headroom_mode
str
default:"optimize"
Which mode to simulate. Defaults to "optimize".
headroom_output_buffer_tokens
int | None
default:"None"
Output buffer to use in the simulation.
headroom_tool_profiles
dict[str, dict[str, Any]] | None
default:"None"
Per-tool profiles to apply during simulation.

SimulationResult fields

tokens_before
int
Token count of the input messages.
tokens_after
int
Projected token count after compression.
tokens_saved
int
tokens_before - tokens_after.
transforms
list[str]
Transforms that would be applied.
estimated_savings
str
Human-readable cost estimate, e.g. "$0.0042 per request".
messages_optimized
list[dict[str, Any]]
The projected compressed messages (useful for inspection).
block_breakdown
dict[str, int]
Token breakdown by block type (system, user, assistant, tool_result, etc.).
waste_signals
dict[str, int]
Detected waste by category (json_bloat, html_noise, whitespace, repetition, etc.).
stable_prefix_hash
str
16-character hash of the stable cache prefix after optimization.
cache_alignment_score
float
Score between 0 and 1 indicating how cache-friendly the prefix is.
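
One way to use these fields is to gate the switch to optimize mode on the projected savings. The sketch below uses a stand-in dataclass carrying only tokens_before and tokens_after so it runs without Headroom installed. SimulationSummary, savings_ratio, worth_optimizing, and the 20% threshold are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimulationSummary:
    """Stand-in with two of the documented SimulationResult fields."""
    tokens_before: int
    tokens_after: int


def savings_ratio(plan: SimulationSummary) -> float:
    """Fraction of input tokens the simulation projects to remove."""
    if plan.tokens_before <= 0:
        return 0.0  # avoid dividing by zero on empty input
    return (plan.tokens_before - plan.tokens_after) / plan.tokens_before


def worth_optimizing(plan: SimulationSummary, min_ratio: float = 0.2) -> bool:
    """True when projected savings meet the (arbitrary) threshold."""
    return savings_ratio(plan) >= min_ratio


# Made-up numbers, not measured results.
plan = SimulationSummary(tokens_before=12000, tokens_after=7500)
print(f"Projected savings: {savings_ratio(plan):.1%}")
print("Enable optimize mode:", worth_optimizing(plan))
```

The same functions should work on a real SimulationResult, since they read only the tokens_before and tokens_after attributes.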

get_stats()

Return in-memory session statistics without querying the database. This is the fast path — no I/O.
stats = client.get_stats()

print(stats["session"]["requests_total"])       # int
print(stats["session"]["tokens_saved_total"])   # int
print(stats["session"]["cache_hits"])           # int
print(stats["config"]["mode"])                  # "audit" | "optimize"
print(stats["config"]["provider"])              # "openai" | "anthropic" | ...
print(stats["transforms"]["smart_crusher_enabled"])  # bool
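session
dict
Per-session counters, including requests_total, tokens_saved_total, and cache_hits.
config
dict
Active configuration, including mode and provider.
transforms
dict
Transform settings, including smart_crusher_enabled.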
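
Because get_stats() returns a plain dict, derived figures are easy to compute. The sketch below calculates average tokens saved per request from the documented session keys. The sample dict holds made-up values, and average_tokens_saved is a hypothetical helper.

```python
def average_tokens_saved(stats: dict) -> float:
    """Average tokens saved per request, or 0.0 before any request is made."""
    session = stats["session"]
    requests = session["requests_total"]
    if requests == 0:
        return 0.0
    return session["tokens_saved_total"] / requests


# Made-up sample shaped like the keys shown in the example above.
sample_stats = {
    "session": {"requests_total": 4, "tokens_saved_total": 1000, "cache_hits": 1},
    "config": {"mode": "optimize", "provider": "openai"},
}
print(average_tokens_saved(sample_stats))
```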

validate_setup()

Run a health check against all configured subsystems. Call this during startup to catch misconfiguration early.
result = client.validate_setup()

if not result["valid"]:
    print("Setup issues:", result)
else:
    print("All systems operational")
Returns a dict with a top-level valid boolean and per-subsystem status:
valid
bool
True only when all subsystems pass. False if any required subsystem fails.
provider
dict
{"ok": bool, "name": str | None, "error": str | None} — provider token-counting test.
storage
dict
{"ok": bool, "url": str, "error": str | None} — read access to the metrics database.
config
dict
{"ok": bool, "mode": str, "error": str | None} — configuration validity check.
cache_optimizer
dict
{"ok": bool, "name": str | None, "error": str | None} — cache optimizer status. A failure here does not set valid=False (cache issues are non-fatal).
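
The result dict can be turned into a startup report that separates fatal failures from warnings. This sketch assumes only the shape described above (per-subsystem dicts with an ok flag). The helper failing_subsystems and the sample values are hypothetical.

```python
# Per the description above, a cache optimizer failure is non-fatal.
NON_FATAL = {"cache_optimizer"}


def failing_subsystems(result: dict) -> tuple[list[str], list[str]]:
    """Return (fatal, warnings): names of subsystems whose "ok" flag is False."""
    fatal: list[str] = []
    warnings: list[str] = []
    for name, status in result.items():
        # Skip the top-level "valid" boolean and any subsystem that passed.
        if not isinstance(status, dict) or status.get("ok", True):
            continue
        (warnings if name in NON_FATAL else fatal).append(name)
    return fatal, warnings


# Made-up sample result for illustration.
sample = {
    "valid": False,
    "provider": {"ok": True, "name": "openai", "error": None},
    "storage": {"ok": False, "url": "sqlite:///path/to/headroom.db", "error": "unreadable"},
    "config": {"ok": True, "mode": "optimize", "error": None},
    "cache_optimizer": {"ok": False, "name": None, "error": "not detected"},
}
fatal, warnings = failing_subsystems(sample)
print("fatal:", fatal)
print("warnings:", warnings)
```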

get_metrics() and get_summary()

For historical analysis, query the metrics database directly:
from datetime import datetime, timedelta

# Per-request metrics (last hour, up to 100 records)
metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
    model="gpt-4o",      # optional filter
    mode="optimize",     # optional filter
)

# Aggregate summary
summary = client.get_summary(
    start_time=datetime.utcnow() - timedelta(days=1),
)
get_metrics() returns list[RequestMetrics], and get_summary() returns a summary dict. See RequestMetrics for the full field listing.

Providers

Providers supply token counting, context limits, and cost estimation. Import them from headroom:
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)
OpenAIProvider and AnthropicProvider both implement the Provider protocol, which exposes:
  • provider.get_token_counter(model) — returns a TokenCounter for the given model
  • provider.get_context_limit(model) — returns the context window size in tokens
  • provider.estimate_cost(input_tokens, output_tokens, model) — returns cost in USD or None
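
The three methods listed above can be sketched as a typing.Protocol with a toy implementation. Everything here is an assumption for illustration: ProviderLike and ToyProvider are hypothetical names, the token counter is modeled as a plain callable because the real TokenCounter interface is not described on this page, and the limits and prices are made up. Whether HeadroomClient accepts a custom provider like this is not confirmed by this page.

```python
from typing import Callable, Optional, Protocol


class ProviderLike(Protocol):
    """Hypothetical protocol mirroring the three documented method names."""

    def get_token_counter(self, model: str) -> Callable[[str], int]: ...
    def get_context_limit(self, model: str) -> int: ...
    def estimate_cost(
        self, input_tokens: int, output_tokens: int, model: str
    ) -> Optional[float]: ...


class ToyProvider:
    """Toy provider with made-up limits and prices, for illustration only."""

    _LIMITS = {"toy-model": 8000}
    # Hypothetical USD prices per 1K (input, output) tokens.
    _PRICE_PER_1K = {"toy-model": (0.5, 1.5)}

    def get_token_counter(self, model: str) -> Callable[[str], int]:
        # Whitespace word count; a stand-in, not a real tokenizer.
        return lambda text: len(text.split())

    def get_context_limit(self, model: str) -> int:
        return self._LIMITS[model]

    def estimate_cost(
        self, input_tokens: int, output_tokens: int, model: str
    ) -> Optional[float]:
        prices = self._PRICE_PER_1K.get(model)
        if prices is None:
            return None  # unknown model: no cost estimate available
        in_price, out_price = prices
        return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price


provider: ProviderLike = ToyProvider()
print(provider.get_token_counter("toy-model")("count these four words"))
print(provider.estimate_cost(2000, 1000, "toy-model"))
```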

HeadroomMode

HeadroomMode is a string enum that controls how the pipeline processes each request.
from headroom import HeadroomMode

HeadroomMode.AUDIT     # "audit"    — observe only, never modify messages
HeadroomMode.OPTIMIZE  # "optimize" — run full compression pipeline
HeadroomMode.SIMULATE  # "simulate" — plan without API call (via .simulate())
Pass the string value directly to the constructor or per-request override:
client = HeadroomClient(..., default_mode="optimize")

# Or override per request:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    headroom_mode="audit",
)
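
A string enum compares equal to its plain string value, which is why either form can be passed. The sketch below uses a stand-in Mode enum with the documented values and a hypothetical effective_mode helper showing how a per-request override could take precedence over the client default. It illustrates the behavior described above rather than Headroom's actual implementation.

```python
from enum import Enum
from typing import Optional


class Mode(str, Enum):
    """Stand-in mirroring the documented HeadroomMode values."""
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def effective_mode(default_mode: str, headroom_mode: Optional[str] = None) -> Mode:
    """Per-request override wins; otherwise fall back to the client default.

    Raises ValueError for a string that is not a known mode.
    """
    chosen = headroom_mode if headroom_mode is not None else default_mode
    return Mode(chosen)


print(effective_mode("audit"))
print(effective_mode("audit", headroom_mode="optimize"))
print(Mode.AUDIT == "audit")
```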

with_memory()

with_memory() is an optional function that enables hierarchical memory for multi-turn or multi-agent workflows. It requires the memory extra: pip install headroom-ai[memory].
from headroom import with_memory, HeadroomClient, OpenAIProvider
from openai import OpenAI

base_client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

# Attach user-scoped memory
memory_client = with_memory(base_client, user_id="user_123")

# Now requests automatically inject relevant memories from previous sessions
response = memory_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What did we discuss last week?"}],
)
with_memory is None when the memory extra is not installed. Always check if with_memory is not None before calling it in environments where the extra may be absent.
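
The None check can be wrapped in a small helper so call sites stay uniform. In this sketch, attach_memory is hypothetical and takes with_memory as an argument, so it can be exercised without the extra installed. In real code the value would come from the headroom import shown above.

```python
from typing import Any, Callable, Optional


def attach_memory(
    client: Any,
    user_id: str,
    with_memory: Optional[Callable[..., Any]],
) -> Any:
    """Return a memory-enabled client when possible, else the client unchanged."""
    if with_memory is None:
        # Memory extra not installed: fall back to the plain client.
        return client
    return with_memory(client, user_id=user_id)


def fake_with_memory(client: Any, user_id: str) -> tuple:
    """Stand-in for the real with_memory, used only to exercise the helper."""
    return ("memory", client, user_id)


print(attach_memory("base-client", "user_123", None))
print(attach_memory("base-client", "user_123", fake_with_memory))
```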

Context Manager

HeadroomClient supports the context manager protocol to automatically close the storage connection:
with HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
) as client:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
# Storage connection closed automatically

Error Handling

from headroom import (
    HeadroomClient, OpenAIProvider,
    ConfigurationError, ProviderError, CompressionError,
)
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[...],
    )
except ConfigurationError as e:
    print(f"Config issue: {e.details}")
except ProviderError as e:
    print(f"Provider issue (unknown model?): {e}")
except CompressionError as e:
    print(f"Compression failed: {e}")
All Headroom exceptions extend HeadroomError and include a details dict with additional context.
