HeadroomClient is the full-featured integration path for production applications. It wraps your existing LLM client and intercepts every request, applying compression, cache alignment, and provider-specific cache optimization transparently. The wrapped client is API-compatible with the original — no call-site changes needed.
Constructor
The underlying LLM client to wrap. Accepts any OpenAI-compatible client (e.g. openai.OpenAI(), openai.AsyncOpenAI()) or Anthropic client (anthropic.Anthropic()). The wrapped client is available as client._original.
Provider instance for token counting, context limits, and cost estimation. Use OpenAIProvider() for OpenAI-compatible APIs, AnthropicProvider() for Anthropic. See Providers below.
Storage URL for the metrics database. Supports sqlite:///path/to/headroom.db and jsonl:///path/to/metrics.jsonl. Defaults to a SQLite file in the system temp directory.
Default operating mode for all requests. One of "audit" (observe only, no modifications), "optimize" (apply full compression pipeline). Can be overridden per request via headroom_mode=.
Override the context window size for specific models. Keys are model name strings; values are token counts. The provider’s built-in limits are used for any model not listed here.
Supply a custom cache optimizer instance. When None and enable_cache_optimizer=True, the optimizer is auto-detected from the provider (e.g. AnthropicCacheOptimizer for Anthropic, OpenAICacheOptimizer for OpenAI).
Enable provider-specific cache optimization. When enabled, the client inserts cache_control breakpoints (Anthropic) or stabilizes the prefix (OpenAI) to maximize provider-side KV-cache hit rates.
Enable query-level semantic caching. When a new request is semantically similar to a recent one (above the configured similarity threshold), the cached response is returned immediately without calling the API.
Full HeadroomConfig object. When provided, takes precedence over store_url, default_mode, and other individual parameters. Use this for fine-grained control over all subsystems.
chat.completions.create()
Sends a chat completion request with optional Headroom optimization. Accepts all standard OpenAI parameters plus Headroom-specific overrides prefixed with headroom_.
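One way to picture the split between Headroom overrides and pass-through arguments is a small helper that partitions keyword arguments by the headroom_ prefix. This is an illustrative sketch only, not Headroom's actual implementation; the function name split_headroom_kwargs is hypothetical.

```python
from typing import Any, Dict, Tuple


def split_headroom_kwargs(kwargs: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition kwargs into Headroom overrides and pass-through arguments.

    Hypothetical helper for illustration; Headroom's internals may differ.
    """
    prefix = "headroom_"
    # Overrides are keyed by the name that follows the prefix.
    overrides = {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}
    # Everything else would be handed to the underlying client unchanged.
    passthrough = {k: v for k, v in kwargs.items() if not k.startswith(prefix)}
    return overrides, passthrough


overrides, passthrough = split_headroom_kwargs(
    {"model": "gpt-4o", "temperature": 0.2, "headroom_mode": "audit"}
)
print(overrides)
print(passthrough)
```

In this sketch the underlying client never sees a headroom_-prefixed key, which is consistent with the wrapped client staying API-compatible with the original.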
Model name for the completion.
Message list in OpenAI format.
Stream the response. Headroom compresses before sending; streaming behavior is identical to the underlying client.
Override the client’s default_mode for this single request. Values: "audit", "optimize".
Target number of tokens to keep in the stable, cache-aligned prefix. Passed to the cache optimizer.
Number of tokens to reserve for the model’s output when sizing the input context. Overrides HeadroomConfig.output_buffer_tokens (default 4000).
Never compress the last N turns of the conversation. Overrides the pipeline’s default protect_recent setting.
Per-tool compression overrides. Keys are tool names; values are profile dicts. For example: {"important_tool": {"skip_compression": True}}.
All remaining keyword arguments are forwarded verbatim to the underlying client (e.g. temperature, max_tokens, tools, tool_choice).
messages.create() (Anthropic style)
When wrapping an Anthropic client, use client.messages.create() instead.
The messages.create() and messages.stream() methods accept the same headroom_* parameters as chat.completions.create(), plus max_tokens.
chat.completions.simulate()
Run the full compression pipeline and return a SimulationResult without making an API call. Use this to preview token savings before enabling optimize mode.
Model to simulate compression for.
Messages to simulate compressing.
Which mode to simulate. Defaults to "optimize" unless overridden.
Output buffer to use in the simulation.
Per-tool profiles to apply during simulation.
SimulationResult fields
Token count of the input messages.
Projected token count after compression.
tokens_before - tokens_after.
Transforms that would be applied.
Human-readable cost estimate, e.g. "$0.0042 per request".
The projected compressed messages (useful for inspection).
Token breakdown by block type (system, user, assistant, tool_result, etc.).
Detected waste by category (json_bloat, html_noise, whitespace, repetition, etc.).
16-character hash of the stable cache prefix after optimization.
Score between 0 and 1 indicating how cache-friendly the prefix is.
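The relationship between the token fields can be made concrete with a stand-in record. The sketch below relies only on the tokens_before and tokens_after names used above; the SimulatedTokens class, its tokens_saved property, and the savings_ratio helper are hypothetical and not part of Headroom.

```python
from dataclasses import dataclass


@dataclass
class SimulatedTokens:
    """Stand-in for the two token counts on a simulation result (hypothetical)."""

    tokens_before: int
    tokens_after: int

    @property
    def tokens_saved(self) -> int:
        # Saved tokens are defined as tokens_before - tokens_after.
        return self.tokens_before - self.tokens_after

    def savings_ratio(self) -> float:
        """Fraction of input tokens removed; 0.0 for an empty input."""
        if self.tokens_before == 0:
            return 0.0
        return self.tokens_saved / self.tokens_before


result = SimulatedTokens(tokens_before=12_000, tokens_after=4_500)
print(f"{result.tokens_saved} tokens saved ({result.savings_ratio():.1%})")
```

A ratio like this may be a convenient threshold when deciding whether switching a workload from audit to optimize is worthwhile.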
get_stats()
Return in-memory session statistics without querying the database. This is the fast path, with no I/O.
validate_setup()
Run a health check against all configured subsystems. Call this during startup to catch misconfiguration early.
The result includes a valid boolean and per-subsystem status:
True only when all subsystems pass. False if any required subsystem fails.
{"ok": bool, "name": str | None, "error": str | None}: provider token-counting test.
{"ok": bool, "url": str, "error": str | None}: read access to the metrics database.
{"ok": bool, "mode": str, "error": str | None}: configuration validity check.
{"ok": bool, "name": str | None, "error": str | None}: cache optimizer status. A failure here does not set valid=False (cache issues are non-fatal).
get_metrics() and get_summary()
For historical analysis, query the metrics database directly.
get_metrics() returns list[RequestMetrics] and get_summary() returns a summary dict. See RequestMetrics for the full field listing.
Providers
Providers supply token counting, context limits, and cost estimation. Import them from headroom.
Each provider implements the Provider protocol, which exposes:
provider.get_token_counter(model): returns a TokenCounter for the given model
provider.get_context_limit(model): returns the context window size in tokens
provider.estimate_cost(input_tokens, output_tokens, model): returns cost in USD or None
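The three methods above can be sketched as a structural type. The code below is a rough illustration under stated assumptions: ProviderLike, ToyProvider, and fits_in_context are hypothetical names, the token counter is modelled as a plain function from text to a count (the real TokenCounter interface is not described here), and the context limit is a made-up value.

```python
from typing import Callable, Optional, Protocol


class ProviderLike(Protocol):
    """Structural sketch of the three documented provider methods."""

    def get_token_counter(self, model: str) -> Callable[[str], int]: ...

    def get_context_limit(self, model: str) -> int: ...

    def estimate_cost(
        self, input_tokens: int, output_tokens: int, model: str
    ) -> Optional[float]: ...


class ToyProvider:
    """Hypothetical provider with a made-up limit and no pricing data."""

    def get_token_counter(self, model: str) -> Callable[[str], int]:
        # Assumption: whitespace splitting stands in for a real tokenizer.
        return lambda text: len(text.split())

    def get_context_limit(self, model: str) -> int:
        return 8_192  # made-up value for illustration

    def estimate_cost(
        self, input_tokens: int, output_tokens: int, model: str
    ) -> Optional[float]:
        return None  # None signals that no cost estimate is available


def fits_in_context(provider: ProviderLike, model: str, text: str, output_buffer: int) -> bool:
    """Check whether text plus a reserved output buffer fits the model's window."""
    count = provider.get_token_counter(model)
    return count(text) + output_buffer <= provider.get_context_limit(model)


provider = ToyProvider()
print(fits_in_context(provider, "toy-model", "hello world", output_buffer=4000))
```

Because Protocol uses structural typing, ToyProvider satisfies ProviderLike without inheriting from it; whether Headroom's own Provider protocol is checked the same way is not stated here.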
HeadroomMode
HeadroomMode is a string enum that controls how the pipeline processes each request.
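Because the modes form a string enum, members and their plain string values should be interchangeable in comparisons. The stand-in below illustrates that behavior with the two mode values documented on this page; the class name Mode and the member names are assumptions, and the real HeadroomMode may define additional members.

```python
from enum import Enum


class Mode(str, Enum):
    """Stand-in string enum (not HeadroomMode itself)."""

    AUDIT = "audit"
    OPTIMIZE = "optimize"


# Members compare equal to their plain string values.
matches_string = Mode.AUDIT == "audit"

# A plain string can be converted back into the corresponding member.
parsed = Mode("optimize")

print(matches_string, parsed is Mode.OPTIMIZE, parsed.value)
```

This is why a keyword such as headroom_mode="audit" and an enum member can plausibly be used in the same place.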
with_memory()
with_memory() is an optional function that enables hierarchical memory for multi-turn or multi-agent workflows. It requires the memory extra: pip install headroom-ai[memory].
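A defensive import might look like the sketch below. It assumes with_memory is importable from the top-level headroom package (the import path is an assumption), and it also covers the case where headroom itself is absent. The memory_available helper is hypothetical.

```python
try:
    # Assumed import path; the name may be None when the memory extra is missing.
    from headroom import with_memory
except ImportError:
    # headroom itself is not installed in this environment.
    with_memory = None


def memory_available() -> bool:
    """Report whether the optional memory integration can be used."""
    return with_memory is not None


if memory_available():
    print("memory extra detected")
else:
    print("memory extra not installed; continuing without memory")
```

Checking once at startup keeps the rest of the application free of repeated None checks.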
with_memory is None when the memory extra is not installed. Always check if with_memory is not None before calling it in environments where the extra may be absent.
Context Manager
HeadroomClient supports the context manager protocol to automatically close the storage connection:
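The mechanism is Python's standard context manager protocol. The stand-in below is not HeadroomClient; StoreOwner is a hypothetical class that owns an in-memory SQLite connection and closes it on exit, to show what the with block guarantees.

```python
import sqlite3


class StoreOwner:
    """Hypothetical stand-in that owns a storage connection."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.closed = False

    def close(self) -> None:
        # Idempotent: closing twice is harmless.
        if not self.closed:
            self.conn.close()
            self.closed = True

    def __enter__(self) -> "StoreOwner":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and when the block raises, so the connection does not leak.
        self.close()


with StoreOwner() as owner:
    owner.conn.execute("SELECT 1")

print(owner.closed)
```

The same shape applies when the object inside the with statement is a client that holds a metrics store.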
Error Handling
All Headroom-specific exceptions derive from HeadroomError and include a details dict with additional context.
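When logging failures, the details dict can be folded into the message. The helper below is a sketch that works with any exception exposing a details attribute; describe_error and DemoError are hypothetical names, and DemoError merely stands in for a Headroom exception so the example runs without the library.

```python
import logging
from typing import Optional

logger = logging.getLogger("app.headroom")


def describe_error(exc: Exception) -> str:
    """Format an exception together with its optional details dict."""
    details = getattr(exc, "details", None) or {}
    extras = ", ".join(f"{key}={value!r}" for key, value in sorted(details.items()))
    return f"{type(exc).__name__}: {exc}" + (f" [{extras}]" if extras else "")


class DemoError(Exception):
    """Stand-in exception carrying a details dict (not a Headroom class)."""

    def __init__(self, message: str, details: Optional[dict] = None) -> None:
        super().__init__(message)
        self.details = details or {}


try:
    raise DemoError("storage unavailable", {"url": "sqlite:///tmp/headroom.db"})
except DemoError as exc:
    logger.warning(describe_error(exc))
```

In application code the except clause would name HeadroomError (or one of its subclasses) instead of the stand-in.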