Headroom’s configuration layer is built from composable Python dataclasses. The top-level HeadroomConfig aggregates sub-configs for every subsystem — compression, cache alignment, cache optimization, CCR, and prefix freezing. All fields have sensible defaults; you only need to specify what you want to change.
from headroom import HeadroomClient, HeadroomConfig, SmartCrusherConfig, OpenAIProvider
from openai import OpenAI

config = HeadroomConfig(
    default_mode="optimize",
    smart_crusher=SmartCrusherConfig(max_items_after_crush=20),
    output_buffer_tokens=2000,
)

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    config=config,
)

HeadroomMode

HeadroomMode is a string enum that controls how the pipeline processes requests. It extends str so string literals work wherever the enum is expected.
from headroom import HeadroomMode

HeadroomMode.AUDIT     # "audit"    — observe only, never modify messages
HeadroomMode.OPTIMIZE  # "optimize" — apply the full compression pipeline
HeadroomMode.SIMULATE  # "simulate" — dry-run; used by .simulate()
AUDIT
"audit"
Pass-through mode. Messages are analyzed for waste signals and metrics are recorded, but nothing is modified. Use this to measure savings before enabling compression.
OPTIMIZE
"optimize"
Full pipeline mode. Messages are compressed by SmartCrusher (JSON), Kompress (text), CacheAligner (prefix), and the provider cache optimizer before being sent.
SIMULATE
"simulate"
Dry-run mode. The pipeline runs completely but no API call is made. Used internally by client.chat.completions.simulate() and client.messages.simulate().
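Because HeadroomMode extends str, its members compare equal to their string values. The sketch below reproduces that behavior with a stand-in enum so it runs without the library installed. `Mode` is a hypothetical name used only for illustration; it is not imported from headroom.

```python
from enum import Enum


class Mode(str, Enum):
    """Hypothetical stand-in that mirrors the shape of HeadroomMode."""

    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


# A str-based enum member compares equal to its plain string value,
# which is why string literals can be passed where the enum is expected.
literal_matches = Mode.OPTIMIZE == "optimize"

# Looking a member up by value returns the existing member.
parsed = Mode("audit")
```

In practice this means `default_mode="optimize"` and `default_mode=HeadroomMode.OPTIMIZE` should be interchangeable, as the examples on this page suggest.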

HeadroomConfig

Top-level configuration for HeadroomClient. All fields are optional with production-ready defaults.
from headroom import HeadroomConfig, HeadroomMode

config = HeadroomConfig(
    store_url="sqlite:///myapp.db",
    default_mode=HeadroomMode.OPTIMIZE,
    output_buffer_tokens=2000,
)
store_url
str
default:"\"sqlite:///headroom.db\""
Storage URL for the metrics database. Supports sqlite:///path and jsonl:///path. A store_url passed directly to HeadroomClient(store_url=...) overrides this field.
default_mode
HeadroomMode
default:"HeadroomMode.AUDIT"
Default operating mode for all requests. Use HeadroomMode.OPTIMIZE for production workloads.
model_context_limits
dict[str, int]
default:"{}"
User-supplied overrides for model context windows (in tokens). Takes precedence over the provider’s built-in limits. Prefix matching is supported — "gpt-4" matches "gpt-4-turbo".
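The prefix matching described above can be pictured with a small lookup function. This is a sketch only: `resolve_context_limit` is a hypothetical helper, the limits are illustrative placeholder numbers, and the "longest matching prefix wins" tie-break is an assumption that this page does not confirm.

```python
def resolve_context_limit(model, overrides, provider_default):
    """Return the override whose key is a prefix of `model`, if any."""
    matches = [key for key in overrides if model.startswith(key)]
    if not matches:
        # No override applies, so fall back to the provider's built-in limit.
        return provider_default
    # Assumption: the most specific (longest) prefix takes precedence.
    return overrides[max(matches, key=len)]


# Illustrative numbers only, not authoritative model limits.
overrides = {"gpt-4": 8192, "gpt-4-turbo": 128000}

turbo_limit = resolve_context_limit("gpt-4-turbo-preview", overrides, 4096)
base_limit = resolve_context_limit("gpt-4-0613", overrides, 4096)
other_limit = resolve_context_limit("claude-3-opus", overrides, 4096)
```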
smart_crusher
SmartCrusherConfig
default:"SmartCrusherConfig()"
Configuration for the JSON/array compressor. See SmartCrusherConfig.
cache_aligner
CacheAlignerConfig
default:"CacheAlignerConfig()"
Configuration for the cache-prefix stability detector. See CacheAlignerConfig.
cache_optimizer
CacheOptimizerConfig
default:"CacheOptimizerConfig()"
Configuration for provider-specific cache optimization (breakpoints, prefix stabilization). See CacheOptimizerConfig.
ccr
CCRConfig
default:"CCRConfig()"
Configuration for Compress-Cache-Retrieve — reversible compression with hash-based retrieval. Enabled by default.
prefix_freeze
PrefixFreezeConfig
default:"PrefixFreezeConfig()"
Configuration for cache-aware prefix freezing. Prevents the pipeline from invalidating already-cached prefixes.
output_buffer_tokens
int
default:"4000"
Tokens reserved for the model’s output when computing how much of the input context can be compressed. Increase this for models with long outputs (e.g. code generation).
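As a rough illustration of the reservation, the input budget can be thought of as the context window minus the output buffer. How Headroom applies the reservation internally is not described here, so the plain subtraction and the `input_budget` helper below are assumptions.

```python
def input_budget(context_limit, output_buffer_tokens=4000):
    """Tokens left for input once the output reservation is subtracted."""
    # Assumption: the reservation is a simple subtraction, floored at zero.
    return max(context_limit - output_buffer_tokens, 0)


# With the default reservation of 4000 tokens.
default_budget = input_budget(128_000)

# A larger reservation, e.g. for long code-generation outputs.
codegen_budget = input_budget(128_000, output_buffer_tokens=16_000)
```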
intercept_tool_results
bool
default:"false"
Enables tool-result interceptors (e.g. ast-grep Read outline). Opt-in; can also be enabled by setting the environment variable HEADROOM_INTERCEPT_ENABLED=1.
generate_diff_artifact
bool
default:"false"
When True, each TransformResult includes a DiffArtifact with per-transform token deltas. Useful for debugging which transform caused the most savings.
pipeline_extensions
list[Any]
default:"[]"
List of PipelineExtension instances to attach to the canonical pipeline lifecycle. Extensions receive PipelineEvent objects at each stage.
discover_pipeline_extensions
bool
default:"true"
When True, Headroom discovers and loads PipelineExtension implementations registered under the headroom.pipeline_extension entry-point group.

SmartCrusherConfig

Controls the statistical JSON and array compressor. SmartCrusher is the primary tool for reducing large tool outputs — it preserves errors, anomalies, and query-relevant items while dropping redundant entries.
from headroom import SmartCrusherConfig

config = SmartCrusherConfig(
    min_tokens_to_crush=200,
    max_items_after_crush=20,
    variance_threshold=2.0,
    preserve_change_points=True,
)
enabled
bool
default:"true"
Enable or disable SmartCrusher. When False, all JSON/array tool outputs pass through unmodified.
min_items_to_analyze
int
default:"5"
Minimum array length before statistical analysis runs. Arrays shorter than this are left unchanged.
min_tokens_to_crush
int
default:"200"
Only compress a tool output if it exceeds this many tokens. Prevents unnecessary analysis on small payloads.
variance_threshold
float
default:"2.0"
Standard deviations above the mean required to flag a numeric value as an anomaly. Lower values catch more anomalies.
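To make the effect of variance_threshold concrete, here is a minimal z-score check. It is not SmartCrusher's implementation: `flag_anomalies` is a hypothetical helper, it uses the population standard deviation, and it treats deviations in either direction as anomalous, all of which are assumptions.

```python
from statistics import mean, pstdev


def flag_anomalies(values, variance_threshold=2.0):
    """Return values lying more than `variance_threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # A constant series has no outliers.
    return [v for v in values if abs(v - mu) / sigma > variance_threshold]


latencies_ms = [10, 11, 9, 10, 12, 10, 50]

# 50 sits roughly 2.4 standard deviations from the mean of 16.
default_flags = flag_anomalies(latencies_ms)

# A higher threshold is more selective and no longer flags it.
strict_flags = flag_anomalies(latencies_ms, variance_threshold=3.0)
```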
uniqueness_threshold
float
default:"0.1"
Fraction of unique values below which an array is considered “nearly constant”. Nearly-constant arrays use stricter deduplication.
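The "nearly constant" test can be sketched as a ratio of distinct values to total values. The helper name `is_nearly_constant` is hypothetical, and the sketch assumes hashable values and a strict "below" comparison.

```python
def is_nearly_constant(values, uniqueness_threshold=0.1):
    """True when the fraction of distinct values is below the threshold."""
    if not values:
        return False
    unique_fraction = len(set(values)) / len(values)
    return unique_fraction < uniqueness_threshold


# 2 distinct values out of 30 items: fraction is about 0.067.
statuses = ["ok"] * 29 + ["failed"]

# Every value is distinct: fraction is 1.0.
user_ids = list(range(30))
```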
similarity_threshold
float
default:"0.8"
String similarity threshold for clustering similar items. Items above this similarity may be grouped and represented by a single representative.
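One way to picture similarity-based grouping is a greedy pass that keeps an item only if it is not too similar to a representative already kept. The similarity metric Headroom uses is not specified on this page; the sketch below substitutes `difflib.SequenceMatcher` from the standard library, and `cluster_representatives` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def cluster_representatives(items, similarity_threshold=0.8):
    """Greedily keep one representative per group of similar strings."""
    representatives = []
    for item in items:
        is_duplicate = any(
            SequenceMatcher(None, item, rep).ratio() > similarity_threshold
            for rep in representatives
        )
        if not is_duplicate:
            representatives.append(item)
    return representatives


logs = [
    "Connection timeout on host a1",
    "Connection timeout on host a2",  # Differs from the first by one character.
    "Disk full",
]
kept = cluster_representatives(logs)
```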
max_items_after_crush
int
default:"15"
Target maximum number of items to keep after compression. The adaptive Kneedle algorithm may keep fewer when information saturation is detected earlier.
preserve_change_points
bool
default:"true"
Keep items at significant data transitions (detected with a fixed 5-item window). Useful for time-series data where inflection points carry information.
factor_out_constants
bool
default:"false"
Disabled, because factoring out constants would modify the original JSON schema. Kept for forward compatibility.
include_summaries
bool
default:"false"
Disabled — no AI-generated summary text is inserted. All output items come from the original array.
use_feedback_hints
bool
default:"true"
Use TOIN (Tool Output Intelligence Network) learned patterns to bias compression toward preserving historically-retrieved items.
toin_confidence_threshold
float
default:"0.3"
Minimum TOIN confidence score for a hint to influence compression.
dedup_identical_items
bool
default:"true"
Prevent multiple preservation mechanisms from keeping duplicate copies of identical items.
first_fraction
float
default:"0.3"
Fraction of max_items_after_crush reserved for items at the start of the array.
last_fraction
float
default:"0.15"
Fraction of max_items_after_crush reserved for items at the end of the array.
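The two fractions translate into item counts at the head and tail of the array. The sketch below shows the arithmetic with the documented defaults; rounding down is an assumption, and `positional_slots` is a hypothetical helper rather than part of the library.

```python
import math


def positional_slots(max_items_after_crush=15, first_fraction=0.3, last_fraction=0.15):
    """Split the item budget into (first, last, remaining) slot counts."""
    # Assumption: fractional slots are rounded down.
    first = math.floor(max_items_after_crush * first_fraction)
    last = math.floor(max_items_after_crush * last_fraction)
    remaining = max_items_after_crush - first - last
    return first, last, remaining


# With the defaults: 15 * 0.3 = 4.5 and 15 * 0.15 = 2.25, rounded down.
slots = positional_slots()
```

Under these assumptions the defaults reserve 4 items at the start and 2 at the end, leaving 9 slots for items selected by other criteria.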
lossless_min_savings_ratio
float
default:"0.15"
Minimum byte-savings ratio for the lossless compaction path (CSV/JSON/markdown-kv) to be chosen over the lossy row-drop path. Must stay in lockstep with the Rust core default.
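The path selection can be read as a comparison of the byte-savings ratio against the threshold. The following sketch assumes the ratio is computed as 1 minus compacted size over original size and that the comparison is inclusive; neither detail is confirmed here, and `choose_compaction_path` is a hypothetical name.

```python
def choose_compaction_path(original_bytes, lossless_bytes, lossless_min_savings_ratio=0.15):
    """Pick 'lossless' when compaction saves at least the minimum ratio."""
    if original_bytes <= 0:
        return "lossy"  # Nothing to measure against; treated as no savings.
    savings_ratio = 1 - lossless_bytes / original_bytes
    if savings_ratio >= lossless_min_savings_ratio:
        return "lossless"
    return "lossy"


# 20% savings clears the default 15% bar.
good_compaction = choose_compaction_path(10_000, 8_000)

# 5% savings does not, so the lossy row-drop path would be used.
weak_compaction = choose_compaction_path(10_000, 9_500)
```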
lossless_only
bool
default:"false"
When True, lossless tabular compaction still runs but any path that would produce a CCR marker is skipped. Output is always marker-free and byte-recoverable.
relevance
RelevanceScorerConfig
default:"RelevanceScorerConfig()"
Configuration for the relevance scorer that determines which items match the user’s query. See RelevanceScorerConfig.
anchor
AnchorConfig
default:"AnchorConfig()"
Configuration for dynamic anchor allocation — controls how position-based preservation slots are distributed (front-heavy for search results, back-heavy for logs, balanced for time-series). The anchor budget is a percentage of max_items_after_crush reserved for positional anchors; the rest goes to importance-scored items.
compaction_core_field_fraction
float
default:"0.8"
A field is considered “core” if it is present in at least this fraction of rows. Arrays with mostly non-core key sets are bucketed by a discriminator field rather than flattened.
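The "core field" rule can be illustrated by counting how often each key appears across rows. This is a sketch under the assumption that presence is measured per top-level key; `core_fields` and the sample rows are hypothetical.

```python
from collections import Counter


def core_fields(rows, compaction_core_field_fraction=0.8):
    """Return keys present in at least the given fraction of rows."""
    if not rows:
        return set()
    presence = Counter(key for row in rows for key in row)
    return {
        key
        for key, count in presence.items()
        if count / len(rows) >= compaction_core_field_fraction
    }


rows = [
    {"id": 1, "status": "ok", "latency_ms": 12},
    {"id": 2, "status": "ok", "latency_ms": 15},
    {"id": 3, "status": "ok", "latency_ms": 11},
    {"id": 4, "status": "ok", "latency_ms": 14},
    {"id": 5, "status": "error", "trace": "Traceback"},
]

# "latency_ms" appears in 4 of 5 rows (0.8), "trace" in only 1 of 5.
core = core_fields(rows)
```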
compaction_heterogeneous_core_ratio
float
default:"0.6"
When the fraction of rows sharing a common core is below this value, the array is treated as heterogeneous and bucketed rather than compacted with a shared header.
compaction_max_flatten_inner_keys
int
default:"6"
Maximum number of inner keys to inline when flattening nested objects during tabular compaction.
compaction_min_buckets
int
default:"2"
Minimum number of discriminator buckets used when compacting a heterogeneous array.
compaction_max_buckets
int
default:"8"
Maximum number of discriminator buckets. Prevents over-splitting sparse arrays.

CacheAlignerConfig

Controls the cache-prefix stability detector. CacheAligner scans system messages for volatile content (UUIDs, timestamps, JWTs, hex hashes) and logs warnings when instability is detected. It does not modify messages — it only emits warnings and cache metrics for observability.
from headroom import CacheAlignerConfig

config = CacheAlignerConfig(
    enabled=True,
    use_dynamic_detector=True,
    detection_tiers=["regex"],
    entropy_threshold=0.7,
    normalize_whitespace=True,
    collapse_blank_lines=True,
)
enabled
bool
default:"false"
Enable the CacheAligner. Disabled by default because prefix stability gains are marginal in most workloads. Enable explicitly when debugging cache-miss issues.
use_dynamic_detector
bool
default:"true"
When True, uses the full DynamicContentDetector with 15+ structural patterns (UUIDs, API keys, JWTs, timestamps, hex hashes, version numbers, high-entropy strings). When False, falls back to legacy date-only regex patterns.
detection_tiers
list[Literal["regex", "ner", "semantic"]]
default:"[\"regex\"]"
Detection tiers to run (only when use_dynamic_detector=True):
  • "regex" — Fast structural/universal patterns (~0 ms). Recommended for production.
  • "ner" — Named Entity Recognition via spaCy (~5–10 ms). Optional.
  • "semantic" — Embedding similarity (~20–50 ms). Optional.
extra_dynamic_labels
list[str]
default:"[]"
Additional key names that hint their values are dynamic. For example, adding "session" will detect "session: abc123" and flag "abc123" as volatile.
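A label-based detector of this kind can be approximated with a regular expression that captures the value following a known key. The pattern below is an illustrative guess at the behavior, not the detector's actual rule; `find_labeled_values` is a hypothetical helper.

```python
import re


def find_labeled_values(text, labels):
    """Return values that follow any of the given labels as 'label: value' or 'label=value'."""
    if not labels:
        return []
    alternation = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(rf"\b(?:{alternation})\s*[:=]\s*(\S+)", re.IGNORECASE)
    return pattern.findall(text)


prompt = "You are a support bot. session: abc123 region=eu-west"

# Only the "session" label is registered, so only its value is flagged.
volatile = find_labeled_values(prompt, ["session"])

# Registering a second label flags both values, in order of appearance.
volatile_both = find_labeled_values(prompt, ["session", "region"])
```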
entropy_threshold
float
default:"0.7"
Entropy threshold (0–1) for identifying random-looking strings. Higher values are more selective (only very random strings like UUIDs). Lower values are more aggressive.
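For intuition about a 0 to 1 entropy score, the sketch below computes character-level Shannon entropy and normalizes it by the maximum possible for a string of that length. The normalization is an assumption and may differ from the detector's own measure; note that under this simple measure short natural-language words also score fairly high. Both helper names are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(text):
    """Character-level Shannon entropy scaled to the range 0..1."""
    if len(text) < 2:
        return 0.0
    total = len(text)
    counts = Counter(text)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Assumption: normalize by the entropy of a string with all-distinct characters.
    return entropy / math.log2(total)


def looks_random(text, entropy_threshold=0.7):
    """True when the normalized entropy meets or exceeds the threshold."""
    return normalized_entropy(text) >= entropy_threshold


repeated = normalized_entropy("aaaaaaaa")        # One symbol: no entropy.
alternating = looks_random("abababab")           # Two symbols: low entropy.
hex_token = looks_random("3f9a1c7e4b2d8056")     # 16 distinct symbols: maximal.
```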
normalize_whitespace
bool
default:"true"
Normalize whitespace in system messages to improve prefix stability. Caution: may break code blocks with significant indentation or ASCII art.
collapse_blank_lines
bool
default:"true"
Collapse consecutive blank lines to single blank lines.
dynamic_tail_separator
str
default:"\"\\n\\n---\\n[Dynamic Context]\\n\""
Separator marking where dynamic content begins in the system message. Content before this separator is the stable cacheable prefix; content after is dynamic.
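The split between the stable prefix and the dynamic tail can be demonstrated with `str.partition` on the default separator. The `split_system_prompt` helper and the sample prompt are hypothetical; only the separator string comes from the default above.

```python
DYNAMIC_TAIL_SEPARATOR = "\n\n---\n[Dynamic Context]\n"


def split_system_prompt(system_prompt, separator=DYNAMIC_TAIL_SEPARATOR):
    """Return (stable_prefix, dynamic_tail); the tail is empty if no separator is found."""
    stable, found, dynamic = system_prompt.partition(separator)
    return stable, (dynamic if found else "")


prompt = (
    "You are a helpful assistant."
    + DYNAMIC_TAIL_SEPARATOR
    + "Today is 2024-05-01."
)
stable_prefix, dynamic_tail = split_system_prompt(prompt)
```

Keeping volatile values such as dates after the separator leaves the text before it unchanged from request to request.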

CacheOptimizerConfig

Controls provider-specific cache optimization — Anthropic cache_control breakpoints, OpenAI prefix stabilization, and Google CachedContent API lifecycle management.
enabled
bool
default:"true"
Enable provider-specific cache optimization. Auto-detects the provider from the HeadroomClient provider instance.
auto_detect_provider
bool
default:"true"
Automatically select the cache optimizer implementation based on the provider name.
min_cacheable_tokens
int
default:"1024"
Minimum token count for a prefix to be considered cacheable. The provider may enforce a higher minimum.
enable_semantic_cache
bool
default:"false"
Enable query-level semantic caching within the optimizer layer. Requires the semantic cache extra.
semantic_cache_similarity
float
default:"0.95"
Minimum cosine similarity for a semantic cache hit.
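A cosine-similarity threshold check looks roughly like the following. The vectors are tiny made-up examples rather than real embeddings, and `is_semantic_hit` is a hypothetical helper, not the optimizer's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction to compare.
    return dot / (norm_a * norm_b)


def is_semantic_hit(query_vec, cached_vec, semantic_cache_similarity=0.95):
    """True when the similarity meets or exceeds the configured minimum."""
    return cosine_similarity(query_vec, cached_vec) >= semantic_cache_similarity


near_match = is_semantic_hit([1.0, 0.0], [0.99, 0.05])   # Almost the same direction.
loose_match = is_semantic_hit([1.0, 0.0], [1.0, 1.0])    # 45 degrees apart.
```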
semantic_cache_max_entries
int
default:"1000"
Maximum number of entries in the semantic cache.
semantic_cache_ttl_seconds
int
default:"300"
Time-to-live for semantic cache entries in seconds.

RelevanceScorerConfig

Controls how SmartCrusher scores items by relevance to the user’s query. Available scoring tiers are BM25 (zero dependencies), embedding-based (requires headroom-ai[relevance]), and hybrid (recommended).
from headroom import RelevanceScorerConfig

config = RelevanceScorerConfig(
    tier="hybrid",
    relevance_threshold=0.25,
    hybrid_alpha=0.5,
    adaptive_alpha=True,
)
tier
"bm25" | "embedding" | "hybrid"
default:"\"hybrid\""
Scoring method. "hybrid" combines BM25 keyword matching with semantic embeddings and is the recommended default. Falls back to BM25 if sentence-transformers is not installed.
bm25_k1
float
default:"1.5"
BM25 term-frequency saturation parameter.
bm25_b
float
default:"0.75"
BM25 length normalization parameter.
embedding_model
str
default:"ML_MODEL_DEFAULTS.sentence_transformer"
HuggingFace model name for the embedding scorer. Default is the Headroom-recommended sentence transformer.
hybrid_alpha
float
default:"0.5"
Weight given to BM25 in the hybrid scorer; the embedding weight is 1 - hybrid_alpha. A value of 0.5 weights both equally.
adaptive_alpha
bool
default:"true"
Dynamically adjust hybrid_alpha based on query type (keyword-heavy vs. semantic).
relevance_threshold
float
default:"0.25"
Minimum relevance score for an item to be considered relevant and kept. Lower values are safer (more items are kept); higher values compress more aggressively.
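The interaction between hybrid_alpha and relevance_threshold can be sketched as a weighted sum followed by a cutoff. The sketch assumes both component scores are already normalized to the range 0 to 1, which this page does not state; the helper names and the sample scores are hypothetical.

```python
def hybrid_score(bm25_score, embedding_score, hybrid_alpha=0.5):
    """Weighted blend: alpha for BM25, (1 - alpha) for the embedding score."""
    return hybrid_alpha * bm25_score + (1 - hybrid_alpha) * embedding_score


def keep_relevant(scored_items, relevance_threshold=0.25):
    """Keep item names whose blended score meets the threshold."""
    return [
        name
        for name, bm25, emb in scored_items
        if hybrid_score(bm25, emb) >= relevance_threshold
    ]


# (name, bm25_score, embedding_score), all values made up for illustration.
scored_items = [
    ("exact keyword match", 0.9, 0.4),   # Blend: 0.65
    ("paraphrase", 0.1, 0.7),            # Blend: 0.40
    ("unrelated", 0.05, 0.1),            # Blend: 0.075
]
kept = keep_relevant(scored_items)
```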

Data Model Types

Block

Atomic unit of context analysis. Each message is parsed into one or more Block objects.
kind
Literal["system", "user", "assistant", "tool_call", "tool_result", "rag", "unknown"]
The semantic type of this block.
text
str
Text content of the block.
tokens_est
int
Estimated token count for this block.
content_hash
str
Short hash of the block content for deduplication.
source_index
int
Position (index) of the originating message in the messages list.
flags
dict[str, Any]
Arbitrary flags set by analyzers (e.g. {"is_error": True}, {"is_anomaly": True}).

RequestMetrics

Comprehensive per-request metrics stored in the database after each call.
request_id
str
Unique identifier for this request.
timestamp
datetime
UTC timestamp when the request was processed.
model
str
Model name used for this request.
stream
bool
Whether the response was streamed.
mode
str
Operating mode: "audit", "optimize", or "simulate".
tokens_input_before
int
Input token count before compression.
tokens_input_after
int
Input token count after compression.
tokens_output
int | None
Output tokens from the model response. None for streaming requests (unknown at request time).
block_breakdown
dict[str, int]
Token counts by block type (system, user, assistant, tool_result, etc.).
waste_signals
dict[str, int]
Detected waste by category. See WasteSignals.to_dict() for key names.
stable_prefix_hash
str
16-character hash of the stable cache prefix. Compare across requests to detect cache misses.
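Comparing prefix hashes across requests is straightforward once each prefix is reduced to a short digest. The hash function Headroom uses is not stated here; the sketch assumes a truncated SHA-256 hex digest purely for illustration, and `prefix_hash` is a hypothetical helper.

```python
import hashlib


def prefix_hash(stable_prefix):
    """Return a 16-character hex digest of the prefix text."""
    # Assumption: SHA-256 truncated to 16 hex characters.
    digest = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
    return digest[:16]


first_request = prefix_hash("You are a helpful assistant.")
second_request = prefix_hash("You are a helpful assistant.")
changed_request = prefix_hash("You are a helpful assistant. Today is 2024-05-01.")

# Matching hashes suggest the provider cache could be reused.
likely_cache_hit = first_request == second_request
likely_cache_miss = first_request != changed_request
```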
cache_alignment_score
float
Score from 0.0 to 1.0 indicating how cache-friendly the prefix is.
cache_optimizer_used
str | None
Name of the cache optimizer that ran (e.g. "anthropic-cache-optimizer"), or None.
cache_optimizer_strategy
str | None
Strategy name used by the cache optimizer (e.g. "explicit_breakpoints").
cacheable_tokens
int
Number of tokens eligible for provider-side caching.
breakpoints_inserted
int
Number of cache_control breakpoints inserted (Anthropic only).
estimated_cache_hit
bool
Whether the prefix hash matched the previous request, suggesting a cache hit.
estimated_savings_percent
float
Estimated percentage savings if the provider cache was hit.
semantic_cache_hit
bool
Whether a semantic cache hit was returned instead of calling the API.
transforms_applied
list[str]
Names of all transforms that ran for this request.
messages_hash
str
Hash of the original messages for change detection.
error
str | None
Error message if the request failed, otherwise None.

TransformResult

Output of a pipeline or individual transform operation.
messages
list[dict[str, Any]]
Messages after the transform was applied.
tokens_before
int
Token count before this transform.
tokens_after
int
Token count after this transform.
transforms_applied
list[str]
Names of every sub-transform that ran.
markers_inserted
list[str]
CCR retrieval markers that were injected into messages.
warnings
list[str]
Non-fatal warnings emitted during the transform (e.g. detected volatile content in the prefix).
diff_artifact
DiffArtifact | None
Per-transform diff details. Populated only when HeadroomConfig.generate_diff_artifact=True.
cache_metrics
CachePrefixMetrics | None
Cache prefix stability metrics from CacheAligner.
timing
dict[str, float]
Wall-clock time in milliseconds per transform name.
transforms_summary
dict[str, int]
Read-only property: a count of each entry in transforms_applied. Example: {"router:tool_result:json": 4}.
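A counted summary of this shape can be produced with `collections.Counter`. This is only an illustration of the output format; the second transform name, "cache_aligner", is a made-up placeholder.

```python
from collections import Counter

# Four JSON tool results routed through the same transform, plus one
# hypothetical extra transform name for contrast.
transforms_applied = ["router:tool_result:json"] * 4 + ["cache_aligner"]

# Collapse the list into a name -> count mapping.
transforms_summary = dict(Counter(transforms_applied))
```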

SimulationResult

Returned by client.chat.completions.simulate() and client.messages.simulate(). Contains projected compression metrics without any API call.
tokens_before
int
Token count before compression.
tokens_after
int
Projected token count after compression.
tokens_saved
int
Tokens saved by compression: tokens_before - tokens_after.
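The relationship between the token fields is simple arithmetic, shown below together with a derived percentage. Both helpers are hypothetical and the token counts are made up; the percentage is not a documented field of SimulationResult.

```python
def tokens_saved(tokens_before, tokens_after):
    """Difference between the original and projected token counts."""
    return tokens_before - tokens_after


def savings_percent(tokens_before, tokens_after):
    """Savings as a percentage of the original count (0.0 for empty input)."""
    if tokens_before == 0:
        return 0.0
    return 100.0 * tokens_saved(tokens_before, tokens_after) / tokens_before


saved = tokens_saved(12_000, 3_000)
percent = savings_percent(12_000, 3_000)
```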
transforms
list[str]
Transforms that would be applied.
estimated_savings
str
Human-readable cost estimate per request, e.g. "$0.0042 per request".
messages_optimized
list[dict[str, Any]]
The projected compressed messages.
block_breakdown
dict[str, int]
Token counts by block type.
waste_signals
dict[str, int]
Waste by category.
stable_prefix_hash
str
16-character prefix hash after optimization.
cache_alignment_score
float
Cache-friendliness score (0–1).
