Headroom integrates with LiteLLM as a callback that compresses messages before they reach any provider. Enabling it takes one line, and it works with all 100+ LiteLLM-supported providers, including OpenAI, Anthropic, Bedrock, Azure, Vertex AI, Groq, Mistral, and Ollama.
Installation
pip install headroom-ai litellm
Quick start
Set litellm.callbacks to a list containing a HeadroomCallback instance. Every subsequent completion() or acompletion() call will have its messages compressed automatically:
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
litellm.callbacks = [HeadroomCallback()]
# All calls now compressed automatically
response = litellm.completion(model="gpt-4o", messages=[...])
response = litellm.completion(model="bedrock/claude-sonnet", messages=[...])
response = litellm.completion(model="azure/gpt-4o", messages=[...])
The callback compresses messages in LiteLLM’s async_pre_call_hook before they reach the provider. The response format is unchanged.
How it works
1. You call litellm.completion(). This is a normal LiteLLM call with your messages.
2. HeadroomCallback.async_pre_call_hook fires. Headroom intercepts the call and runs its compression pipeline on the messages.
3. LiteLLM sends the compressed messages. The smaller payload is forwarded to the selected provider.
4. The response comes back unchanged. Its format is identical to what you would receive without Headroom.
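The four steps above can be sketched with a toy pipeline. This is not Headroom's implementation: `shrink_messages`, `fake_provider`, and `complete` are hypothetical stand-ins, and the "compression" is a naive truncation used only to show where a pre-call hook sits in the call path.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Hook = Callable[[List[Message]], List[Message]]


def shrink_messages(messages: List[Message], max_chars: int = 200) -> List[Message]:
    """Stand-in for a compression pipeline: truncate long message bodies."""
    # Build new dicts so the caller's messages are never mutated.
    return [{**m, "content": m["content"][:max_chars]} for m in messages]


def fake_provider(messages: List[Message]) -> Dict[str, object]:
    """Stand-in for a provider: reports how many characters it received."""
    received = sum(len(m["content"]) for m in messages)
    return {"choices": [{"message": {"content": "ok"}}], "received_chars": received}


def complete(messages: List[Message], pre_call_hooks: List[Hook]) -> Dict[str, object]:
    """Run every pre-call hook in order, then forward the result to the provider."""
    for hook in pre_call_hooks:
        messages = hook(messages)
    return fake_provider(messages)


messages = [
    {"role": "system", "content": "You are an SRE assistant."},
    {"role": "user", "content": "ERROR timeout\n" * 100},
]

plain = complete(messages, pre_call_hooks=[])
hooked = complete(messages, pre_call_hooks=[shrink_messages])

# The response shape is the same either way; only the payload size differs.
print(plain["received_chars"], hooked["received_chars"])
```

The caller's code does not change between the two runs, which is the property the callback approach relies on: the hook is registered once and applied on every call.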
Full LiteLLM completion example
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
callback = HeadroomCallback(
    min_tokens=500,      # Skip compression below this threshold
    model_limit=200000,  # Target context window
)
litellm.callbacks = [callback]
# large_log_dump is assumed to be a long string, such as raw log output
messages = [
    {"role": "system", "content": "You are an SRE assistant."},
    {"role": "user", "content": large_log_dump},
]
response = litellm.completion(
    model="gpt-4o",
    messages=messages,
)
print(response.choices[0].message.content)
print(f"Total tokens saved so far: {callback.total_tokens_saved}")
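To make the min_tokens threshold concrete, here is a minimal sketch of a size gate. It assumes a rough estimate of about 4 characters per token and an inclusive threshold; both are assumptions for illustration, and `estimate_tokens` and `should_compress` are hypothetical helpers, not Headroom functions. A real tokenizer is model-specific and will give different counts.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    """Rough estimate: about 4 characters per token (a rule of thumb, not a tokenizer)."""
    return sum(len(m["content"]) for m in messages) // 4


def should_compress(messages: List[Message], min_tokens: int = 500) -> bool:
    """Return True when the estimated payload size meets the threshold."""
    return estimate_tokens(messages) >= min_tokens


short_chat = [{"role": "user", "content": "Why is the pod restarting?"}]
log_dump = [{"role": "user", "content": "2024-01-01 ERROR connection reset\n" * 200}]

print(should_compress(short_chat))  # small prompt: left alone
print(should_compress(log_dump))    # large log dump: worth compressing
```

Skipping small payloads avoids spending compression time on prompts where there is little to save.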
Direct compress() with LiteLLM
You can also use compress() directly instead of the callback, for full control over when and how compression runs:
import litellm
from headroom import compress
# large_content is assumed to be a long string you want compressed
messages = [{"role": "user", "content": large_content}]
compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(
    model="bedrock/claude-sonnet",
    messages=compressed.messages,
)
print(f"Saved {compressed.tokens_saved} tokens")
Provider routing with LiteLLM + Headroom
One of LiteLLM’s key features is routing the same call to different providers. Headroom compresses messages before routing, so savings apply regardless of which backend handles the request:
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
litellm.callbacks = [HeadroomCallback()]
# Example prompt reused for every provider
messages = [{"role": "user", "content": "Summarize the last deploy in one sentence."}]
# Route to different providers; Headroom compresses for all of them
providers = [
    "gpt-4o",
    "anthropic/claude-sonnet-4-5-20250929",
    "bedrock/amazon.titan-text-premier-v1:0",
    "groq/llama-3.3-70b-versatile",
]
for model in providers:
    response = litellm.completion(model=model, messages=messages)
    print(f"{model}: {response.choices[0].message.content[:80]}")
With LiteLLM Proxy
If you run LiteLLM as a proxy server, add Headroom as ASGI middleware:
from litellm.proxy.proxy_server import app
from headroom.integrations.asgi import CompressionMiddleware
app.add_middleware(CompressionMiddleware)
Or configure it via YAML without any code changes:
# litellm_config.yaml
litellm_settings:
  callbacks: ["headroom.integrations.litellm_callback.HeadroomCallback"]
For every compressed request, the ASGI middleware adds the response headers x-headroom-compressed: true and x-headroom-tokens-saved: <n>.
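A client can read these headers to track savings per request. The sketch below is a hypothetical helper, `tokens_saved_from_headers`, that works on any mapping of header names to values. The sample values are made up, and it is an assumption that the headers are absent when a request is not compressed.

```python
from typing import Mapping, Optional


def tokens_saved_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Return the tokens-saved count, or None if the headers are absent or malformed."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-headroom-compressed", "").strip().lower() != "true":
        return None
    raw = normalized.get("x-headroom-tokens-saved")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


# Made-up sample headers for illustration.
sample = {"X-Headroom-Compressed": "true", "X-Headroom-Tokens-Saved": "1234"}
print(tokens_saved_from_headers(sample))
```

With an HTTP client such as requests or httpx, the same function can be passed response.headers directly, since both expose headers as a mapping.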
Cloud mode
HeadroomCallback supports a cloud mode that routes compression through Headroom Cloud for managed CCR, TOIN learning, and org-wide analytics dashboards:
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
# Cloud mode — compression runs on Headroom Cloud, not locally
litellm.callbacks = [HeadroomCallback(api_key="hdr_xxx")]
You can also set the API key via the HEADROOM_API_KEY environment variable:
export HEADROOM_API_KEY=hdr_xxx
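If your application needs to know at startup whether a key is available, a small lookup like the one below can help. `resolve_api_key` is a hypothetical helper, and the precedence it applies (explicit argument first, then the environment variable) is an assumption for illustration rather than documented Headroom behavior.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the explicit key if given, else HEADROOM_API_KEY, else None."""
    # Empty strings are treated as "not set".
    return explicit or env.get("HEADROOM_API_KEY") or None


key = resolve_api_key()
print("cloud key found" if key else "no cloud key configured")
```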
HeadroomCallback only fires on completion and acompletion call types. Embeddings, image generation, and other LiteLLM endpoints pass through unchanged.
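The call-type filter can be pictured as a guard at the top of the hook. The sketch below is a toy class, not HeadroomCallback: its parameter list is modeled on LiteLLM's async_pre_call_hook as an assumption, the "embeddings" call-type string is illustrative, and whitespace stripping stands in for real compression.

```python
import asyncio
from typing import Any, Dict

# Call types that the toy hook rewrites; everything else passes through.
COMPRESSED_CALL_TYPES = {"completion", "acompletion"}


class CallTypeGate:
    """Toy hook that only rewrites chat-completion requests."""

    async def async_pre_call_hook(
        self,
        user_api_key_dict: Any,
        cache: Any,
        data: Dict[str, Any],
        call_type: str,
    ) -> Dict[str, Any]:
        if call_type not in COMPRESSED_CALL_TYPES or "messages" not in data:
            return data  # pass through untouched
        # Placeholder transformation: strip surrounding whitespace.
        trimmed = [{**m, "content": m["content"].strip()} for m in data["messages"]]
        return {**data, "messages": trimmed}


gate = CallTypeGate()
chat_request = {"messages": [{"role": "user", "content": "  hi  "}]}
embedding_request = {"input": ["  hi  "]}

chat = asyncio.run(gate.async_pre_call_hook(None, None, chat_request, "completion"))
emb = asyncio.run(gate.async_pre_call_hook(None, None, embedding_request, "embeddings"))

print(chat["messages"][0]["content"])
print(emb is embedding_request)
```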