Headroom integrates with Agno (formerly Phidata) by wrapping your Agno model in HeadroomAgnoModel. The wrapper is a full agno.models.base.Model subclass, so it works with Agno’s Agent, tool loops, reasoning modes, and streaming. Headroom compresses messages before each API call inside the agent’s own invoke cycle, including after tool results are appended.
Installation
pip install "headroom-ai[agno]" agno
Quick start
Wrap any Agno model and pass it to your Agent:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)
response = agent.run("What's the capital of France?")
print(f"Tokens saved: {model.total_tokens_saved}")
print(model.get_savings_summary())
# Example output (values will vary): {'total_requests': 1, 'total_tokens_saved': 245, 'average_savings_percent': 12.3}
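The fields in the summary can be read as simple aggregates over per-request token counts. The sketch below shows one way such a summary might be derived. It is an assumption for illustration, not Headroom's implementation: `summarize_savings` and the `(tokens_before, tokens_after)` pairs are hypothetical.

```python
from typing import Dict, List, Tuple


def summarize_savings(requests: List[Tuple[int, int]]) -> Dict[str, float]:
    """Aggregate hypothetical (tokens_before, tokens_after) pairs."""
    total_saved = sum(before - after for before, after in requests)
    # Per-request savings as a percentage of the original size.
    percents = [
        100.0 * (before - after) / before
        for before, after in requests
        if before > 0
    ]
    average = sum(percents) / len(percents) if percents else 0.0
    return {
        "total_requests": len(requests),
        "total_tokens_saved": total_saved,
        "average_savings_percent": round(average, 1),
    }


summary = summarize_savings([(2000, 1500), (1000, 900)])
print(summary)
```

Averaging per-request percentages, as done here, weights small and large requests equally. A real implementation could just as well report total saved over total sent.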
Works with any Agno provider:
from agno.models.anthropic import Claude
from agno.models.google import Gemini
claude_model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
gemini_model = HeadroomAgnoModel(Gemini(id="gemini-2.0-flash"))
Tool outputs (JSON, logs, search results) see the biggest compression gains — typically 70–90% reduction:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(
    model=model,
    tools=[DuckDuckGoTools()],
    show_tool_calls=True,
)
response = agent.run("Research the latest AI developments")
print(f"Tokens saved: {model.total_tokens_saved}")
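To see why tool outputs compress so well, consider a search tool that returns dozens of near-identical JSON records. The toy sketch below keeps the first few records and notes how many were dropped. It is not Headroom's actual transform: `truncate_tool_output` is a hypothetical helper, and character counts stand in for tokens.

```python
import json
from typing import Any, List


def truncate_tool_output(items: List[Any], keep: int = 3) -> str:
    """Keep the first `keep` items and record how many were omitted."""
    kept = items[:keep]
    omitted = max(len(items) - keep, 0)
    return json.dumps({"items": kept, "omitted_count": omitted})


# Simulated search results: 50 records with the same shape.
results = [
    {"title": f"Result {i}", "url": f"https://example.com/{i}"}
    for i in range(50)
]
original = json.dumps({"items": results})
compressed = truncate_tool_output(results, keep=3)

# Character length is only a rough proxy for token count.
reduction = 100.0 * (1 - len(compressed) / len(original))
print(f"{len(original)} -> {len(compressed)} characters ({reduction:.0f}% smaller)")
```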
Observability hooks
Use pre-hooks and post-hooks for detailed tracking without modifying your model:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import (
HeadroomAgnoModel,
HeadroomPreHook,
HeadroomPostHook,
)
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
pre_hook = HeadroomPreHook()
post_hook = HeadroomPostHook(token_alert_threshold=10000)
agent = Agent(
    model=model,
    pre_hooks=[pre_hook],
    post_hooks=[post_hook],
)
response = agent.run("Analyze this large dataset...")
# Check for alerts
if post_hook.alerts:
    print(f"{len(post_hook.alerts)} requests exceeded threshold")
Or use the convenience factory to create both hooks at once:
from headroom.integrations.agno import create_headroom_hooks
pre_hook, post_hook = create_headroom_hooks(
    token_alert_threshold=5000,
    log_level="DEBUG",
)
HeadroomPreHook tracks request counts and timing. Actual token optimization happens inside HeadroomAgnoModel at the model level, not the hook level — hooks are for observability.
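The threshold alerting described above can be pictured as a callable that records an entry whenever a request's token count exceeds a limit. This is a minimal sketch under stated assumptions: `ThresholdAlertHook` is a hypothetical stand-in, not the HeadroomPostHook class, and the strict greater-than comparison is a guess at the semantics.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThresholdAlertHook:
    """Hypothetical stand-in for a post-hook that flags large requests."""

    token_alert_threshold: int
    alerts: List[str] = field(default_factory=list)

    def __call__(self, request_id: str, tokens_used: int) -> None:
        # Assumption: alert only when usage is strictly above the threshold.
        if tokens_used > self.token_alert_threshold:
            self.alerts.append(
                f"{request_id}: {tokens_used} tokens > {self.token_alert_threshold}"
            )


hook = ThresholdAlertHook(token_alert_threshold=10000)
for request_id, tokens in [("req-1", 4200), ("req-2", 15300), ("req-3", 10000)]:
    hook(request_id, tokens)

print(f"{len(hook.alerts)} requests exceeded threshold")
```

Note that the hook only observes; nothing in it alters the messages, which matches the split between model-level optimization and hook-level observability.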
Streaming support
HeadroomAgnoModel supports both sync and async streaming:
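import asyncio
from agno.models.message import Message
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
async def process():
    model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
    # Example input; assumes Agno's Message class from agno.models.message
    messages = [Message(role="user", content="Summarize the latest AI developments")]
    # Async single response
    response = await model.aresponse(messages)
    # Async streaming
    async for chunk in model.aresponse_stream(messages):
        print(chunk, end="", flush=True)
asyncio.run(process())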
Standalone message optimization
Optimize messages directly without wrapping a model — useful when you need fine-grained control:
from headroom.integrations.agno import optimize_messages
optimized, metrics = optimize_messages(messages, model="gpt-4o")
print(f"Tokens saved: {metrics['tokens_saved']}")
print(f"Transforms applied: {metrics['transforms_applied']}")
Session management
HeadroomAgnoModel accumulates metrics across calls. Reset between sessions with model.reset():
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)
# Session 1
agent.run("First conversation...")
print(model.get_savings_summary())
# Reset for new session
model.reset()
# Session 2 starts fresh
agent.run("Second conversation...")
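The accumulate-then-reset behavior can be illustrated with a small counter. This sketch is an assumption about the general pattern only: `SavingsTracker` is hypothetical and does not reflect which fields HeadroomAgnoModel actually clears.

```python
class SavingsTracker:
    """Hypothetical counter that accumulates until reset() is called."""

    def __init__(self) -> None:
        self.total_requests = 0
        self.total_tokens_saved = 0

    def record(self, tokens_saved: int) -> None:
        self.total_requests += 1
        self.total_tokens_saved += tokens_saved

    def reset(self) -> None:
        # Clear all counters so the next session starts from zero.
        self.total_requests = 0
        self.total_tokens_saved = 0


tracker = SavingsTracker()

# Session 1: two requests accumulate into the same totals.
tracker.record(120)
tracker.record(80)
session_one = (tracker.total_requests, tracker.total_tokens_saved)

# Session 2: totals start over after reset().
tracker.reset()
tracker.record(50)
session_two = (tracker.total_requests, tracker.total_tokens_saved)

print(session_one, session_two)
```

If per-session numbers matter, read the summary before calling reset, since the reset discards the accumulated totals.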
Agno reasoning modes
HeadroomAgnoModel exposes an underlying_model property for framework introspection — Agno uses this to detect reasoning capabilities:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
model = OpenAIChat(id="gpt-4o")
wrapped = HeadroomAgnoModel(wrapped_model=model)
agent = Agent(
    model=wrapped,
    reasoning=True,
    reasoning_model=wrapped.underlying_model,  # Use underlying for type detection
)
Claude’s extended thinking and Agno’s reasoning flow are incompatible. Choose one: set thinking={"type": "enabled", "budget_tokens": N} on the Claude model for native extended thinking, or use reasoning=True on the Agent for Agno’s framework-level chain-of-thought — not both. Headroom automatically skips compression for messages that contain extended thinking blocks.
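One way to enforce the "choose one" rule in application code is a small guard that runs before the model and agent are constructed. The helper below is a hypothetical sketch, not part of Headroom or Agno: `pick_reasoning_mode` and its return labels are invented for illustration.

```python
from typing import Any, Dict, Optional


def pick_reasoning_mode(
    thinking: Optional[Dict[str, Any]] = None,
    agent_reasoning: bool = False,
) -> str:
    """Return the active reasoning mode, rejecting the unsupported combination."""
    thinking_enabled = thinking is not None and thinking.get("type") == "enabled"
    if thinking_enabled and agent_reasoning:
        raise ValueError(
            "Enable either Claude extended thinking or Agent reasoning, not both"
        )
    if thinking_enabled:
        return "claude_extended_thinking"
    if agent_reasoning:
        return "agno_reasoning"
    return "none"


native = pick_reasoning_mode(thinking={"type": "enabled", "budget_tokens": 2048})
framework = pick_reasoning_mode(agent_reasoning=True)
print(native, framework)
```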
Supported providers
| Provider | Agno Model | Auto-Detected |
|---|---|---|
| OpenAI | OpenAIChat, OpenAILike | Yes |
| Anthropic | Claude, AwsBedrock | Yes |
| Google | Gemini, VertexAI | Yes |
| Groq | Groq | Yes |
| Mistral | Mistral | Yes |
| Ollama | Ollama | Yes |
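Auto-detection of this kind is often implemented as a lookup keyed on the wrapped model's class name. The sketch below mirrors the table above but is an assumption about the mechanism, not Headroom's code: the `PROVIDER_BY_CLASS_NAME` mapping, the `detect_provider` helper, the `"unknown"` fallback, and the stub classes are all hypothetical.

```python
# Hypothetical lookup table built from the supported-providers table.
PROVIDER_BY_CLASS_NAME = {
    "OpenAIChat": "openai",
    "OpenAILike": "openai",
    "Claude": "anthropic",
    "AwsBedrock": "anthropic",
    "Gemini": "google",
    "VertexAI": "google",
    "Groq": "groq",
    "Mistral": "mistral",
    "Ollama": "ollama",
}


def detect_provider(model: object) -> str:
    """Map a model instance to a provider label by its class name."""
    return PROVIDER_BY_CLASS_NAME.get(type(model).__name__, "unknown")


class Claude:
    """Stub standing in for agno.models.anthropic.Claude."""


class CustomModel:
    """Stub for a model class that is not in the table."""


print(detect_provider(Claude()))
print(detect_provider(CustomModel()))
```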