Headroom integrates with Agno (formerly Phidata) by wrapping your Agno model in HeadroomAgnoModel. The wrapper is a full agno.models.base.Model subclass, so it works with Agno’s Agent, tool loops, reasoning modes, and streaming. Headroom compresses messages before each API call inside the agent’s own invoke cycle, including calls made after tool results are appended.

Installation

pip install "headroom-ai[agno]" agno

Quick start

Wrap any Agno model and pass it to your Agent:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)

response = agent.run("What's the capital of France?")

print(f"Tokens saved: {model.total_tokens_saved}")
print(model.get_savings_summary())
# Example output: {'total_requests': 1, 'total_tokens_saved': 245, 'average_savings_percent': 12.3}

Works with any Agno provider:
from agno.models.anthropic import Claude
from agno.models.google import Gemini

claude_model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
gemini_model = HeadroomAgnoModel(Gemini(id="gemini-2.0-flash"))

Full agent example with tools

Tool outputs (JSON, logs, search results) see the biggest compression gains — typically 70–90% reduction:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

agent = Agent(
    model=model,
    tools=[DuckDuckGoTools()],
    show_tool_calls=True,
)

response = agent.run("Research the latest AI developments")
print(f"Tokens saved: {model.total_tokens_saved}")
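
To illustrate why structured tool results tend to shrink well, the sketch below shows one generic effect: uniform JSON records repeat the same keys in every row, so storing the keys once removes a lot of text. This is not Headroom's algorithm. The `to_columnar` helper, the sample `results`, and the use of character count as a rough proxy for tokens are all assumptions made for illustration.

```python
import json


def to_columnar(records):
    """Rewrite a list of same-shaped dicts as one header plus value rows."""
    if not records:
        return {"columns": [], "rows": []}
    columns = list(records[0].keys())
    rows = [[rec.get(col) for col in columns] for rec in records]
    return {"columns": columns, "rows": rows}


# Hypothetical search-tool output: 50 results with identical keys.
results = [
    {"title": f"Result {i}", "url": f"https://example.com/{i}", "score": i}
    for i in range(50)
]

original = json.dumps(results)
compact = json.dumps(to_columnar(results), separators=(",", ":"))

# Character count is only a rough stand-in for token count.
reduction = 100 * (1 - len(compact) / len(original))
print(f"{len(original)} -> {len(compact)} chars ({reduction:.0f}% smaller)")
```

The actual savings in an agent run depend on the tool, the payload, and the transforms Headroom applies, so treat the printed figure as a property of this toy input only.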

Observability hooks

Use pre-hooks and post-hooks for detailed tracking without modifying your model:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import (
    HeadroomAgnoModel,
    HeadroomPreHook,
    HeadroomPostHook,
)

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

pre_hook = HeadroomPreHook()
post_hook = HeadroomPostHook(token_alert_threshold=10000)

agent = Agent(
    model=model,
    pre_hooks=[pre_hook],
    post_hooks=[post_hook],
)

response = agent.run("Analyze this large dataset...")

# Check for alerts
if post_hook.alerts:
    print(f"{len(post_hook.alerts)} requests exceeded threshold")

Or use the convenience factory to create both hooks at once:
from headroom.integrations.agno import create_headroom_hooks

pre_hook, post_hook = create_headroom_hooks(
    token_alert_threshold=5000,
    log_level="DEBUG",
)

HeadroomPreHook tracks request counts and timing. Token optimization itself happens inside HeadroomAgnoModel at the model level, not in the hooks; the hooks exist only for observability.
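
The threshold-and-alerts pattern can be sketched without either library. The `ThresholdAlertHook` class below is a hypothetical stand-in, not the implementation of HeadroomPostHook and not Agno's hook signature. It only shows the idea of counting requests and recording an alert when a request's token count exceeds a limit. The token counts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdAlertHook:
    """Hypothetical stand-in: records an alert when a request is too large."""

    token_alert_threshold: int
    request_count: int = 0
    alerts: list = field(default_factory=list)

    def record(self, tokens_used: int) -> None:
        # Count every request; keep details only for the oversized ones.
        self.request_count += 1
        if tokens_used > self.token_alert_threshold:
            self.alerts.append(
                {"request": self.request_count, "tokens": tokens_used}
            )


hook = ThresholdAlertHook(token_alert_threshold=10_000)
for tokens in (2_500, 12_000, 9_999, 30_000):  # made-up per-request counts
    hook.record(tokens)

if hook.alerts:
    print(f"{len(hook.alerts)} requests exceeded threshold")
```

Because the hook only observes, removing it changes what gets reported but not what gets sent to the model.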

Streaming support

HeadroomAgnoModel supports both sync and async streaming. The example below uses the async methods:
import asyncio
from agno.models.message import Message
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

async def process():
    model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

    # Assumed input: a list of Agno Message objects
    messages = [Message(role="user", content="What's the capital of France?")]

    # Async single response
    response = await model.aresponse(messages)

    # Async streaming
    async for chunk in model.aresponse_stream(messages):
        print(chunk, end="", flush=True)

asyncio.run(process())
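
The consumption side of async streaming can be tried without an API key. In the sketch below, `fake_stream` is a hypothetical stand-in for a streaming model call that yields plain string chunks. A real model stream may yield richer response objects, so the chunk handling here is an assumption.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming model call; yields word chunks."""
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word + " "


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the assembled text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts).strip()


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France")))
```

Collecting chunks into a list while printing them keeps the full response available after the stream ends, which is useful for logging or post-processing.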

Standalone message optimization

Optimize messages directly without wrapping a model — useful when you need fine-grained control:
from headroom.integrations.agno import optimize_messages

# `messages` is the conversation you would otherwise send to the model
optimized, metrics = optimize_messages(messages, model="gpt-4o")
print(f"Tokens saved: {metrics['tokens_saved']}")
print(f"Transforms applied: {metrics['transforms_applied']}")

Session management

HeadroomAgnoModel accumulates metrics across calls. Reset between sessions with model.reset():
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)

# Session 1
agent.run("First conversation...")
print(model.get_savings_summary())

# Reset for new session
model.reset()

# Session 2 starts fresh
agent.run("Second conversation...")
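
The accumulate-then-reset behaviour can be pictured with a small stand-in. `SavingsTracker` below is hypothetical and is not Headroom's internal bookkeeping. It reuses the summary keys shown in the quick start, and it assumes the average is a plain mean of per-request savings percentages, which the documentation does not specify.

```python
class SavingsTracker:
    """Hypothetical stand-in for per-model savings bookkeeping."""

    def __init__(self) -> None:
        self.reset()

    def record(self, tokens_before: int, tokens_after: int) -> None:
        saved = tokens_before - tokens_after
        self.total_requests += 1
        self.total_tokens_saved += saved
        if tokens_before > 0:
            self._percent_sum += 100 * saved / tokens_before

    def summary(self) -> dict:
        # Assumption: the average is a plain mean of per-request percentages.
        if self.total_requests:
            avg = self._percent_sum / self.total_requests
        else:
            avg = 0.0
        return {
            "total_requests": self.total_requests,
            "total_tokens_saved": self.total_tokens_saved,
            "average_savings_percent": round(avg, 1),
        }

    def reset(self) -> None:
        self.total_requests = 0
        self.total_tokens_saved = 0
        self._percent_sum = 0.0


tracker = SavingsTracker()
tracker.record(tokens_before=2000, tokens_after=1500)  # session 1, call 1
tracker.record(tokens_before=1000, tokens_after=500)   # session 1, call 2
session_one = tracker.summary()

tracker.reset()  # session 2 starts from zero
session_two = tracker.summary()
```

Reading the summary before calling reset matters: once the counters are cleared, the previous session's totals are gone unless you stored them.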

Agno reasoning modes

HeadroomAgnoModel exposes an underlying_model property for framework introspection — Agno uses this to detect reasoning capabilities:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = OpenAIChat(id="gpt-4o")
wrapped = HeadroomAgnoModel(wrapped_model=model)

agent = Agent(
    model=wrapped,
    reasoning=True,
    reasoning_model=wrapped.underlying_model,  # Use underlying for type detection
)

Claude’s extended thinking and Agno’s reasoning flow are incompatible, so choose one. Either set thinking={"type": "enabled", "budget_tokens": N} on the Claude model for native extended thinking, or set reasoning=True on the Agent for Agno’s framework-level chain-of-thought, but not both. Headroom automatically skips compression for messages that contain extended thinking blocks.
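
One way to enforce the either-or rule in your own setup code is a small guard that runs before the agent is built. The `check_reasoning_config` helper and the mode names it returns are hypothetical. Neither Headroom nor Agno is known to provide such a function, so this is only a sketch of the check.

```python
from typing import Optional


def check_reasoning_config(thinking: Optional[dict], reasoning: bool) -> str:
    """Return which reasoning mode is active, or raise if both are enabled."""
    native = bool(thinking) and thinking.get("type") == "enabled"
    if native and reasoning:
        raise ValueError(
            "Enable either Claude extended thinking or Agent reasoning=True, not both"
        )
    if native:
        return "claude-extended-thinking"
    if reasoning:
        return "agno-reasoning"
    return "none"


mode = check_reasoning_config(
    {"type": "enabled", "budget_tokens": 2048}, reasoning=False
)
print(mode)
```

Failing at configuration time makes the conflict visible before any request is sent.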

Supported providers

| Provider | Agno Model | Auto-Detected |
| --- | --- | --- |
| OpenAI | OpenAIChat, OpenAILike | Yes |
| Anthropic | Claude, AwsBedrock | Yes |
| Google | Gemini, VertexAI | Yes |
| Groq | Groq | Yes |
| Mistral | Mistral | Yes |
| Ollama | Ollama | Yes |
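
The table can be read as a lookup from Agno model class to provider. The sketch below encodes it that way. How Headroom actually performs auto-detection is not described here, so matching on the class name is an assumption, and `detect_provider` and the placeholder `Claude` class are hypothetical.

```python
# Mapping taken from the table above: Agno model class name -> provider.
PROVIDER_BY_MODEL_CLASS = {
    "OpenAIChat": "OpenAI",
    "OpenAILike": "OpenAI",
    "Claude": "Anthropic",
    "AwsBedrock": "Anthropic",
    "Gemini": "Google",
    "VertexAI": "Google",
    "Groq": "Groq",
    "Mistral": "Mistral",
    "Ollama": "Ollama",
}


def detect_provider(model: object) -> str:
    """Look up the provider from the model's class name; 'unknown' if unlisted."""
    return PROVIDER_BY_MODEL_CLASS.get(type(model).__name__, "unknown")


class Claude:  # placeholder standing in for agno.models.anthropic.Claude
    pass


print(detect_provider(Claude()))
```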
