Headroom integrates with LangChain by wrapping your chat model in HeadroomChatModel. LangChain callbacks cannot modify messages by design, so wrapping the model is the correct approach: the wrapper intercepts every invoke(), ainvoke(), stream(), and astream() call and compresses the messages before they reach the underlying LLM. The rest of your LangChain code stays unchanged.
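The wrapping approach can be pictured with a small, library-free sketch. Everything in it is hypothetical and exists only for illustration: `EchoModel`, `CompressingModel`, and `drop_duplicate_messages` are not Headroom or LangChain APIs, and dropping duplicates merely stands in for real compression. The point is that a wrapper can rewrite messages before delegating, whereas a callback can only observe them.

```python
from typing import Callable, List


class EchoModel:
    """Stand-in for a chat model that records what it receives (hypothetical)."""

    def __init__(self) -> None:
        self.received: List[str] = []

    def invoke(self, messages: List[str]) -> str:
        self.received = list(messages)
        return f"saw {len(messages)} message(s)"


def drop_duplicate_messages(messages: List[str]) -> List[str]:
    """Toy 'compression': remove exact duplicates while preserving order."""
    seen = set()
    kept = []
    for message in messages:
        if message not in seen:
            seen.add(message)
            kept.append(message)
    return kept


class CompressingModel:
    """Wraps a model and rewrites messages before delegating (hypothetical)."""

    def __init__(
        self,
        inner: EchoModel,
        compress: Callable[[List[str]], List[str]],
    ) -> None:
        self.inner = inner
        self.compress = compress

    def invoke(self, messages: List[str]) -> str:
        # The inner model only ever sees the compressed messages.
        return self.inner.invoke(self.compress(messages))


inner = EchoModel()
wrapped = CompressingModel(inner, drop_duplicate_messages)
reply = wrapped.invoke(["tool output A", "tool output A", "question"])
print(reply)
```

Callers use `wrapped.invoke(...)` exactly as they would use the inner model, which is why the same pattern can be a drop-in replacement.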
Installation
pip install "headroom-ai[langchain]"
Quick start
Wrap any BaseChatModel in one line:
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
# Use exactly like before
response = llm.invoke("Hello!")
# Check savings
print(llm.get_savings_summary())
# Example output (values are illustrative and depend on your prompts):
# {'total_requests': 1, 'total_tokens_saved': 12500, 'average_savings_percent': 45.2}
Works with any LangChain provider:
from langchain_anthropic import ChatAnthropic
llm = HeadroomChatModel(ChatAnthropic(model="claude-sonnet-4-20250514"))
Full chain example
HeadroomChatModel is a drop-in replacement in any LCEL chain:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from headroom.integrations import HeadroomChatModel
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}"),
])
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
chain = prompt | llm | StrOutputParser()
response = chain.invoke({"input": "Summarize these search results..."})
print(llm.get_savings_summary())
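To make the fields of the savings summary concrete, here is a minimal sketch of how such counters could be accumulated. It is an assumption for illustration, not Headroom's implementation: the `SavingsTracker` class is hypothetical, and averaging the per-request percentages is only one plausible definition of `average_savings_percent`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SavingsTracker:
    """Hypothetical accumulator that mirrors the summary keys shown above."""

    total_requests: int = 0
    total_tokens_saved: int = 0
    _percents: List[float] = field(default_factory=list)

    def record(self, tokens_before: int, tokens_after: int) -> None:
        saved = max(tokens_before - tokens_after, 0)
        self.total_requests += 1
        self.total_tokens_saved += saved
        # Guard against empty prompts to avoid dividing by zero.
        percent = 100.0 * saved / tokens_before if tokens_before else 0.0
        self._percents.append(percent)

    def summary(self) -> Dict[str, float]:
        if self._percents:
            average = sum(self._percents) / len(self._percents)
        else:
            average = 0.0
        return {
            "total_requests": self.total_requests,
            "total_tokens_saved": self.total_tokens_saved,
            "average_savings_percent": round(average, 1),
        }


tracker = SavingsTracker()
tracker.record(tokens_before=1000, tokens_after=600)   # 40% saved
tracker.record(tokens_before=2000, tokens_after=1000)  # 50% saved
print(tracker.summary())
```

A token-weighted average (total saved divided by total input) would give a different number for the same requests, so check which definition the real summary uses before comparing runs.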
Memory integration
HeadroomChatMessageHistory wraps any chat history with automatic compression. Long conversations stay under your token budget:
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from headroom.integrations import HeadroomChatMessageHistory
base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,  # Compress when over 4K tokens
    keep_recent_turns=5,  # Always keep last 5 turns
)
memory = ConversationBufferMemory(chat_memory=compressed_history)
After usage:
print(compressed_history.get_compression_stats())
# {'compression_count': 12, 'total_tokens_saved': 28000}
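The interaction between a token threshold and a protected window of recent turns can be sketched without any library. This is an illustration under stated assumptions, not how HeadroomChatMessageHistory works internally: `estimate_tokens` counts whitespace-separated words, a turn is assumed to be one user message plus one assistant message, and older messages are replaced by a placeholder rather than a real summary.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


def estimate_tokens(messages: List[Message]) -> int:
    """Crude estimate: one token per whitespace-separated word."""
    return sum(len(content.split()) for _, content in messages)


def compress_history(
    messages: List[Message],
    compress_threshold_tokens: int,
    keep_recent_turns: int,
) -> List[Message]:
    """Collapse older messages once the history exceeds the threshold."""
    if estimate_tokens(messages) <= compress_threshold_tokens:
        return messages  # under budget: leave untouched

    # Assumption: a turn is one user message plus one assistant message.
    keep = keep_recent_turns * 2
    split = max(len(messages) - keep, 0)
    older, recent = messages[:split], messages[split:]
    if not older:
        return messages  # everything is inside the protected window
    placeholder = ("system", f"[{len(older)} earlier messages condensed]")
    return [placeholder] + recent


history: List[Message] = []
for i in range(6):
    history.append(("user", f"question number {i}"))
    history.append(("assistant", f"answer number {i}"))

compact = compress_history(history, compress_threshold_tokens=20, keep_recent_turns=2)
print(len(history), "->", len(compact))
```

The recent turns pass through verbatim, which is the property that matters for conversation quality: the model always sees the latest exchange unmodified.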
Retriever integration
HeadroomDocumentCompressor filters retrieved documents by relevance. Retrieve many for recall, keep the best for precision:
from langchain.retrievers import ContextualCompressionRetriever
from headroom.integrations import HeadroomDocumentCompressor
# vectorstore is assumed to be an existing LangChain vector store
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,  # MMR-style diversity
)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
# Retrieves 50 docs, returns the best 10
docs = retriever.invoke("What is Python?")
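The three compressor settings map onto a familiar selection procedure: drop documents below a relevance floor, then pick greedily with a maximal marginal relevance (MMR) score until the cap is reached. The sketch below illustrates that general technique only. The `select_documents` function, the `diversity_weight` parameter, and the toy two-dimensional embeddings are assumptions, and Headroom's actual scoring may differ.

```python
import math
from typing import List, Sequence, Tuple

Doc = Tuple[str, float, Sequence[float]]  # (text, relevance, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_documents(
    docs: List[Doc],
    max_documents: int,
    min_relevance: float,
    diversity_weight: float = 0.5,
) -> List[Doc]:
    """Filter by relevance, then pick greedily with an MMR score."""
    candidates = [doc for doc in docs if doc[1] >= min_relevance]
    selected: List[Doc] = []
    while candidates and len(selected) < max_documents:

        def mmr_score(doc: Doc) -> float:
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(doc[2], chosen[2]) for chosen in selected),
                default=0.0,
            )
            return (1 - diversity_weight) * doc[1] - diversity_weight * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


docs: List[Doc] = [
    ("python intro", 0.9, (1.0, 0.0)),
    ("python intro copy", 0.85, (1.0, 0.0)),
    ("python history", 0.6, (0.0, 1.0)),
    ("unrelated", 0.1, (0.5, 0.5)),
]
picked = [text for text, _, _ in select_documents(docs, 2, 0.3)]
print(picked)
```

The near-duplicate scores higher on relevance than "python history" but loses the second slot because it adds nothing new, which is the trade-off a diversity preference is meant to make.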
Tool output compression
wrap_tools_with_headroom compresses tool outputs before they re-enter the agent’s context. Tool outputs such as JSON, logs, and search results can compress by 70–90%:
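import json
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.tools import tool
from headroom.integrations import wrap_tools_with_headroom
@tool
def search_database(query: str) -> str:
    """Search the database."""
    # [...] is a placeholder for real result rows
    return json.dumps({"results": [...], "total": 1000})
wrapped_tools = wrap_tools_with_headroom(
    [search_database],
    min_chars_to_compress=1000,
)
# llm is the wrapped model from the earlier examples; prompt must be an
# agent prompt that includes an agent_scratchpad placeholder
agent = create_openai_tools_agent(llm, wrapped_tools, prompt)
executor = AgentExecutor(agent=agent, tools=wrapped_tools)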
Per-tool metrics:
from headroom.integrations import get_tool_metrics
metrics = get_tool_metrics()
print(metrics.get_summary())
# {'total_invocations': 25, 'total_compressions': 18, 'total_chars_saved': 450000}
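The role of a character threshold such as min_chars_to_compress can be shown with a plain decorator: short outputs pass through untouched and only large ones are rewritten. This is a sketch under assumptions, not the behavior of wrap_tools_with_headroom. The `compress_tool_output` decorator and `compress_json_lists` helper are hypothetical, and truncating lists is a deliberately crude stand-in for real compression.

```python
import functools
import json
from typing import Callable


def compress_json_lists(text: str, max_items: int = 3) -> str:
    """Toy compressor: truncate top-level JSON lists and record how many items were dropped."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return text  # not JSON: leave unchanged in this sketch
    if not isinstance(data, dict):
        return text
    for key, value in list(data.items()):
        if isinstance(value, list) and len(value) > max_items:
            data[key] = value[:max_items]
            data[f"{key}_omitted"] = len(value) - max_items
    return json.dumps(data)


def compress_tool_output(min_chars_to_compress: int) -> Callable:
    """Decorator factory: compress only outputs at or above the threshold."""

    def decorator(tool_fn: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(tool_fn)
        def wrapper(*args, **kwargs) -> str:
            output = tool_fn(*args, **kwargs)
            if len(output) < min_chars_to_compress:
                return output  # small outputs are not worth compressing
            return compress_json_lists(output)

        return wrapper

    return decorator


@compress_tool_output(min_chars_to_compress=1000)
def search_database(query: str) -> str:
    """Search the database (fake rows for the demo)."""
    rows = [{"id": i, "match": query} for i in range(200)]
    return json.dumps({"results": rows, "total": len(rows)})


compressed = json.loads(search_database("alice"))
print(len(compressed["results"]), compressed["results_omitted"])
```

Keeping the `total` field and an explicit omitted count lets the agent know that more rows exist, so it can issue a narrower query instead of assuming the truncated list is complete.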
LangGraph ReAct agent
HeadroomChatModel works directly with LangGraph’s create_react_agent:
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
# search_web and query_database are your own @tool functions, defined elsewhere
tools = wrap_tools_with_headroom([search_web, query_database])
agent = create_react_agent(llm, tools)
result = agent.invoke({
    "messages": [("user", "Find users who signed up last week")]
})
LangGraph custom graph
Insert a compression node between tool outputs and the agent in a custom StateGraph:
from langgraph.graph import StateGraph, MessagesState, START, END
from headroom.integrations.langchain import create_compress_tool_messages_node
graph = StateGraph(MessagesState)
# agent_node and tools_node are your own node functions, defined elsewhere
graph.add_node("agent", agent_node)
graph.add_node("tools", tools_node)
graph.add_node("compress", create_compress_tool_messages_node(
    min_tokens_to_compress=100,
))
# Wire: tools -> compress -> agent
graph.add_edge(START, "agent")
graph.add_edge("tools", "compress")
graph.add_edge("compress", "agent")
# Still required before use: routing from "agent" to "tools" or END, then graph.compile()
Streaming
Async calls are fully supported. Both ainvoke() and astream() compress messages before the underlying model call:
import asyncio
async def main():
    # Async invoke
    response = await llm.ainvoke("Hello!")
    # Async streaming
    async for chunk in llm.astream("Tell me a story"):
        print(chunk.content, end="", flush=True)
asyncio.run(main())
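Compressing before a streamed call means the compression step runs once, up front, and the chunks are then forwarded unchanged. The sketch below shows that ordering with plain asyncio. `FakeStreamingModel` and `CompressingStream` are hypothetical stand-ins, and keeping only the last few messages is a toy substitute for real compression.

```python
import asyncio
from typing import AsyncIterator, List


class FakeStreamingModel:
    """Stand-in model that streams one chunk per message it receives (hypothetical)."""

    async def astream(self, messages: List[str]) -> AsyncIterator[str]:
        for message in messages:
            await asyncio.sleep(0)  # yield control, as a network call would
            yield message.upper()


class CompressingStream:
    """Compresses the input once, then forwards chunks untouched (hypothetical)."""

    def __init__(self, inner: FakeStreamingModel, max_messages: int) -> None:
        self.inner = inner
        self.max_messages = max_messages

    async def astream(self, messages: List[str]) -> AsyncIterator[str]:
        # Toy compression: keep only the most recent messages.
        trimmed = messages[-self.max_messages:]
        async for chunk in self.inner.astream(trimmed):
            yield chunk


async def collect(messages: List[str]) -> List[str]:
    model = CompressingStream(FakeStreamingModel(), max_messages=2)
    return [chunk async for chunk in model.astream(messages)]


chunks = asyncio.run(collect(["old", "recent", "latest"]))
print(chunks)
```

Because only the input is rewritten, the consumer's streaming loop does not need to change.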
LangChain callbacks for observability
HeadroomCallbackHandler tracks token usage across chains without modifying messages. Use it alongside HeadroomChatModel when you need alerting or structured request history:
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomCallbackHandler, HeadroomChatModel
handler = HeadroomCallbackHandler(
    log_level="INFO",
    token_alert_threshold=10000,
)
# HeadroomChatModel does the compression; the handler does the observability
llm = HeadroomChatModel(
    ChatOpenAI(model="gpt-4o", callbacks=[handler])
)
response = llm.invoke("Hello!")
print(f"Total tokens: {handler.total_tokens}")
print(f"Alerts: {handler.alerts}")
print(handler.get_summary())
LangChain callbacks cannot modify messages by design. HeadroomCallbackHandler is for observability only — use HeadroomChatModel for actual compression.
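An observability-only component can be sketched as a tracker that records usage and raises alerts without ever touching the messages. The `TokenUsageTracker` class below is hypothetical, and it assumes an alert fires when a single request exceeds the threshold. HeadroomCallbackHandler may define its threshold and summary fields differently.

```python
from typing import Dict, List


class TokenUsageTracker:
    """Hypothetical observer: records usage and flags large requests."""

    def __init__(self, token_alert_threshold: int) -> None:
        self.token_alert_threshold = token_alert_threshold
        self.total_tokens = 0
        self.requests: List[int] = []
        self.alerts: List[str] = []

    def on_request(self, tokens: int) -> None:
        self.requests.append(tokens)
        self.total_tokens += tokens
        # Assumption: the threshold applies per request, not cumulatively.
        if tokens > self.token_alert_threshold:
            self.alerts.append(
                f"request used {tokens} tokens "
                f"(threshold {self.token_alert_threshold})"
            )

    def get_summary(self) -> Dict[str, int]:
        return {
            "total_requests": len(self.requests),
            "total_tokens": self.total_tokens,
            "alert_count": len(self.alerts),
        }


tracker = TokenUsageTracker(token_alert_threshold=10000)
for tokens in (1200, 15000, 800):
    tracker.on_request(tokens)
print(tracker.get_summary())
```

The tracker only reads token counts, which matches the division of labor described above: compression happens in the model wrapper, and measurement happens in the handler.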
Custom configuration
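Pass a HeadroomConfig to HeadroomChatModel to tune compression:
from langchain_openai import ChatOpenAI
from headroom import HeadroomConfig, HeadroomMode
from headroom.integrations import HeadroomChatModel
config = HeadroomConfig(
    default_mode=HeadroomMode.OPTIMIZE,
    smart_crusher_target_ratio=0.3,
)
llm = HeadroomChatModel(
    ChatOpenAI(model="gpt-4o"),
    config=config,
)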