Use Headroom with LangChain Chat Models

Headroom integrates with LangChain by wrapping your chat model in HeadroomChatModel. Because LangChain callbacks cannot modify messages by design, wrapping the model directly is the correct approach — it intercepts every invoke(), stream(), and astream() call and compresses messages before they reach the underlying LLM. You keep using LangChain exactly as before.

Installation

pip install "headroom-ai[langchain]"

Quick start

Wrap any BaseChatModel in one line:

from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like before
response = llm.invoke("Hello!")

# Check savings
print(llm.get_savings_summary())
# {'total_requests': 1, 'total_tokens_saved': 12500, 'average_savings_percent': 45.2}

Works with any LangChain provider:

from langchain_anthropic import ChatAnthropic

llm = HeadroomChatModel(ChatAnthropic(model="claude-sonnet-4-20250514"))

Full chain example

HeadroomChatModel is a drop-in replacement in any LCEL chain:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from headroom.integrations import HeadroomChatModel

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}"),
])

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

chain = prompt | llm | StrOutputParser()

response = chain.invoke({"input": "Summarize these search results..."})
print(llm.get_savings_summary())

Memory integration

HeadroomChatMessageHistory wraps any chat history with automatic compression. Long conversations stay under your token budget:

from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from headroom.integrations import HeadroomChatMessageHistory

base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,  # Compress when over 4K tokens
    keep_recent_turns=5,             # Always keep last 5 turns
)

memory = ConversationBufferMemory(chat_memory=compressed_history)

After usage:

print(compressed_history.get_compression_stats())
# {'compression_count': 12, 'total_tokens_saved': 28000}

Retriever integration

HeadroomDocumentCompressor filters retrieved documents by relevance. Retrieve many for recall, keep the best for precision:

from langchain.retrievers import ContextualCompressionRetriever
from headroom.integrations import HeadroomDocumentCompressor

base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,  # MMR-style diversity
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

# Retrieves 50 docs, returns the best 10
docs = retriever.invoke("What is Python?")

Agent tool wrapping

wrap_tools_with_headroom compresses tool outputs before they re-enter the agent’s context. Tool outputs (JSON, logs, search results) can compress by 70–90%:

from langchain_core.tools import tool
from headroom.integrations import wrap_tools_with_headroom

@tool
def search_database(query: str) -> str:
    """Search the database."""
    return json.dumps({"results": [...], "total": 1000})

wrapped_tools = wrap_tools_with_headroom(
    [search_database],
    min_chars_to_compress=1000,
)

agent = create_openai_tools_agent(llm, wrapped_tools, prompt)
executor = AgentExecutor(agent=agent, tools=wrapped_tools)

Per-tool metrics:

from headroom.integrations import get_tool_metrics

metrics = get_tool_metrics()
print(metrics.get_summary())
# {'total_invocations': 25, 'total_compressions': 18, 'total_chars_saved': 450000}

LangGraph ReAct agent

HeadroomChatModel works directly with LangGraph’s create_react_agent:

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
tools = wrap_tools_with_headroom([search_web, query_database])

agent = create_react_agent(llm, tools)
result = agent.invoke({
    "messages": [("user", "Find users who signed up last week")]
})

LangGraph custom graph

Insert a compression node between tool outputs and the agent in a custom StateGraph:

from langgraph.graph import StateGraph, MessagesState, START, END
from headroom.integrations.langchain import create_compress_tool_messages_node

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tools_node)
graph.add_node("compress", create_compress_tool_messages_node(
    min_tokens_to_compress=100,
))

# Wire: tools -> compress -> agent
graph.add_edge(START, "agent")
graph.add_edge("tools", "compress")
graph.add_edge("compress", "agent")

Streaming

Full async support — ainvoke() and astream() both compress before the underlying model call:

# Async invoke
response = await llm.ainvoke("Hello!")

# Async streaming
async for chunk in llm.astream("Tell me a story"):
    print(chunk.content, end="", flush=True)

LangChain callbacks for observability

HeadroomCallbackHandler tracks token usage across chains without modifying messages. Use it alongside HeadroomChatModel when you need alerting or structured request history:

from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomCallbackHandler, HeadroomChatModel

handler = HeadroomCallbackHandler(
    log_level="INFO",
    token_alert_threshold=10000,
)

# HeadroomChatModel does the compression; the handler does the observability
llm = HeadroomChatModel(
    ChatOpenAI(model="gpt-4o", callbacks=[handler])
)

response = llm.invoke("Hello!")

print(f"Total tokens: {handler.total_tokens}")
print(f"Alerts: {handler.alerts}")
print(handler.get_summary())

LangChain callbacks cannot modify messages by design. HeadroomCallbackHandler is for observability only — use HeadroomChatModel for actual compression.

Custom configuration

from headroom import HeadroomConfig, HeadroomMode

config = HeadroomConfig(
    default_mode=HeadroomMode.OPTIMIZE,
    smart_crusher_target_ratio=0.3,
)

llm = HeadroomChatModel(
    ChatOpenAI(model="gpt-4o"),
    config=config,
)

Get Started

Modes of Use

Core Concepts

Features

Integrations

Operations

Use Headroom with LangChain Chat Models

Installation

Quick start

Full chain example

Memory integration

Retriever integration

Agent tool wrapping

LangGraph ReAct agent

LangGraph custom graph

Streaming

LangChain callbacks for observability

Custom configuration

Build docs developers (and LLMs) love

Get Started

Modes of Use

Core Concepts

Features

Integrations

Operations

Documentation Index

​Installation

​Quick start

​Full chain example

​Memory integration

​Retriever integration

​Agent tool wrapping

​LangGraph ReAct agent

​LangGraph custom graph

​Streaming

​LangChain callbacks for observability

​Custom configuration

Build docs developers (and LLMs) love

Installation

Quick start

Full chain example

Memory integration

Retriever integration

Agent tool wrapping

LangGraph ReAct agent

LangGraph custom graph

Streaming

LangChain callbacks for observability

Custom configuration