LLMs forget everything the moment a conversation ends. Headroom’s persistent memory layer solves this by extracting key facts inline during each response, persisting them with vector and full-text indexes, and injecting the most relevant ones automatically on the next turn — turning thousands of tokens of replayed history into a compact, semantically-searched memory store.
The fastest path is with_memory(), which wraps any OpenAI-compatible client in a single line. Memory extraction happens inline as part of the LLM response — no extra API calls, no extra latency.
from openai import OpenAI
from headroom import with_memory

# One line, that's it
client = with_memory(OpenAI(), user_id="alice")

# Use exactly like normal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Python for backend work"}],
)
# Memory extracted INLINE, zero extra latency

# Later, in a new conversation...
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language should I use?"}],
)
# Response uses the Python preference from memory
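To make "inline extraction" concrete, here is a minimal sketch of the general idea, not Headroom's actual implementation. It assumes the model has been prompted to append facts inside a hypothetical `<memory>...</memory>` block; the tag name and the `split_reply` helper are illustrative assumptions, and the real wire format may differ.

```python
import re

# Hypothetical tag format, assumed for illustration only.
MEMORY_BLOCK = re.compile(r"<memory>(.*?)</memory>", re.DOTALL)


def split_reply(raw_reply: str) -> tuple[str, list[str]]:
    """Separate the user-visible text from facts the model embedded inline."""
    facts: list[str] = []
    for block in MEMORY_BLOCK.findall(raw_reply):
        # One fact per non-empty line inside the block.
        facts.extend(line.strip() for line in block.splitlines() if line.strip())
    # Remove every memory block so the caller only sees the answer text.
    visible = MEMORY_BLOCK.sub("", raw_reply).strip()
    return visible, facts


raw = (
    "Python is a solid choice.\n"
    "<memory>\nUser prefers Python for backend work\n</memory>"
)
visible, facts = split_reply(raw)
print(visible)
print(facts)
```

Because the facts travel in the same completion as the answer, a design like this needs no second model call; the cost is a few extra output tokens per response.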
Memories exist at four scope levels, from broadest to narrowest. Facts stored at a broader scope are automatically visible to narrower scopes — a language preference saved at the User level is recalled in every future session.
| Scope | Persists Across | Use Case |
|---|---|---|
| User | All sessions, all time | Long-term preferences, identity |
| Session | Current session only | Current task context |
| Agent | Current agent in session | Agent-specific context |
| Turn | Single turn only | Ephemeral working memory |
from openai import OpenAI
from headroom import with_memory

# Session 1: Morning
client1 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="morning-session",
)
response = client1.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Go for performance-critical code"}],
)
# Memory stored at USER level (persists across sessions)

# Session 2: Afternoon (different session, same user)
client2 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="afternoon-session",
)
response = client2.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language for my new microservice?"}],
)
# Recalls Go preference from morning session
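The visibility rule for the scope levels described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Headroom's code: it covers only the User and Session scopes, and the `Scope`, `Fact`, and `recall` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    USER = "user"        # visible in every session for that user
    SESSION = "session"  # visible only in the session that stored it


@dataclass(frozen=True)
class Fact:
    text: str
    scope: Scope
    user_id: str
    session_id: Optional[str] = None


def recall(facts: list[Fact], user_id: str, session_id: str) -> list[str]:
    """Return the facts visible to `user_id` inside `session_id`."""
    visible = []
    for fact in facts:
        if fact.user_id != user_id:
            continue  # never surface another user's memories
        if fact.scope is Scope.USER or fact.session_id == session_id:
            visible.append(fact.text)
    return visible


store = [
    Fact("Prefers Go for performance-critical code", Scope.USER, "bob"),
    Fact("Debugging the billing service", Scope.SESSION, "bob", "morning-session"),
]
print(recall(store, "bob", "morning-session"))
print(recall(store, "bob", "afternoon-session"))
```

The afternoon session sees only the user-level preference, which matches the behavior of the two-session example: broader scopes flow down, narrower ones stay put.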
from headroom.memory import Memory

# No setup required: works out of the box
memory = Memory()

await memory.save("User prefers dark mode and uses Python", user_id="alice")

results = await memory.search("What programming language?", user_id="alice")
for r in results:
    print(r.content, r.score)
# Start services first
docker compose up -d qdrant neo4j
from headroom.memory import Memory

# Same API, production-grade backends
memory = Memory(backend="qdrant-neo4j")

await memory.save("User works at Netflix", user_id="alice")
Recommended — fast, no torch, ~30 MB int8-quantized
| Backend | Notes |
|---|---|
| EmbedderBackend.LOCAL | sentence-transformers (requires PyTorch ~2 GB) |
| EmbedderBackend.OPENAI | Higher quality, costs money; needs openai_api_key |
| EmbedderBackend.OLLAMA | Local Ollama server; needs ollama_base_url |
On Apple Silicon you can offload embedding to the GPU with pip install "headroom-ai[pytorch-mps]" and export HEADROOM_EMBEDDER_RUNTIME=pytorch_mps. The embedder serializes MPS calls internally — nothing else to configure.
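If the same launcher script runs on several machines, one option is to set the variable only where it applies. The sketch below is an assumption-laden illustration: `embedder_env` is a hypothetical helper, and it assumes the variable is read when the embedder loads, so it should be set before Headroom is imported.

```python
import os
import platform


def embedder_env(system: str, machine: str) -> dict[str, str]:
    """Return env overrides for the embedder; empty unless on Apple Silicon."""
    # Native Python on Apple Silicon reports Darwin / arm64.
    if system == "Darwin" and machine == "arm64":
        return {"HEADROOM_EMBEDDER_RUNTIME": "pytorch_mps"}
    return {}


# Apply before importing anything that loads the embedder.
os.environ.update(embedder_env(platform.system(), platform.machine()))
```

Passing `system` and `machine` as arguments keeps the decision easy to test on any host.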
with_memory()
Wraps an existing LLM client. Memory extraction is automatic and inline: the model writes facts in its response and the wrapper strips them before returning. Ideal when you already use the OpenAI SDK.
Memory()
Standalone async API — save(), search(), clear(). Use this when you want full control, or when your stack doesn’t use OpenAI-compatible clients.
from headroom.memory import Memory

memory = Memory()

# Save a fact
await memory.save(
    "User prefers dark mode and uses Python",
    user_id="alice",
)

# Semantic search
results = await memory.search(
    "What programming language?",
    user_id="alice",
)
for r in results:
    print(r.content, r.score)
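The snippets on this page use top-level `await`, which works in a notebook or an asyncio-aware REPL but not in a plain script. One way to drive an async API from a script is `asyncio.run()`. The sketch below uses a hypothetical in-process stand-in (`FakeMemory`, with naive word-overlap scoring) so that it runs without Headroom installed; it only mirrors the `save()`/`search()` call shape shown above and says nothing about how the real store ranks results.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Result:
    content: str
    score: float


class FakeMemory:
    """Hypothetical stand-in for headroom.memory.Memory (illustration only)."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    async def save(self, content: str, user_id: str) -> None:
        self._facts.setdefault(user_id, []).append(content)

    async def search(self, query: str, user_id: str) -> list[Result]:
        query_words = set(query.lower().split())
        if not query_words:
            return []
        results = []
        for content in self._facts.get(user_id, []):
            overlap = query_words & set(content.lower().split())
            # Naive score: fraction of query words found in the stored fact.
            results.append(Result(content, len(overlap) / len(query_words)))
        return sorted(results, key=lambda r: r.score, reverse=True)


async def main() -> list[Result]:
    memory = FakeMemory()
    await memory.save("User prefers dark mode and uses Python", user_id="alice")
    return await memory.search("uses python", user_id="alice")


# asyncio.run() creates the event loop, runs main(), and closes the loop.
results = asyncio.run(main())
for r in results:
    print(r.content, r.score)
```

With the real library, the body of `main()` would hold the `Memory()` calls from the example above; the `asyncio.run(main())` pattern stays the same.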
The with_memory() wrapper also exposes a .memory attribute for direct access:
client = with_memory(OpenAI(), user_id="alice")

# Semantic search
results = client.memory.search("python preferences", top_k=5)

# Add a memory manually
client.memory.add("User is a senior engineer", importance=0.9)

# Get all memories for this user
all_memories = client.memory.get_all()

# Stats
stats = client.memory.stats()
print(f"Total memories: {stats['total']}")
The memory store is shared across Claude, Codex, and Gemini. A fact learned in one agent is automatically available to the others. Agent provenance is tracked, and duplicate facts are removed.
from headroom.memory import Memory

memory = Memory()

# Claude Code stores a fact
await memory.save(
    "Project uses uv, not pip",
    user_id="team",
    agent_id="claude",
)

# Codex recalls it in a different session
results = await memory.search(
    "package manager",
    user_id="team",
    agent_id="codex",
)
# Returns: "Project uses uv, not pip"
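One way to picture provenance tracking combined with deduplication is sketched below. This is illustrative only, not Headroom's algorithm: it treats two facts as duplicates when they match after lowercasing and whitespace normalization, whereas a real store might compare embeddings instead. `SharedStore` and its methods are hypothetical names.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Dedup key (assumption: exact match after case and whitespace cleanup)."""
    return " ".join(text.lower().split())


class SharedStore:
    """Hypothetical store: one entry per distinct fact, plus the agents that reported it."""

    def __init__(self) -> None:
        self._text: dict[str, str] = {}
        self._agents: dict[str, set[str]] = defaultdict(set)

    def save(self, text: str, agent_id: str) -> bool:
        """Record the fact and its source; return True only if the fact was new."""
        key = normalize(text)
        is_new = key not in self._text
        if is_new:
            self._text[key] = text
        # Provenance is recorded even when the fact itself is a duplicate.
        self._agents[key].add(agent_id)
        return is_new

    def provenance(self, text: str) -> set[str]:
        return set(self._agents.get(normalize(text), set()))

    def __len__(self) -> int:
        return len(self._text)


store = SharedStore()
store.save("Project uses uv, not pip", agent_id="claude")
store.save("project uses uv,  not pip", agent_id="codex")  # duplicate, new source
print(len(store), sorted(store.provenance("Project uses uv, not pip")))
```

Keeping a single entry per fact while accumulating the set of contributing agents means search results stay free of repeats without losing track of where a fact came from.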
For compressed context passing between agents in a single workflow (rather than persistent long-term facts), use SharedContext instead. See Shared Context for details.