Hindsight is architected from the ground up to prioritize read performance over write performance. Memories are written once but read many times, so the system front-loads expensive work during retention to keep retrieval fast.
Design philosophy
All heavy lifting happens at write time so reads are cheap:
- Pre-computed embeddings: Generated and indexed during retain, not retrieval
- Fact extraction at write time: LLM-based extraction happens during retain
- Structured memory graphs: Entity relationships and temporal information resolved upfront
- Optimized vector indexes: HNSW or DiskANN for sub-50ms nearest-neighbor search
Typical latencies
| Operation | Typical latency | Primary bottleneck |
|---|---|---|
| Recall | 100–600ms | Cross-encoder reranker (CPU) |
| Reflect | 800–3000ms | LLM generation |
| Retain | 500ms–2000ms per batch | LLM fact extraction |
Retain performance
Retain is slower because it involves LLM-based fact extraction, entity recognition, temporal reasoning, relationship mapping, and embedding generation. The LLM is the primary bottleneck.
Choose a high-throughput LLM provider
Hindsight’s fact extraction is structured and well-defined, so smaller models work well. The recommended model is `gpt-oss-20b` via Groq. Other high-throughput options:
- Self-hosted models on GPU clusters (vLLM, TGI)
- OpenAI with Flex Processing (50% cost savings, variable latency)
- Multiple API keys distributed across retain workers
Batch operations
Send large payloads, up to the HTTP limit, in a single retain request. Hindsight automatically splits large batches (>10,000 tokens) into optimized sub-batches and processes them in parallel. You do not need to manually tune batch sizes.
Use async mode for large datasets
For large ingestion jobs, use async retain to queue operations in the background. Enable the Batch API (`HINDSIGHT_API_RETAIN_BATCH_ENABLED`) for a 50% reduction in LLM fact-extraction costs.
Tune concurrency for rate-limited providers
If your LLM provider has tight rate limits, reduce concurrent requests to avoid errors.
Retain environment variables
| Variable | Description | Default |
|---|---|---|
| `HINDSIGHT_API_RETAIN_MAX_CONCURRENT` | Max concurrent retain DB phases | 4 |
| `HINDSIGHT_API_RETAIN_BATCH_ENABLED` | Use LLM Batch API (50% cost, async only) | false |
| `HINDSIGHT_API_RETAIN_BATCH_TOKENS` | Max tokens per sub-batch for auto-splitting | 10000 |
| `HINDSIGHT_API_RETAIN_CHUNK_SIZE` | Max characters per chunk for fact extraction | 3000 |
| `HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS` | Max completion tokens for fact extraction | 64000 |
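As a rough sketch of how these variables might be combined for a rate-limited provider, the snippet below builds a set of environment overrides and estimates how many sub-batches a payload would produce. The variable names come from the table above. The helper names, the chosen values, and the assumption that splitting behaves like a simple ceiling division over the token budget are illustrative only and are not Hindsight's actual splitting logic.

```python
import math
import os

# Overrides for a provider with tight rate limits. Variable names are from
# the table above; the values chosen here are illustrative assumptions.
RETAIN_OVERRIDES = {
    "HINDSIGHT_API_RETAIN_MAX_CONCURRENT": "2",    # below the default of 4
    "HINDSIGHT_API_RETAIN_BATCH_ENABLED": "true",  # applies to async retain only
    "HINDSIGHT_API_RETAIN_BATCH_TOKENS": "10000",  # documented default
}


def build_env(overrides, base=None):
    """Return a copy of the base environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def estimate_sub_batches(total_tokens, batch_tokens=10000):
    """Estimate the sub-batch count, assuming an even split by token budget."""
    if total_tokens <= 0:
        return 0
    return math.ceil(total_tokens / batch_tokens)


env = build_env(RETAIN_OVERRIDES, base={})
print(env["HINDSIGHT_API_RETAIN_MAX_CONCURRENT"])  # 2
print(estimate_sub_batches(45_000))                # 5
```

The resulting mapping can be passed as `env=` to `subprocess.Popen` (or exported in a shell) for whichever command starts your Hindsight process.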
Recall performance
Recall is fast because all embeddings and indexes are pre-built. Typical query time for vector search on 100K+ facts is 10–50ms. Total recall latency (100–600ms) is dominated by the cross-encoder reranker running on CPU.
Retrieval budgets
The `budget` parameter controls search depth. Match the budget to query complexity:
| Budget | Items retrieved per method | Use case |
|---|---|---|
| `low` | 100 | Quick lookups, real-time chat |
| `mid` | 300 (default) | Standard queries, balanced performance |
| `high` | 1000 | Comprehensive analysis, thorough research |
The item counts for each budget level can be overridden with environment variables:
| Variable | Description | Default |
|---|---|---|
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW` | Items per method when budget=low | 100 |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_MID` | Items per method when budget=mid | 300 |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH` | Items per method when budget=high | 1000 |
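To illustrate how a budget level maps to a retrieval depth, here is a minimal sketch that resolves the items-per-method value from the environment and falls back to the documented defaults. The function name `items_per_method` is hypothetical, and this is not Hindsight's own resolution code. Only the variable names and defaults are taken from the tables above.

```python
import os

# Documented defaults, keyed by budget level.
DEFAULT_ITEMS = {"low": 100, "mid": 300, "high": 1000}


def items_per_method(budget="mid", environ=None):
    """Resolve items retrieved per method for a budget level.

    Reads HINDSIGHT_API_RECALL_BUDGET_FIXED_<LEVEL> and falls back to the
    documented default. Illustrative only.
    """
    environ = os.environ if environ is None else environ
    if budget not in DEFAULT_ITEMS:
        raise ValueError(f"unknown budget: {budget!r}")
    name = f"HINDSIGHT_API_RECALL_BUDGET_FIXED_{budget.upper()}"
    return int(environ.get(name, DEFAULT_ITEMS[budget]))


print(items_per_method("low", environ={}))  # 100
print(items_per_method(
    "high", environ={"HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH": "500"}
))  # 500
```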
When you set `max_tokens`, use the adaptive budget function to scale retrieval breadth automatically.
Reduce reranker latency
The cross-encoder reranker is the main latency driver for recall. Options to speed it up:
- GPU acceleration
- Optimize local reranker
- Lightweight reranker
- Cloud reranker
For GPU acceleration, use a TEI server with GPU hardware for high-performance reranking.
Recall environment variables
| Variable | Description | Default |
|---|---|---|
| `HINDSIGHT_API_RECALL_MAX_CONCURRENT` | Max concurrent recall operations per worker | 32 |
| `HINDSIGHT_API_RECALL_CONNECTION_BUDGET` | Max concurrent DB connections per recall operation | 4 |
| `HINDSIGHT_API_RERANKER_MAX_CANDIDATES` | Max candidates to rerank per recall | 300 |
| `HINDSIGHT_API_RECALL_MAX_QUERY_TOKENS` | Max token length of a recall query (HTTP 400 if exceeded) | 500 |
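The first two variables multiply, which matters when sizing the database. The sketch below computes a worst-case connection count for recall, assuming every concurrent recall operation uses its full connection budget at the same moment and that no other pool limit applies. Both are assumptions, and the function name is hypothetical.

```python
def max_recall_db_connections(max_concurrent=32, connection_budget=4, workers=1):
    """Upper bound on DB connections opened by recall across all workers."""
    if min(max_concurrent, connection_budget, workers) < 0:
        raise ValueError("all arguments must be non-negative")
    return max_concurrent * connection_budget * workers


print(max_recall_db_connections())           # 128 with the documented defaults
print(max_recall_db_connections(workers=3))  # 384 for three workers
```

Comparing this bound against PostgreSQL's `max_connections` setting is a quick way to check whether the defaults fit your database before adding workers.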
Reflect performance
Reflect latency breaks down into two phases: memory search (100–600ms) and LLM generation (500–2000ms). Total end-to-end latency is typically 600–2600ms.
| Component | Latency | Optimization |
|---|---|---|
| Memory search | 100–600ms | Lower budget, faster reranker |
| LLM generation | 500–2000ms | Faster provider or model |
| Total | 600–2600ms | Stream responses for perceived latency |
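The total row is the sum of the two component ranges. A small worked calculation, using only the figures from the table (the helper name is hypothetical), makes it easy to see how much a faster reranker or provider would move the end-to-end number:

```python
def add_ranges(*ranges):
    """Sum (low, high) latency ranges given in milliseconds."""
    return (sum(low for low, _ in ranges), sum(high for _, high in ranges))


memory_search_ms = (100, 600)
llm_generation_ms = (500, 2000)

print(add_ranges(memory_search_ms, llm_generation_ms))  # (600, 2600)
```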
Optimization strategies
- Use `budget=low` or `budget=mid` when the question doesn’t require exhaustive search
- Provide relevant `context` in the reflect request to reduce the recall scope
- Use a faster LLM provider for reflect while keeping a stronger model for retain
Scaling
Horizontal scaling
Deploy multiple API instances behind a load balancer with a shared PostgreSQL database. Workers process background tasks (retain, consolidation) across all instances without extra coordination, because the task queue is backed by PostgreSQL.
Distributed workers
For high-throughput ingestion, run dedicated worker processes separate from the API. The following variables control worker capacity:
| Variable | Description | Default |
|---|---|---|
| `HINDSIGHT_API_WORKER_MAX_SLOTS` | Max concurrent tasks per worker | 10 |
| `HINDSIGHT_API_WORKER_CONSOLIDATION_MAX_SLOTS` | Reserved slots for consolidation | 2 |
| `HINDSIGHT_API_WORKER_RETAIN_MAX_SLOTS` | Reserved slots for retain | 0 |
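To make the slot arithmetic concrete, the sketch below treats the per-operation values as reservations carved out of the worker's total rather than added to it, and computes what is left over for any task type. The function name is hypothetical, and the rejection of over-committed reservations is an assumption of this sketch, not documented Hindsight behavior.

```python
def shared_pool_slots(max_slots=10, reserved=None):
    """Slots left for any task type after per-operation reservations."""
    if reserved is None:
        # Documented defaults: 2 reserved for consolidation, 0 for retain.
        reserved = {"consolidation": 2, "retain": 0}
    total_reserved = sum(reserved.values())
    if total_reserved > max_slots:
        raise ValueError("reservations exceed the worker's max slots")
    return max_slots - total_reserved


print(shared_pool_slots())                                       # 8
print(shared_pool_slots(10, {"consolidation": 2, "retain": 4}))  # 4
```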
Per-operation slot values are reservations within `WORKER_MAX_SLOTS`, not additive pools. The remaining capacity forms a shared pool usable by any operation type.
Read replicas
Use a read replica to offload recall queries from the primary database.
Cost optimization
Use efficient models for retain
`gpt-oss-20b` via Groq handles fact extraction well at low cost. Reserve frontier models for reflect.
Enable Batch API
Set `HINDSIGHT_API_RETAIN_BATCH_ENABLED=true` with async retain for 50% cost savings on LLM calls (OpenAI and Groq).
Control token budgets
Use `budget=low` for simple queries. Set `max_tokens` to limit response size for recall and reflect.
Optimize chunk size
Larger chunks (1000–2000 tokens) are more efficient than many small ones — fewer LLM calls per document.
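A back-of-the-envelope sketch of why chunk size affects cost: assuming one fact-extraction call per chunk (an assumption, not a documented guarantee) and measuring in characters as `HINDSIGHT_API_RETAIN_CHUNK_SIZE` does, the number of LLM calls for a document falls as chunks grow. The function name is hypothetical.

```python
import math


def extraction_calls(doc_chars, chunk_size=3000):
    """Estimate fact-extraction calls, assuming one LLM call per chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if doc_chars <= 0:
        return 0
    return math.ceil(doc_chars / chunk_size)


print(extraction_calls(30_000))         # 10 calls at the default chunk size
print(extraction_calls(30_000, 1_000))  # 30 calls with smaller chunks
```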
