Hindsight is architected from the ground up to prioritize read performance over write performance. Memories are written once but read many times, so the system front-loads expensive work during retention to keep retrieval fast.

Design philosophy

All heavy lifting happens at write time so reads are cheap:
  • Pre-computed embeddings: Generated and indexed during retain, not retrieval
  • Fact extraction at write time: LLM-based extraction happens during retain
  • Structured memory graphs: Entity relationships and temporal information resolved upfront
  • Optimized vector indexes: HNSW or DiskANN for sub-50ms nearest-neighbor search
This means retain is slower by design, and that’s the right trade-off for memory systems where the read-to-write ratio is typically 10:1 or higher.

Typical latencies

| Operation | Typical latency | Primary bottleneck |
| --- | --- | --- |
| Recall | 100–600ms | Cross-encoder reranker (CPU) |
| Reflect | 800–3000ms | LLM generation |
| Retain | 500–2000ms per batch | LLM fact extraction |

Retain performance

Retain is slower because it involves LLM-based fact extraction, entity recognition, temporal reasoning, relationship mapping, and embedding generation. The LLM is the primary bottleneck.

Choose a high-throughput LLM provider

Hindsight’s fact extraction is structured and well-defined, so smaller models work well. The recommended model is gpt-oss-20b via Groq.
# Recommended: Groq with gpt-oss-20b
export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=openai/gpt-oss-20b
Alternatives for high throughput:
  • Self-hosted models on GPU clusters (vLLM, TGI)
  • OpenAI with Flex Processing (50% cost savings, variable latency)
  • Multiple API keys distributed across retain workers

Batch operations

Send large payloads — up to the HTTP limit — in a single retain request. Hindsight automatically splits large batches (>10,000 tokens) into optimized sub-batches and processes them in parallel. You do not need to manually tune batch sizes.
# Automatic batch splitting for async retain
# Send entire documents or datasets in one call
# Hindsight handles sub-batch optimization and parallel processing
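
To get a feel for how much parallel work one large request produces, the sub-batch count can be estimated with a ceiling division. This is only a rough sketch: it assumes splitting is driven purely by token count, and `sub_batch_count` is a hypothetical helper, not part of Hindsight.

```shell
# Rough estimate of how many sub-batches a payload is split into.
# Assumption: splitting depends only on the token count of the payload.
sub_batch_count() {
  tokens=$1
  limit=${2:-10000}   # mirrors the HINDSIGHT_API_RETAIN_BATCH_TOKENS default
  # Integer ceiling division: round up so a partial sub-batch still counts.
  echo $(( (tokens + limit - 1) / limit ))
}

sub_batch_count 45000        # 45,000-token payload, default limit: prints 5
sub_batch_count 45000 5000   # same payload, smaller limit: prints 9
```

A lower `HINDSIGHT_API_RETAIN_BATCH_TOKENS` yields more, smaller sub-batches, so it also increases the number of LLM requests that can be in flight at once.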

Use async mode for large datasets

For large ingestion jobs, use async retain to queue operations in the background. Enable the Batch API for a 50% reduction in LLM fact-extraction costs:
export HINDSIGHT_API_RETAIN_BATCH_ENABLED=true
# Supported on OpenAI and Groq; results delivered within 24 hours

Tune concurrency for rate-limited providers

If your LLM provider has tight rate limits, reduce concurrent requests to avoid errors:
# Reduce concurrency for retain to stay within rate limits
export HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=3
export HINDSIGHT_API_RETAIN_LLM_INITIAL_BACKOFF=2.0
export HINDSIGHT_API_RETAIN_LLM_MAX_BACKOFF=120.0
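
To see what the two backoff values mean in practice, the sketch below prints a retry schedule. It assumes the delay doubles after each failed attempt and is capped at the maximum. The doubling factor is an assumption made for illustration, since the growth rule is not stated here, and `backoff_schedule` is a hypothetical helper.

```shell
# Print the wait before each retry.
# Assumption: the delay doubles per attempt and is capped at the maximum.
# Arguments: initial backoff (s), max backoff (s), number of retries.
backoff_schedule() {
  awk -v d="$1" -v max="$2" -v n="$3" 'BEGIN {
    for (i = 1; i <= n; i++) {
      if (d > max) d = max          # never wait longer than the cap
      printf "retry %d: wait %.1fs\n", i, d
      d *= 2                        # assumed growth factor
    }
  }'
}

backoff_schedule 2.0 120.0 8
```

Under that assumption, the values above reach the 120-second cap on the seventh retry, so a sustained rate-limit episode settles at one attempt every two minutes per request.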

Retain environment variables

| Variable | Description | Default |
| --- | --- | --- |
| HINDSIGHT_API_RETAIN_MAX_CONCURRENT | Max concurrent retain DB phases | 4 |
| HINDSIGHT_API_RETAIN_BATCH_ENABLED | Use LLM Batch API (50% lower cost, async only) | false |
| HINDSIGHT_API_RETAIN_BATCH_TOKENS | Max tokens per sub-batch for auto-splitting | 10000 |
| HINDSIGHT_API_RETAIN_CHUNK_SIZE | Max characters per chunk for fact extraction | 3000 |
| HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS | Max completion tokens for fact extraction | 64000 |

Recall performance

Recall is fast because all embeddings and indexes are pre-built. Typical query time for vector search on 100K+ facts is 10–50ms. Total recall latency (100–600ms) is dominated by the cross-encoder reranker running on CPU.

Retrieval budgets

The budget parameter controls search depth. Match the budget to query complexity:
| Budget | Items retrieved per method | Use case |
| --- | --- | --- |
| low | 100 | Quick lookups, real-time chat |
| mid | 300 (default) | Standard queries, balanced performance |
| high | 1000 | Comprehensive analysis, thorough research |
Configure budget thresholds with environment variables:
| Variable | Description | Default |
| --- | --- | --- |
| HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW | Items per method when budget=low | 100 |
| HINDSIGHT_API_RECALL_BUDGET_FIXED_MID | Items per method when budget=mid | 300 |
| HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH | Items per method when budget=high | 1000 |
For deployments where callers vary max_tokens, use the adaptive budget function to scale retrieval breadth automatically:
export HINDSIGHT_API_RECALL_BUDGET_FUNCTION=adaptive
# thinking_budget = round(max_tokens * ratio), clamped to [min, max]
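
The formula in the comment above can be worked through with a small helper. This is a sketch of the arithmetic only: the ratio, minimum, and maximum passed below are made-up illustration values, not Hindsight defaults, and `adaptive_budget` is a hypothetical function name.

```shell
# thinking_budget = round(max_tokens * ratio), clamped to [min, max].
# Arguments: max_tokens, ratio, min, max (the values used below are invented).
adaptive_budget() {
  awk -v t="$1" -v r="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    b = int(t * r + 0.5)   # round half up; inputs are non-negative
    if (b < lo) b = lo     # clamp to the lower bound
    if (b > hi) b = hi     # clamp to the upper bound
    print b
  }'
}

adaptive_budget 4096 0.1 100 1000    # prints 410
adaptive_budget 512 0.1 100 1000     # prints 100 (clamped up)
adaptive_budget 32768 0.1 100 1000   # prints 1000 (clamped down)
```

The clamp is what keeps very small or very large max_tokens values from producing a retrieval breadth that is uselessly narrow or needlessly expensive.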

Reduce reranker latency

The cross-encoder reranker is the main latency driver for recall. To speed it up, run the reranker on a TEI (Text Embeddings Inference) server with GPU hardware:
export HINDSIGHT_API_RERANKER_PROVIDER=tei
export HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

Recall environment variables

| Variable | Description | Default |
| --- | --- | --- |
| HINDSIGHT_API_RECALL_MAX_CONCURRENT | Max concurrent recall operations per worker | 32 |
| HINDSIGHT_API_RECALL_CONNECTION_BUDGET | Max concurrent DB connections per recall operation | 4 |
| HINDSIGHT_API_RERANKER_MAX_CANDIDATES | Max candidates to rerank per recall | 300 |
| HINDSIGHT_API_RECALL_MAX_QUERY_TOKENS | Max token length of a recall query (HTTP 400 if exceeded) | 500 |

Reflect performance

Reflect latency breaks down into two phases: memory search (100–600ms) and LLM generation (500–2000ms). Total end-to-end latency is typically 600–2600ms.
| Component | Latency | Optimization |
| --- | --- | --- |
| Memory search | 100–600ms | Lower budget, faster reranker |
| LLM generation | 500–2000ms | Faster provider or model |
| Total | 600–2600ms | Stream responses for perceived latency |

Optimization strategies

  • Use budget=low or budget=mid when the question doesn’t require exhaustive search
  • Provide relevant context in the reflect request to reduce the recall scope
  • Use a faster LLM provider for reflect while keeping a stronger model for retain
# Fast model for reflect, strong model for retain
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_MODEL=gpt-4o

export HINDSIGHT_API_REFLECT_LLM_PROVIDER=groq
export HINDSIGHT_API_REFLECT_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_REFLECT_LLM_MODEL=openai/gpt-oss-20b

Scaling

Horizontal scaling

Deploy multiple API instances behind a load balancer with a shared PostgreSQL database. Workers process background tasks (retain, consolidation) across all instances without coordination — the task queue is backed by PostgreSQL.
# Increase worker processes per instance
export HINDSIGHT_API_WORKERS=4

Distributed workers

For high-throughput ingestion, run dedicated worker processes separate from the API:
# Disable internal worker in the API process
export HINDSIGHT_API_WORKER_ENABLED=false

# Run a dedicated worker process
hindsight-worker
Worker slot configuration:
| Variable | Description | Default |
| --- | --- | --- |
| HINDSIGHT_API_WORKER_MAX_SLOTS | Max concurrent tasks per worker | 10 |
| HINDSIGHT_API_WORKER_CONSOLIDATION_MAX_SLOTS | Reserved slots for consolidation | 2 |
| HINDSIGHT_API_WORKER_RETAIN_MAX_SLOTS | Reserved slots for retain | 0 |
Per-operation slot values are reservations within WORKER_MAX_SLOTS, not additive pools. The remaining capacity forms a shared pool usable by any operation type.
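
As a quick check on a slot configuration, the size of the shared pool can be computed by subtracting the reservations from the total. The sketch below follows the description above; `shared_slots` is a hypothetical helper, and the second call uses an invented retain reservation.

```shell
# Shared pool left after per-operation reservations are carved out of
# WORKER_MAX_SLOTS. Arguments: max slots, consolidation slots, retain slots.
shared_slots() {
  echo $(( $1 - $2 - $3 ))
}

shared_slots 10 2 0   # documented defaults: prints 8
shared_slots 10 2 4   # hypothetical: reserve 4 slots for retain, prints 4
```

If the result is zero or negative, the reservations consume the whole worker and no capacity is left for other operation types, which is usually a sign the reservations should be lowered or WORKER_MAX_SLOTS raised.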

Read replicas

Use a read replica to offload recall queries from the primary database:
export HINDSIGHT_API_READ_DATABASE_URL=postgresql://user:pass@read-replica:5432/dbname
export HINDSIGHT_API_READ_DB_POOL_MAX_SIZE=100

Cost optimization

Use efficient models for retain

gpt-oss-20b via Groq handles fact extraction well at low cost. Reserve frontier models for reflect.

Enable Batch API

Set HINDSIGHT_API_RETAIN_BATCH_ENABLED=true with async retain for 50% cost savings on LLM calls (OpenAI and Groq).

Control token budgets

Use budget=low for simple queries. Set max_tokens to limit response size for recall and reflect.

Optimize chunk size

Larger chunks (1000–2000 tokens) are more efficient than many small ones because each document needs fewer LLM calls. Note that HINDSIGHT_API_RETAIN_CHUNK_SIZE is specified in characters, not tokens.
