Performance — tune retain throughput and recall speed

Hindsight is architected from the ground up to prioritize read performance over write performance. Memories are written once but read many times, so the system front-loads expensive work during retention to keep retrieval fast.

Design philosophy

All heavy lifting happens at write time so reads are cheap:

Pre-computed embeddings: Generated and indexed during retain, not retrieval
Fact extraction at write time: LLM-based extraction happens during retain
Structured memory graphs: Entity relationships and temporal information resolved upfront
Optimized vector indexes: HNSW or DiskANN for sub-50ms nearest-neighbor search

This means retain is slower by design, and that’s the right trade-off for memory systems where the read-to-write ratio is typically 10:1 or higher.

Typical latencies

Operation	Typical latency	Primary bottleneck
Recall	100–600ms	Cross-encoder reranker (CPU)
Reflect	600–2600ms	LLM generation
Retain	500–2000ms per batch	LLM fact extraction

Retain performance

Retain is slower because it involves LLM-based fact extraction, entity recognition, temporal reasoning, relationship mapping, and embedding generation. The LLM is the primary bottleneck.

Choose a high-throughput LLM provider

Hindsight’s fact extraction is structured and well-defined, so smaller models work well. The recommended model is gpt-oss-20b via Groq.

# Recommended: Groq with gpt-oss-20b
export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=openai/gpt-oss-20b

Alternatives for high throughput:

Self-hosted models on GPU clusters (vLLM, TGI)
OpenAI with Flex Processing (50% cost savings, variable latency)
Multiple API keys distributed across retain workers
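
The last alternative, spreading several API keys across retain workers, amounts to a round-robin assignment. The sketch below only illustrates that idea: `assign_keys`, the worker names, and the placeholder keys are hypothetical, and how a given deployment hands each worker its key is an assumption left to the operator.

```python
from itertools import cycle


def assign_keys(worker_ids, api_keys):
    """Assign API keys to workers round-robin (hypothetical helper)."""
    if not api_keys:
        raise ValueError("at least one API key is required")
    key_cycle = cycle(api_keys)
    # Each worker takes the next key; keys repeat when workers outnumber keys.
    return {worker: next(key_cycle) for worker in worker_ids}


assignment = assign_keys(
    ["worker-0", "worker-1", "worker-2"],
    ["gsk_key_a", "gsk_key_b"],  # placeholder keys, not real credentials
)
print(assignment)
```

Each worker would then be started with its assigned key in `HINDSIGHT_API_LLM_API_KEY`, so that provider rate limits apply per key rather than to the whole fleet.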

Batch operations

Send large payloads — up to the HTTP limit — in a single retain request. Hindsight automatically splits large batches (>10,000 tokens) into optimized sub-batches and processes them in parallel. You do not need to manually tune batch sizes.

# Automatic batch splitting for async retain
# Send entire documents or datasets in one call
# Hindsight handles sub-batch optimization and parallel processing

Use async mode for large datasets

For large ingestion jobs, use async retain to queue operations in the background. Enable the Batch API for a 50% reduction in LLM fact-extraction costs:

export HINDSIGHT_API_RETAIN_BATCH_ENABLED=true
# Supported on OpenAI and Groq; results delivered within 24 hours

Tune concurrency for rate-limited providers

If your LLM provider has tight rate limits, reduce concurrent requests to avoid errors:

# Reduce concurrency for retain to stay within rate limits
export HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=3
export HINDSIGHT_API_RETAIN_LLM_INITIAL_BACKOFF=2.0
export HINDSIGHT_API_RETAIN_LLM_MAX_BACKOFF=120.0
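
As a rough guide to what the two backoff settings mean, the sketch below lists retry waits that start at the initial backoff and stop growing at the maximum. The doubling factor and the absence of jitter are assumptions made for illustration; the actual retry policy may differ, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(initial=2.0, maximum=120.0, attempts=8, factor=2.0):
    """Return the wait in seconds before each retry, capped at maximum."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))
        delay *= factor  # assumed exponential growth; the real factor may differ
    return delays


print(backoff_delays())
# [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0, 120.0]
```

Under these assumptions, a request that keeps hitting rate limits waits about six minutes in total across eight retries.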

Retain environment variables

Variable	Description	Default
`HINDSIGHT_API_RETAIN_MAX_CONCURRENT`	Max concurrent retain DB phases	`4`
`HINDSIGHT_API_RETAIN_BATCH_ENABLED`	Use LLM Batch API (50% cost savings, async only)	`false`
`HINDSIGHT_API_RETAIN_BATCH_TOKENS`	Max tokens per sub-batch for auto-splitting	`10000`
`HINDSIGHT_API_RETAIN_CHUNK_SIZE`	Max characters per chunk for fact extraction	`3000`
`HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS`	Max completion tokens for fact extraction	`64000`

Recall performance

Recall is fast because all embeddings and indexes are pre-built. Typical query time for vector search on 100K+ facts is 10–50ms. Total recall latency (100–600ms) is dominated by the cross-encoder reranker running on CPU.

Retrieval budgets

The budget parameter controls search depth. Match the budget to query complexity:

Budget	Items retrieved per method	Use case
`low`	100	Quick lookups, real-time chat
`mid`	300 (default)	Standard queries, balanced performance
`high`	1000	Comprehensive analysis, thorough research

Configure budget thresholds with environment variables:

Variable	Description	Default
`HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW`	Items per method when `budget=low`	`100`
`HINDSIGHT_API_RECALL_BUDGET_FIXED_MID`	Items per method when `budget=mid`	`300`
`HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH`	Items per method when `budget=high`	`1000`
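
The relationship between a budget level, its environment variable, and its default can be sketched as a small lookup. The variable names and defaults come from the table above; `items_per_method` itself is a hypothetical helper, and the fallback behavior is an assumption about how such overrides typically work.

```python
import os

# Defaults taken from the table above.
DEFAULT_ITEMS = {"low": 100, "mid": 300, "high": 1000}


def items_per_method(budget="mid", env=None):
    """Resolve items retrieved per method for a budget level."""
    env = os.environ if env is None else env
    if budget not in DEFAULT_ITEMS:
        raise ValueError(f"unknown budget: {budget!r}")
    name = f"HINDSIGHT_API_RECALL_BUDGET_FIXED_{budget.upper()}"
    # Use the override when set, otherwise fall back to the documented default.
    return int(env.get(name, DEFAULT_ITEMS[budget]))


print(items_per_method("low", env={}))  # 100
print(items_per_method("high", env={"HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH": "500"}))  # 500
```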

For deployments where callers vary max_tokens, use the adaptive budget function to scale retrieval breadth automatically:

export HINDSIGHT_API_RECALL_BUDGET_FUNCTION=adaptive
# thinking_budget = round(max_tokens * ratio), clamped to [min, max]
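
The formula in the comment above can be worked through directly. In this sketch the `ratio`, `minimum`, and `maximum` values are illustrative only, not Hindsight defaults, and `adaptive_budget` is a hypothetical function name.

```python
def adaptive_budget(max_tokens, ratio, minimum, maximum):
    """Compute round(max_tokens * ratio), clamped to [minimum, maximum]."""
    raw = round(max_tokens * ratio)
    return max(minimum, min(maximum, raw))


# Illustrative parameters; the real ratio and bounds are configuration-dependent.
for max_tokens in (500, 4096, 20000):
    print(max_tokens, adaptive_budget(max_tokens, ratio=0.1, minimum=100, maximum=1000))
# 500 100
# 4096 410
# 20000 1000
```

Small `max_tokens` values are lifted to the minimum and large ones are capped at the maximum, so only mid-range requests scale linearly.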

Reduce reranker latency

The cross-encoder reranker is the main latency driver for recall. Options to speed it up:

GPU acceleration

Use a TEI server with GPU hardware for high-performance reranking:

export HINDSIGHT_API_RERANKER_PROVIDER=tei
export HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

Optimize local reranker

Enable FP16 inference and bucket batching for significant speedups when reranking locally:

export HINDSIGHT_API_RERANKER_LOCAL_FP16=true           # 27–36% faster on MPS
export HINDSIGHT_API_RERANKER_LOCAL_BUCKET_BATCHING=true # 36–54% faster across models

Lightweight reranker

For resource-constrained environments, use FlashRank or RRF-only:

export HINDSIGHT_API_RERANKER_PROVIDER=flashrank  # Fast ONNX-based reranking
# or
export HINDSIGHT_API_RERANKER_PROVIDER=rrf        # No neural reranking

Cloud reranker

Offload to a cloud provider for consistent latency without local hardware:

export HINDSIGHT_API_RERANKER_PROVIDER=cohere
export HINDSIGHT_API_RERANKER_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-english-v3.0
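
For context on the `rrf` setting, which skips neural reranking: assuming RRF here refers to reciprocal rank fusion, the sketch below shows the textbook form of that technique, where each result scores 1 / (k + rank) in every ranked list it appears in. The constant `k=60` is the commonly used value, and the list names and memory IDs are hypothetical. This is a sketch of the general method, not Hindsight's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["m1", "m2", "m3"]  # hypothetical results from one retrieval method
keyword = ["m3", "m1", "m4"]   # hypothetical results from another
print(reciprocal_rank_fusion([semantic, keyword]))
# ['m1', 'm3', 'm2', 'm4']
```

Fusion needs only rank positions, which is why it is much cheaper than a cross-encoder pass, at the cost of not scoring each candidate against the query text.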

Recall environment variables

Variable	Description	Default
`HINDSIGHT_API_RECALL_MAX_CONCURRENT`	Max concurrent recall operations per worker	`32`
`HINDSIGHT_API_RECALL_CONNECTION_BUDGET`	Max concurrent DB connections per recall operation	`4`
`HINDSIGHT_API_RERANKER_MAX_CANDIDATES`	Max candidates to rerank per recall	`300`
`HINDSIGHT_API_RECALL_MAX_QUERY_TOKENS`	Max token length of a recall query (HTTP 400 if exceeded)	`500`

Reflect performance

Reflect latency breaks down into two phases: memory search (100–600ms) and LLM generation (500–2000ms). Total end-to-end latency is typically 600–2600ms.

Component	Latency	Optimization
Memory search	100–600ms	Lower budget, faster reranker
LLM generation	500–2000ms	Faster provider or model
Total	600–2600ms	Stream responses for perceived latency
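
The total row is the component-wise sum of the two phases, which a few lines of arithmetic confirm. `combine_ranges` is a hypothetical helper used only for this check.

```python
def combine_ranges(*ranges):
    """Add (low, high) latency ranges in milliseconds component-wise."""
    low = sum(r[0] for r in ranges)
    high = sum(r[1] for r in ranges)
    return low, high


memory_search_ms = (100, 600)
llm_generation_ms = (500, 2000)
print(combine_ranges(memory_search_ms, llm_generation_ms))
# (600, 2600)
```

Because LLM generation accounts for most of the upper bound, a faster model usually moves the total more than a lower retrieval budget does.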

Optimization strategies

Use budget=low or budget=mid when the question doesn’t require exhaustive search
Provide relevant context in the reflect request to reduce the recall scope
Use a faster LLM provider for reflect while keeping a stronger model for retain

# Fast model for reflect, strong model for retain
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_MODEL=gpt-4o

export HINDSIGHT_API_REFLECT_LLM_PROVIDER=groq
export HINDSIGHT_API_REFLECT_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_REFLECT_LLM_MODEL=openai/gpt-oss-20b

Scaling

Horizontal scaling

Deploy multiple API instances behind a load balancer with a shared PostgreSQL database. Workers on every instance pull background tasks (retain, consolidation) from a task queue backed by PostgreSQL, so they need no additional coordination.

# Increase worker processes per instance
export HINDSIGHT_API_WORKERS=4

Distributed workers

For high-throughput ingestion, run dedicated worker processes separate from the API:

# Disable internal worker in the API process
export HINDSIGHT_API_WORKER_ENABLED=false

# Run a dedicated worker process
hindsight-worker

Worker slot configuration:

Variable	Description	Default
`HINDSIGHT_API_WORKER_MAX_SLOTS`	Max concurrent tasks per worker	`10`
`HINDSIGHT_API_WORKER_CONSOLIDATION_MAX_SLOTS`	Reserved slots for consolidation	`2`
`HINDSIGHT_API_WORKER_RETAIN_MAX_SLOTS`	Reserved slots for retain	`0`

Per-operation slot values are reservations within WORKER_MAX_SLOTS, not additive pools. The remaining capacity forms a shared pool usable by any operation type.
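
A short worked example of the reservation rule, using the defaults from the table above: `shared_pool` is a hypothetical helper, and rejecting reservations that exceed the total is an assumption about sensible validation rather than documented behavior.

```python
def shared_pool(max_slots=10, reservations=None):
    """Slots left for any task type after per-operation reservations."""
    if reservations is None:
        reservations = {"consolidation": 2, "retain": 0}  # documented defaults
    reserved = sum(reservations.values())
    if reserved > max_slots:
        # Assumed validation: reservations are carved out of max_slots.
        raise ValueError("reservations exceed WORKER_MAX_SLOTS")
    return max_slots - reserved


print(shared_pool())  # 8
print(shared_pool(10, {"consolidation": 2, "retain": 4}))  # 4
```

With the defaults, 2 of the 10 slots are held for consolidation and the other 8 can run any task type. Reserving 4 slots for retain shrinks the shared pool to 4 without raising the total.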

Read replicas

Use a read replica to offload recall queries from the primary database:

export HINDSIGHT_API_READ_DATABASE_URL=postgresql://user:pass@read-replica:5432/dbname
export HINDSIGHT_API_READ_DB_POOL_MAX_SIZE=100

Cost optimization

Use efficient models for retain

gpt-oss-20b via Groq handles fact extraction well at low cost. Reserve frontier models for reflect, and only where answer quality matters more than latency.

Enable Batch API

Set HINDSIGHT_API_RETAIN_BATCH_ENABLED=true with async retain for 50% cost savings on LLM calls (OpenAI and Groq).

Control token budgets

Use budget=low for simple queries. Set max_tokens to limit response size for recall and reflect.

Optimize chunk size

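Larger chunks (1000–2000 tokens) are more efficient than many small ones because each document needs fewer LLM calls. Note that HINDSIGHT_API_RETAIN_CHUNK_SIZE is specified in characters (default 3000), not tokens.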
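
The effect of chunk size on call volume is simple arithmetic. The sketch assumes fixed-size chunks measured in characters and one extraction call per chunk; a real chunker may split on sentence or paragraph boundaries, so treat the numbers as a lower bound. `chunk_count` is a hypothetical helper.

```python
import math


def chunk_count(document_chars, chunk_size=3000):
    """Minimum number of chunks needed to cover a document."""
    return math.ceil(document_chars / chunk_size)


# A 30,000-character document at three different chunk sizes.
for size in (1000, 3000, 6000):
    print(size, chunk_count(30_000, chunk_size=size))
# 1000 30
# 3000 10
# 6000 5
```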
