When you call recall(), Hindsight runs four retrieval strategies simultaneously against your memory bank, merges their results with Reciprocal Rank Fusion (RRF), and reranks the top candidates with a cross-encoder that evaluates each query-memory pair directly. The result is a ranked list of memories trimmed to fit your token budget — ready to pass directly to an LLM.

Four retrieval strategies

No single search method handles all the ways you might query a memory bank. Hindsight runs all four strategies in parallel and combines their results.

| Strategy | What it finds | Best for |
| --- | --- | --- |
| Semantic | Conceptual matches by meaning | “Alice’s job” → “Alice works as a software engineer” |
| Keyword (BM25) | Exact terms and proper nouns | “Google”, “Alice Chen”, “PostgreSQL” |
| Graph | Entities connected through the knowledge graph | Indirect relationships, multi-hop traversal |
| Temporal | Facts tied to specific times | “last spring”, date ranges, before/after queries |

Semantic search

The semantic strategy embeds the query and finds memories with similar meaning, even when the exact words differ. It handles paraphrasing, synonyms, and conceptual questions: “Bob’s expertise” matches “Bob specializes in machine learning.”
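To make the idea of embedding-based matching concrete, here is a minimal sketch that ranks memories by cosine similarity. The three-dimensional vectors, the memory ids, and the choice of cosine similarity are all illustrative assumptions; this page does not specify which embedding model or distance metric Hindsight uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, memory_vecs):
    """Return (memory_id, similarity) pairs, most similar first."""
    scored = [
        (mem_id, cosine_similarity(query_vec, vec))
        for mem_id, vec in memory_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
query_vec = [0.9, 0.1, 0.0]  # stands in for "Bob's expertise"
memory_vecs = {
    "bob-ml": [0.8, 0.2, 0.1],        # "Bob specializes in machine learning"
    "alice-hiking": [0.0, 0.1, 0.9],  # an unrelated memory
}
ranking = rank_by_similarity(query_vec, memory_vecs)
```

In this toy setup the memory whose vector points in nearly the same direction as the query ranks first, even though the two texts share no exact words.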

Keyword search (BM25)

The keyword strategy uses BM25 full-text search to find memories that contain specific terms. It is essential for proper nouns, technical identifiers, and unique phrases that semantic search might miss because they are lexically distinct from the query.
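For readers unfamiliar with BM25, the sketch below scores a handful of memories against a keyword query using the standard Okapi BM25 formula. The whitespace tokenizer and the parameter values `k1=1.5` and `b=0.75` are common defaults chosen for illustration, not settings confirmed for Hindsight.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


memories = [
    "Alice Chen joined Google in 2023",
    "Bob specializes in machine learning",
    "The team migrated the service to PostgreSQL",
]
scores = bm25_scores("PostgreSQL", memories)
```

Only the memory that literally contains the term receives a non-zero score, which is the behavior that makes keyword search a useful complement to embedding-based matching.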

Graph traversal

The graph strategy follows entity and causal connections to surface memories that are structurally related to the query rather than textually similar. It can traverse multiple hops: a query about Alice can surface her manager’s decisions by following Alice → team → manager → decisions. Graph scoring combines three signals additively for each candidate:

| Signal | What it rewards |
| --- | --- |
| Entity overlap | Shared named entities between query and memory |
| Semantic link | Pre-computed similarity links in the knowledge graph |
| Causal link | Explicit cause-effect relationships |
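The additive combination of the three signals can be sketched as follows. The `GraphSignals` type, the value ranges, and the equal weighting are hypothetical; the page states only that the signals are combined additively.

```python
from dataclasses import dataclass


@dataclass
class GraphSignals:
    """Per-candidate signal strengths (hypothetical scale, assumed non-negative)."""
    entity_overlap: float  # shared named entities between query and memory
    semantic_link: float   # pre-computed similarity link strength
    causal_link: float     # explicit cause-effect relationship strength


def graph_score(signals: GraphSignals) -> float:
    """Combine the three signals additively (equal weights are an assumption)."""
    return signals.entity_overlap + signals.semantic_link + signals.causal_link


candidates = {
    "mem-manager-decision": GraphSignals(0.5, 0.25, 1.0),
    "mem-unrelated": GraphSignals(0.0, 0.125, 0.0),
}
ranked = sorted(
    candidates,
    key=lambda mem_id: graph_score(candidates[mem_id]),
    reverse=True,
)
```

Because the signals add rather than multiply, a memory with a strong causal link can still rank well when it shares no entities with the query.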

Temporal search

The temporal strategy parses time expressions in the query (“last spring”, “in 2023”, “before Alice joined Google”) and filters or boosts memories based on when they occurred. It combines semantic understanding with date filtering so historical queries remain precise.
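As a rough illustration of the date-filtering half of this strategy, the sketch below keeps only memories whose `occurred_at` timestamp falls inside a half-open interval. The interval for “in 2023” is resolved by hand here; how Hindsight parses time expressions, and how it treats memories without a timestamp, are not described on this page, so dropping them is an assumption.

```python
from datetime import datetime, timezone


def filter_by_time(memories, start, end):
    """Keep memories with start <= occurred_at < end."""
    kept = []
    for memory in memories:
        timestamp = memory.get("occurred_at")
        if timestamp is None:
            continue  # assumption: undated memories are excluded
        occurred = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        if start <= occurred < end:
            kept.append(memory)
    return kept


# "in 2023", resolved manually into a UTC interval for this example.
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, tzinfo=timezone.utc)

memories = [
    {"id": "a", "occurred_at": "2023-04-02T09:00:00Z"},
    {"id": "b", "occurred_at": "2024-03-15T10:30:00Z"},
    {"id": "c"},  # no timestamp
]
matches = filter_by_time(memories, start, end)
```

A boosting variant would keep every memory and raise the scores of those inside the interval instead of discarding the rest.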

Fusion and reranking

1. RRF fusion

Results from all four strategies are merged using Reciprocal Rank Fusion. Each memory’s score is computed as the sum of 1 / (60 + rank) across all strategies where it appears. Memories that rank well in multiple strategies score higher — the system rewards consensus without needing scores to be on a comparable scale.
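The fusion rule described above can be sketched in a few lines. The strategy names and memory ids are made up, and ranks are assumed to start at 1; the constant 60 comes from the formula in the text.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Merge ranked id lists: score = sum of 1 / (k + rank) per strategy."""
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, mem_id in enumerate(ranked_ids, start=1):  # 1-based ranks (assumed)
            scores[mem_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical per-strategy results, best match first.
rankings = {
    "semantic": ["m1", "m2", "m3"],
    "keyword": ["m2", "m4"],
    "graph": ["m2", "m1"],
    "temporal": ["m5"],
}
fused = rrf_fuse(rankings)
```

Here `m2` appears in three lists and comes out on top, ahead of `m1`, which ranks first in one list and second in another. Only rank positions enter the formula, so the strategies' raw scores never need to be normalized against each other.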

2. Cross-encoder reranking

The top 300 candidates by RRF score are reranked by a cross-encoder that evaluates each query-memory pair together. This catches nuances that rank-based fusion misses — for example, a memory that ranked first in keyword search because it matched a common term but is actually irrelevant to the query’s intent.

3. Scoring boosts

The normalized cross-encoder score is multiplied by three small boosts: recency (memories from the last year score higher), temporal proximity (for time-specific queries), and proof count (observations backed by more evidence score slightly higher). Boosts are capped at ±10% so they nudge rankings without overriding relevance.
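One way to picture capped multiplicative boosts is the sketch below. It assumes each boost is expressed as a fractional adjustment and that the ±10% cap applies to each boost separately; the page does not say whether the cap is per boost or on the combined effect, and the function and parameter names are hypothetical.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_boosts(score, recency, temporal, proof, cap=0.10):
    """Multiply a normalized score by three boosts, each limited to +/- cap."""
    final = score
    for adjustment in (recency, temporal, proof):
        final *= 1.0 + clamp(adjustment, -cap, cap)
    return final


# A +5% recency boost, no temporal boost, and an oversized proof boost
# that is clamped from +50% down to +10%.
boosted = apply_boosts(0.80, recency=0.05, temporal=0.0, proof=0.5)
```

Under these assumptions the example score moves from 0.80 to roughly 0.924, so a strongly relevant memory cannot be overtaken by a weakly relevant one on boosts alone unless their base scores are already close.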

4. Token truncation

Results are sorted by final score and selected top-down until the max_tokens budget is exhausted. Only memory text counts toward the budget — metadata is free.
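The selection step can be sketched as a greedy fill. The whitespace-based `count_tokens` is a stand-in for a real tokenizer, and stopping at the first memory that does not fit (rather than skipping it and trying smaller ones) is an assumption about the behavior.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate_to_budget(memories, max_tokens):
    """Select memories by descending score until the budget would be exceeded."""
    selected = []
    used = 0
    for memory in sorted(memories, key=lambda m: m["score"], reverse=True):
        cost = count_tokens(memory["text"])  # only the text counts, not metadata
        if used + cost > max_tokens:
            break
        selected.append(memory)
        used += cost
    return selected, used


memories = [
    {"text": "Alice joined Google", "score": 0.50, "tags": ["user:alice-123"]},
    {"text": "Alice prefers Python", "score": 0.94, "tags": ["user:alice-123"]},
    {"text": "Alice values readable simple code", "score": 0.91, "tags": []},
]
selected, used = truncate_to_budget(memories, max_tokens=8)
```

With a budget of 8 toy tokens, the two highest-scoring memories fit exactly and the third is dropped; the `tags` field never contributes to the count.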

Parameters

query (string, required)
The question or search query. Can be a natural language question, a keyword, a time expression, or any combination.

budget (string, default: "mid")
Search depth. Controls how many candidates each strategy considers, how deep graph traversal goes, and how many candidates the cross-encoder reranks.

| Value | Recall budget | Best for |
| --- | --- | --- |
| low | 100 | Fast chatbot responses, simple lookups |
| mid | 300 | Most queries (balanced coverage and speed) |
| high | 1000 | Complex multi-hop queries, research tasks |

max_tokens (number, default: 4096)
Token budget for returned memory content. The pipeline fills this budget with the highest-scoring memories.

| Value | Approx. pages | Best for |
| --- | --- | --- |
| 2048 | ~2 pages | Focused answers, fast downstream LLM |
| 4096 | ~4 pages | Balanced context |
| 8192 | ~8 pages | Comprehensive summaries |

types (string[])
Filter by memory type. Accepts any combination of world, experience, observation. Omit to return all types.

metadata_filter (object)
Filter memories by metadata key-value pairs set during retain(). Only memories matching all specified fields are returned.

tags (string[])
Filter memories by visibility tags. Combined with tags_match to control whether all tags must match or any tag suffices.

include_chunks (boolean, default: false)
Return the raw source text that generated each memory alongside the distilled fact. Useful when verbatim quotes or additional nuance are needed.

max_chunk_tokens (number)
Token budget for chunk content when include_chunks is true. Applied independently of max_tokens.
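The matching rules for metadata_filter and tags can be pictured with the local sketch below. It does not call Hindsight; the helper names are hypothetical, and the `"any"` value for tags_match is an assumption inferred from the description above (only `"all"` appears in the examples on this page).

```python
def matches_metadata(memory_metadata, metadata_filter):
    """True when every key-value pair in the filter is present in the memory."""
    return all(
        key in memory_metadata and memory_metadata[key] == value
        for key, value in metadata_filter.items()
    )


def matches_tags(memory_tags, tags, tags_match):
    """Apply 'all' (every tag required) or 'any' (one tag suffices) matching."""
    wanted, present = set(tags), set(memory_tags)
    if tags_match == "all":
        return wanted <= present
    if tags_match == "any":  # assumed value name
        return bool(wanted & present)
    raise ValueError(f"unsupported tags_match: {tags_match!r}")


memory = {
    "metadata": {"source": "slack", "team": "infra"},
    "tags": ["user:alice-123", "team:infra"],
}

by_source = matches_metadata(memory["metadata"], {"source": "slack"})
all_tags = matches_tags(memory["tags"], ["user:alice-123", "team:web"], "all")
any_tag = matches_tags(memory["tags"], ["user:alice-123", "team:web"], "any")
```

In this example the metadata filter matches, the `"all"` check fails because `team:web` is missing, and the `"any"` check passes on `user:alice-123`.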

Code examples

```python
from hindsight_client import Hindsight

client = Hindsight(base_url="http://localhost:8888")

# Basic recall
result = client.recall(
    bank_id="my-agent",
    query="What programming languages does Alice prefer?",
)
for memory in result.memories:
    print(memory.text, memory.type)

# Recall with budget and token control
result = client.recall(
    bank_id="my-agent",
    query="What happened during Alice's onboarding last spring?",
    budget="high",
    max_tokens=8192,
    types=["world", "experience"],
)

# Recall with tag filtering
result = client.recall(
    bank_id="my-agent",
    query="What are this user's preferences?",
    tags=["user:alice-123"],
    tags_match="all",
    max_tokens=4096,
)

# Recall with source chunks
result = client.recall(
    bank_id="my-agent",
    query="What exactly did Alice say about the API design?",
    include_chunks=True,
    max_chunk_tokens=2048,
)
for memory in result.memories:
    print(memory.text)
    if memory.chunk:
        print("Source:", memory.chunk.text)
```

Response structure

The recall response includes the ranked memories and token usage metadata:
```json
{
  "memories": [
    {
      "id": "mem-123",
      "text": "Alice prefers Python over JavaScript for data science work",
      "type": "world",
      "score": 0.94,
      "occurred_at": "2024-03-15T10:30:00Z",
      "tags": ["user:alice-123"]
    },
    {
      "id": "obs-456",
      "text": "Alice is a Python-focused developer who values readability and simplicity",
      "type": "observation",
      "score": 0.91,
      "proof_count": 12
    }
  ],
  "usage": {
    "memory_tokens": 312,
    "total_tokens": 312
  }
}
```
Note: budget and max_tokens are independent controls. budget determines how thoroughly the bank is searched; max_tokens determines how much content is returned. A high budget with a low max_tokens means a deep search that returns only the best matches. A low budget with a high max_tokens means a fast search that returns everything it finds, up to the token limit.
