Supported LLM, embedding, and reranker models

Hindsight uses three categories of machine learning models. LLMs handle fact extraction, reasoning, and generation. Embedding models convert text to vectors for semantic search. Cross-encoders rerank search results to improve precision.

LLM

Fact extraction, entity resolution, mental model consolidation, and answer synthesis. Fully configurable — 20+ providers supported.

Embeddings

Vector representations for semantic similarity search. Default: BAAI/bge-small-en-v1.5 (384 dimensions).

Reranker

Reranks initial search results to improve precision. Default: cross-encoder/ms-marco-MiniLM-L-6-v2.

Embedding and cross-encoder models are downloaded automatically from HuggingFace on first run.

LLM providers

Hindsight supports over 20 LLM providers out of the box, plus any OpenAI-compatible API and 100+ providers via LiteLLM.

Provider	Value	Notes
OpenAI	`openai`	Includes Flex Processing support
Anthropic	`anthropic`	Claude model family
Gemini	`gemini`	Google Gemini API
Groq	`groq`	High-throughput inference — recommended for retain
Ollama	`ollama`	Local inference server
LM Studio	`lmstudio`	Local GUI-based inference
llama.cpp	`llamacpp`	Built-in managed subprocess, no external server
Vertex AI	`vertexai`	Google Cloud, native GenAI SDK
AWS Bedrock	`bedrock`	Uses AWS credentials, no API key required
LiteLLM	`litellm`	100+ providers via LiteLLM proxy
LiteLLM Router	`litellmrouter`	Fallback chains, load-balancing, per-deployment limits
DeepSeek	`deepseek`	OpenAI-compatible, includes thinking mode
MiniMax	`minimax`	1M context window
OpenRouter	`openrouter`	Access 100+ models via one API key
z.ai	`zai`	Zhipu GLM series
Volcano Engine	`volcano`	ByteDance Doubao models
opencode-go	`opencode-go`	OpenAI-compatible
OpenAI Codex	`openai-codex`	Uses ChatGPT Plus/Pro OAuth (personal dev only)
Claude Code	`claude-code`	Uses Claude Pro/Max OAuth (personal dev only)
None	`none`	Disable LLM — semantic search only

Hindsight works with any provider exposing an OpenAI-compatible API. Set HINDSIGHT_API_LLM_PROVIDER=openai and point HINDSIGHT_API_LLM_BASE_URL at your provider’s endpoint (Azure OpenAI, Together AI, Fireworks, etc.).
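As a rough sketch of the OpenAI-compatible setup described above, the following points Hindsight at Together AI. The base URL is Together AI's public OpenAI-compatible endpoint; the API key and model ID are placeholders assumed for illustration, so substitute values your provider actually accepts.

```shell
# Use the OpenAI provider implementation against a third-party endpoint.
export HINDSIGHT_API_LLM_PROVIDER=openai
# Assumed endpoint: Together AI's OpenAI-compatible base URL.
export HINDSIGHT_API_LLM_BASE_URL=https://api.together.xyz/v1
# Placeholder key: use the key issued by your provider.
export HINDSIGHT_API_LLM_API_KEY=xxxxxxxxxxxx
# Placeholder model ID: use one your provider actually serves.
export HINDSIGHT_API_LLM_MODEL=your-model-id
```

The same pattern should apply to other OpenAI-compatible services: only the base URL, key, and model ID change.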

Benchmarks

Not sure which model to use? The Model Leaderboard benchmarks models across accuracy, speed, cost, and reliability for retain, reflect, and observation consolidation.

Tested models

The following models have been verified to work correctly with Hindsight:

Provider	Model
OpenAI	`gpt-5.2`
OpenAI	`gpt-5`
OpenAI	`gpt-5-mini`
OpenAI	`gpt-5-nano`
OpenAI	`gpt-4.1-mini`
OpenAI	`gpt-4.1-nano`
OpenAI	`gpt-4o-mini`
Anthropic	`claude-sonnet-4-20250514`
Anthropic	`claude-3-5-sonnet-20241022`
Gemini	`gemini-3-pro-preview`
Gemini	`gemini-2.5-flash`
Gemini	`gemini-2.5-flash-lite`
Groq	`openai/gpt-oss-120b`
Groq	`openai/gpt-oss-20b`

Provider default models

When HINDSIGHT_API_LLM_MODEL is not set, Hindsight falls back to a provider-specific default model, so setting only the provider and its credentials is enough to get started:

# Uses claude-haiku-4-5-20251001 automatically
export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx

You can always override the default by setting HINDSIGHT_API_LLM_MODEL explicitly:

export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-5-20250929

Configuration examples

For example, to run Groq with the tested openai/gpt-oss-20b model:

export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=openai/gpt-oss-20b

Models with limited output tokens

For reliable fact extraction, Hindsight expects the model to support at least 65,000 output tokens. For models with a lower output limit, reduce the retain completion token limit to match:

# For models supporting 32k output tokens
export HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=32000

# For models supporting 16k output tokens
export HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=16000

HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS must be greater than HINDSIGHT_API_RETAIN_CHUNK_SIZE (default: 3000).
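The constraint above can be checked before starting the server. The following is a minimal sketch, not part of Hindsight: check_retain_limits is a hypothetical helper, and the 65000 fallback is an assumption taken from the minimum stated earlier in this section rather than a confirmed default.

```shell
# check_retain_limits: hypothetical pre-flight helper (not a Hindsight command).
# Succeeds only if the completion token limit is strictly greater than
# the chunk size.
check_retain_limits() {
  # Assumed fallback of 65000 (the minimum stated above); 3000 is the
  # documented default chunk size.
  local max_tokens="${HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS:-65000}"
  local chunk_size="${HINDSIGHT_API_RETAIN_CHUNK_SIZE:-3000}"

  if [ "$max_tokens" -gt "$chunk_size" ]; then
    echo "ok: max completion tokens ($max_tokens) > chunk size ($chunk_size)"
  else
    echo "error: max completion tokens ($max_tokens) must be greater than chunk size ($chunk_size)" >&2
    return 1
  fi
}

check_retain_limits
```

Running the check in a startup script makes a misconfigured limit fail early instead of surfacing later during a retain call.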

OpenAI Codex setup (ChatGPT Plus/Pro)

Use your ChatGPT Plus or Pro subscription without separate API costs. For personal development only.

Install Codex CLI

npm install -g @openai/codex

codex auth login

This opens a browser window to authenticate and saves OAuth tokens to ~/.codex/auth.json.
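Before configuring Hindsight, it can help to confirm that the login step actually wrote the token file. A minimal sketch follows; codex_auth_present is a hypothetical helper name used only here, and it checks nothing beyond the file existing and being non-empty.

```shell
# codex_auth_present: hypothetical helper, not part of Hindsight or Codex.
# Succeeds only if the token file exists and is non-empty.
# Takes an optional path; defaults to the location mentioned above.
codex_auth_present() {
  auth_file="${1:-$HOME/.codex/auth.json}"
  [ -s "$auth_file" ]
}

if codex_auth_present; then
  echo "Codex credentials found"
else
  echo "No Codex credentials found; run the login step first" >&2
fi
```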

Configure Hindsight

export HINDSIGHT_API_LLM_PROVIDER=openai-codex
# No API key needed — reads from ~/.codex/auth.json automatically

Claude Code setup (Claude Pro/Max)

Use your Claude Pro or Max subscription without separate API costs. For personal development only.

This integration uses your personal Claude credentials. Do not use it in production or shared environments. In production, set HINDSIGHT_API_LLM_PROVIDER=anthropic and supply an API key instead.

Install Claude Code CLI

npm install -g @anthropic-ai/claude-code

claude auth login

Configure Hindsight

export HINDSIGHT_API_LLM_PROVIDER=claude-code
# No API key needed — uses claude auth login credentials

Embedding models

Embedding models convert text into dense vector representations for semantic similarity search. Default: BAAI/bge-small-en-v1.5 (384 dimensions, ~130 MB).

Supported providers

Provider	Description	Best for
`local`	SentenceTransformers (default)	Development, low latency
`openai`	OpenAI embeddings API	Production, high quality
`cohere`	Cohere embeddings API	Production, multilingual
`google`	Gemini API or Vertex AI	Production, high quality
`tei`	HuggingFace Text Embeddings Inference	Production, self-hosted
`litellm`	LiteLLM proxy (unified gateway)	Multi-provider setups
`litellm-sdk`	LiteLLM SDK (direct API, no proxy)	Simpler multi-provider setup

Model reference

Local

Model	Dimensions	Use case
`BAAI/bge-small-en-v1.5`	384	Default, fast, good quality
`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`	384	Multilingual (50+ languages)

OpenAI

Model	Dimensions	Use case
`text-embedding-3-small`	1536	Default OpenAI, cost-effective
`text-embedding-3-large`	3072	Higher quality, more expensive
`text-embedding-ada-002`	1536	Legacy

Google

Model	Dimensions	Use case
`gemini-embedding-001`	768 (configurable)	General purpose

Supports configurable output dimensionality via HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY. Recommended values: 768, 1536, 3072.

Cohere

Model	Dimensions	Use case
`embed-english-v3.0`	1024	English text
`embed-multilingual-v3.0`	1024	100+ languages
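To illustrate the configurable Gemini output dimensionality mentioned above, the line below requests 1536-dimensional vectors, one of the recommended values. This is a sketch covering only that one variable; selecting the google embeddings provider itself is configured separately and is not shown here.

```shell
# Request 1536-dimensional vectors from gemini-embedding-001
# (recommended values per this page: 768, 1536, 3072).
export HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY=1536
```

Vectors of different dimensionality are not comparable with each other, so it is safest to choose a value before ingesting any data.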

Reranker models

The reranker improves precision by scoring the top candidates retrieved by semantic and keyword search. Default: cross-encoder/ms-marco-MiniLM-L-6-v2 (~85 MB).

Supported providers

Provider	Description	Best for
`local`	SentenceTransformers CrossEncoder (default)	Development, low latency
`cohere`	Cohere Rerank API	Production, high quality
`zeroentropy`	ZeroEntropy zerank-2	Production, state-of-the-art accuracy
`siliconflow`	SiliconFlow Cohere-compatible endpoint	China region, SiliconFlow platform
`tei`	HuggingFace Text Embeddings Inference	Production, self-hosted
`flashrank`	FlashRank ONNX (lightweight, fast)	Resource-constrained environments
`litellm`	LiteLLM proxy	Multi-provider setups
`litellm-sdk`	LiteLLM SDK (direct API, no proxy)	Simpler multi-provider setup
`jina-mlx`	Jina reranker v3, MLX (Apple Silicon)	macOS with Apple Silicon
`rrf`	RRF-only (no neural reranking)	Testing, minimal resources

Model reference

Local

Model	Use case
`cross-encoder/ms-marco-MiniLM-L-6-v2`	Default, fast
`cross-encoder/ms-marco-MiniLM-L-12-v2`	Higher accuracy
`cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`	Multilingual

Cohere

Model	Use case
`rerank-english-v3.0`	English text
`rerank-multilingual-v3.0`	100+ languages

ZeroEntropy

Model	Use case
`zerank-2`	Flagship multilingual reranker (default)
`zerank-2-small`	Faster, lighter variant

SiliconFlow

Model	Use case
`BAAI/bge-reranker-v2-m3`	Multilingual, strong default
`Qwen/Qwen3-Reranker-8B`	Larger, higher accuracy

For full configuration options including Azure-hosted endpoints and batch settings, see the Configuration reference.
