
Hardware recommendations

Deployment size   Users       CPU         RAM      Disk
Small             1–20        4 cores     16 GB    100 GB SSD
Medium            20–200      8 cores     32 GB    500 GB SSD
Large             200–1000+   16+ cores   64+ GB   1+ TB SSD
Vespa (the vector search index) and the embedding model server are the dominant memory consumers. Budget at least 8 GB for Vespa and 4–8 GB for the model server alone.
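If you deploy with Docker Compose, those memory budgets can be pinned as per-service ceilings. A minimal sketch — the service names `index` and `inference_model_server` are assumptions; match them to the names in your actual compose file:

```yaml
services:
  index:                      # Vespa vector index
    deploy:
      resources:
        limits:
          memory: 12g         # at least 8 GB for Vespa, per the guidance above
  inference_model_server:
    deploy:
      resources:
        limits:
          memory: 8g          # 4–8 GB for the embedding model server
```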

Resource-intensive services

Vespa (index)

Holds all document embeddings in memory for fast search. Requires the most RAM — plan 1–2 GB per million document chunks.
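That sizing rule translates into a quick back-of-the-envelope calculation. A sketch, using only the 1–2 GB per million chunks figure stated above:

```python
def vespa_ram_estimate_gb(num_chunks: int, gb_per_million: float = 2.0) -> float:
    """Rough Vespa RAM budget: 1-2 GB per million document chunks.

    Defaults to the conservative (2 GB) end of the range.
    """
    return num_chunks / 1_000_000 * gb_per_million

# e.g. 5 million chunks at the conservative end of the range:
print(vespa_ram_estimate_gb(5_000_000))  # 10.0
```

Remember this covers Vespa alone; add the model server and OS overhead on top when sizing the machine.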

model_server

Runs the embedding model (and optionally a local LLM). GPU acceleration dramatically speeds up indexing. CPU-only is fine for small deployments.

background (Celery)

Handles all connector syncing and document processing. Concurrency is tunable via env vars.

api_server

Generally lightweight. Scales horizontally if needed (requires shared Postgres/Redis).

Tuning Celery worker concurrency

Set these in your .env file:
Variable                                   Default            Description
CELERY_WORKER_DOCFETCHING_CONCURRENCY      1                  Threads fetching documents from connectors
CELERY_WORKER_DOCPROCESSING_CONCURRENCY    6                  Threads processing and indexing fetched documents
CELERY_WORKER_LIGHT_CONCURRENCY            (system default)   Threads for lightweight tasks (permission sync, etc.)
Increase these on machines with more CPU cores to speed up bulk indexing. Keep DOCPROCESSING in proportion to the model server's capacity — pushing more embedding requests than it can absorb just causes queuing.
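For example, on a machine with plenty of cores and a GPU-backed model server, you might raise both pools. The values below are illustrative, not recommendations:

```bash
# .env — Celery worker concurrency (variables from the table above)
CELERY_WORKER_DOCFETCHING_CONCURRENCY=2
CELERY_WORKER_DOCPROCESSING_CONCURRENCY=12
```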

Embedding model tradeoffs

Choose your embedding model in Admin → Embeddings:
Model                             Speed    Accuracy   Notes
nomic-embed-text                  Fast     Good       Good default for most deployments
cohere-embed-english-v3.0         Medium   Great      Requires Cohere API key
text-embedding-3-large            Medium   Great      Requires OpenAI API key
Local models (via model_server)   Varies   Good      Air-gapped deployments
Changing the embedding model requires re-indexing all documents. Plan for downtime or run a parallel index before switching.

Docker Compose vs Kubernetes

Choose Docker Compose if:
  • Your team is under ~200 users
  • You run a single-server deployment
  • You want simpler ops with less Kubernetes expertise required
  • You want the fastest path to production

Choose Kubernetes if:
  • You need horizontal scaling of api_server or background workers
  • You require zero-downtime rolling deployments
  • You run multi-tenant Enterprise Edition deployments
  • You already operate a Kubernetes cluster for other workloads

Caching

Onyx uses Redis for caching LLM provider configs, user sessions, and feature flags. The default redis:7.4-alpine container works for most deployments. For high-traffic installations, consider:
  • Increasing REDIS_MAXMEMORY (default is 25% of system RAM)
  • Using a managed Redis service (AWS ElastiCache, GCP Memorystore) for reliability
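As an example of the first option, a fixed ceiling can be set in the same .env file. Treat the size format as an assumption — check what your Onyx version accepts for REDIS_MAXMEMORY before relying on it:

```bash
# .env — give Redis a fixed memory ceiling instead of the 25%-of-RAM default
REDIS_MAXMEMORY=2gb
```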

Connector sync frequency

Connector sync intervals are configurable per connector from the Admin UI. For large knowledge bases:
  • Set high-frequency syncs (every 10 min) only for connectors with rapidly changing content (Slack, email)
  • Weekly or daily syncs are sufficient for slower-moving sources (Confluence, Notion, Google Drive)
  • Stagger sync schedules across connectors to avoid spikes in Celery queue depth
