Connect Onyx to OpenAI, Anthropic, Azure OpenAI, Google Vertex AI, AWS Bedrock, Ollama, and any LiteLLM-compatible model.
Onyx routes all chat, search summarization, and fast inline tasks through one or more LLM providers that you configure. You can set different models for different roles—a powerful model for chat, a faster one for query understanding, and a locally hosted model for sensitive workloads.
1. Open LLM settings
In the Onyx admin panel, go to Settings → LLM Providers.
2. Choose a provider type
Click Add Provider and select a provider from the list. Onyx ships with built-in support for OpenAI, Anthropic, Azure OpenAI, Google Vertex AI, Amazon Bedrock, Ollama, LM Studio, OpenRouter, and LiteLLM Proxy.
3. Enter credentials
Fill in the API key (and any provider-specific fields such as base URL, region, or deployment name). Each provider section below lists exactly what is required.
4. Select models
After saving the provider, Onyx fetches the available model list. Mark the models you want to expose to users as visible, then set the default model for that provider.
5. Assign model roles
On the main LLM Providers page, designate which configured provider+model acts as the default chat model, the fast model (used for query rewriting and intent detection), and the embedding model (used during document indexing).
Only admins with access to Settings → LLM Providers can add or change providers. Changing the embedding model triggers a full reindex of all documents.
Onyx automatically fetches the full list of available chat completion models from OpenAI. Timestamped model variants are hidden by default to keep the selection list clean.
Available models include claude-opus-4-6, claude-sonnet-4-6, claude-opus-4-5, and claude-sonnet-4-5. Deprecated Claude 2 and Claude Instant models are excluded automatically.
The deployment name you enter must match exactly the deployment name in Azure AI Foundry. You can create multiple Onyx providers—one per deployment—to expose different Azure-deployed models.
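For reference, an Azure provider entry might look like the following. All values are illustrative placeholders, including the resource name, API version, and deployment name; substitute your own:

```
Base URL:        https://my-resource.openai.azure.com/
API Key:         <your Azure OpenAI key>
API Version:     2024-06-01
Deployment Name: gpt-4o-prod   (must match the deployment name in Azure AI Foundry)
```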
Onyx fetches the available Bedrock model list directly from AWS. This includes Anthropic Claude, Amazon Titan, Meta Llama, Mistral, and Cohere models that you have enabled in your AWS account.
If your Onyx deployment runs on EC2 or ECS with an IAM instance role, you can use IAM authentication instead of static keys. Set USE_IAM_AUTH=true in your environment and leave the key fields empty.
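For example, in the backend container's environment (only USE_IAM_AUTH comes from the text above; the comment describes standard AWS credential-chain behavior):

```
# .env for the Onyx backend container
USE_IAM_AUTH=true
# Leave the access key / secret key fields in the provider form empty;
# credentials are then resolved from the EC2/ECS instance role via the
# standard AWS credential chain.
```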
Ollama lets you run open-weight models locally. Onyx queries your Ollama server for its list of pulled models.

Example configuration:

Base URL: http://ollama-host:11434
Pull models on your Ollama server before adding the provider in Onyx:
ollama pull llama3.3
ollama pull nomic-embed-text
After saving the provider, Onyx will automatically list all models that Ollama reports as available.
The Ollama server must be network-accessible from the Onyx backend container. If both run on the same host via Docker Compose, use the Docker gateway IP or the service name defined in your compose file.
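A minimal Docker Compose sketch of that wiring, with illustrative service and image names (only the Ollama image and its default port 11434 are standard):

```yaml
services:
  onyx-backend:
    image: onyxdotapp/onyx-backend   # illustrative; use your actual Onyx image
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
```

With this layout, the Base URL in the Onyx provider form would be http://ollama:11434, since Compose resolves service names on the shared default network.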
If you run a self-managed LiteLLM proxy in front of multiple providers, point Onyx at it. Onyx will query the proxy for its model list.

Example configuration:

Base URL: http://litellm-proxy:4000
API Key: sk-...
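On the proxy side, a LiteLLM config.yaml along these lines exposes upstream models under the names Onyx will see (model names and key variables are placeholders; check the LiteLLM proxy documentation for the full schema):

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY
```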
Onyx uses models in three distinct roles. Each role can be assigned to any configured provider and model.
Default chat model
Used for all user-facing conversations, document Q&A, and assistant responses. Choose the most capable model you are comfortable paying for.

Recommended: gpt-5.4, claude-opus-4-6, or gemini-3-pro-preview
Fast model
Used for latency-sensitive background steps: query rewriting, intent classification, and determining whether a document is relevant to a question. Should be a smaller, faster, cheaper model.

Recommended: gpt-5.2, claude-sonnet-4-6, or gemini-3-flash-preview
Embedding model
Used at indexing time to convert document chunks into vector representations, and at search time to embed the user’s query. Changing this model requires re-indexing all documents.

Embeddings can be generated by a locally hosted model server (the default for self-hosted Onyx) or by an API-based model. Configure the embedding model in Settings → Model Configuration.
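The re-indexing requirement follows from how vector search works: a query vector is only comparable to document vectors produced by the same model. A toy illustration (the "embeddings" here are fabricated 3-dimensional vectors, not real model output):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Document chunk embedded at indexing time with model A.
doc_vec_model_a = [0.9, 0.1, 0.0]

# The same query embedded by two different models.
query_model_a = [0.8, 0.2, 0.1]   # same model as the index
query_model_b = [0.0, 0.1, 0.9]   # different model

print(round(cosine(doc_vec_model_a, query_model_a), 3))  # → 0.984 (meaningful score)
print(round(cosine(doc_vec_model_a, query_model_b), 3))  # → 0.012 (meaningless noise)
```

Swapping the embedding model without re-indexing leaves the stored vectors in model A's space while queries arrive in model B's space, so every search degrades silently rather than failing loudly.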
Onyx routes all LLM calls through LiteLLM, which supports over 100 providers. If the provider you need is not listed in the Onyx admin UI, use the LiteLLM Proxy provider type and point it at a self-managed LiteLLM proxy that you have configured for your target provider.
You can also set LITELLM_CUSTOM_ERROR_MESSAGE_MAPPINGS in your .env file to translate provider-specific error strings into user-friendly messages shown in the chat UI.
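As a sketch, assuming the variable accepts a JSON object mapping provider error substrings to replacement messages (verify the exact format against the Onyx configuration reference):

```
LITELLM_CUSTOM_ERROR_MESSAGE_MAPPINGS='{"context_length_exceeded": "Your message is too long for the selected model. Try a shorter prompt or a model with a larger context window."}'
```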
When Contextual RAG is enabled (ENABLE_CONTEXTUAL_RAG=true), Onyx uses an additional LLM call per document chunk during indexing to generate context summaries. This uses the fast model by default (gpt-4o-mini). Enable this feature only if you are comfortable with the additional API cost and indexing latency.
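The general technique can be sketched as follows. The llm() stub stands in for a real fast-model call, and none of this is Onyx's internal code; it only illustrates why each chunk costs one extra LLM call:

```python
def llm(prompt: str) -> str:
    # Stand-in for a real fast-model API call. Here it returns a canned
    # one-line summary purely for illustration.
    return "This chunk is from the Q3 report and discusses revenue."

def contextualize(document: str, chunk: str) -> str:
    """Prepend an LLM-generated context line to a chunk before embedding.

    One extra LLM call per chunk -- the added indexing cost and latency
    mentioned above.
    """
    prompt = (
        "Document:\n" + document + "\n\n"
        "Chunk:\n" + chunk + "\n\n"
        "Briefly situate this chunk within the overall document."
    )
    context = llm(prompt)
    return context + "\n\n" + chunk

enriched = contextualize("…full document…", "Revenue grew 12% quarter over quarter.")
```

The enriched text, not the raw chunk, is what gets embedded, which is why the feature improves retrieval for chunks that are ambiguous out of context.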