Onyx routes all chat, search summarization, and fast inline tasks through one or more LLM providers that you configure. You can set different models for different roles—a powerful model for chat, a faster one for query understanding, and a locally hosted model for sensitive workloads.

Adding a provider

1. **Open LLM settings.** In the Onyx admin panel, go to Settings → LLM Providers.
2. **Choose a provider type.** Click Add Provider and select a provider from the list. Onyx ships built-in support for OpenAI, Anthropic, Azure OpenAI, Google Vertex AI, Amazon Bedrock, Ollama, LM Studio, OpenRouter, and LiteLLM Proxy.
3. **Enter credentials.** Fill in the API key (and any provider-specific fields such as base URL, region, or deployment name). Each provider section below lists exactly what is required.
4. **Select models.** After saving the provider, Onyx fetches the available model list. Mark the models you want to expose to users as visible, then set the default model for that provider.
5. **Assign model roles.** On the main LLM Providers page, designate which configured provider+model acts as the default chat model, the fast model (used for query rewriting and intent detection), and the embedding model (used during document indexing).
Only admins with access to Settings → LLM Providers can add or change providers. Changing the embedding model triggers a full reindex of all documents.
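The role assignment in step 5 amounts to a lookup from role to a configured provider+model pair. The sketch below is a hypothetical illustration of that idea, not Onyx's actual internals; the class, dictionary, and the local embedding model name are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelAssignment:
    provider: str
    model: str

# Hypothetical role table mirroring step 5: each role maps to one
# configured provider + model pair.
ROLE_ASSIGNMENTS = {
    "chat": ModelAssignment("openai", "gpt-5.4"),   # user-facing answers
    "fast": ModelAssignment("openai", "gpt-5.2"),   # query rewriting, intent detection
    "embedding": ModelAssignment("local", "example-embed-model"),  # document indexing
}

def model_for(role: str) -> ModelAssignment:
    """Return the provider+model configured for a given role."""
    return ROLE_ASSIGNMENTS[role]
```

Keeping the roles independent is what lets you pair an expensive chat model with a cheap fast model without touching either configuration.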

Supported providers

ChatGPT (OpenAI)

| Field | Value |
| --- | --- |
| Provider name (internal) | `openai` |
| Required | API key |
| Optional | Custom base URL (for proxies) |

Obtain your API key from platform.openai.com/api-keys.

Recommended models

| Role | Model |
| --- | --- |
| Default chat | gpt-5.4 |
| Fast model | gpt-5.2 |

Example configuration

API Key: sk-proj-...
Onyx automatically fetches the full list of available chat completion models from OpenAI. Timestamped model variants are hidden by default to keep the selection list clean.
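The hiding of timestamped variants could be approximated with a date-suffix filter like the one below. This is a sketch of the idea; the regex and function name are assumptions, not Onyx's actual filtering logic:

```python
import re

# Matches model names ending in a YYYY-MM-DD date suffix,
# e.g. "gpt-4o-2024-08-06" (a timestamped snapshot of "gpt-4o").
_TIMESTAMPED = re.compile(r"-\d{4}-\d{2}-\d{2}$")

def visible_models(model_names):
    """Drop timestamped snapshot variants, keeping the stable aliases."""
    return [name for name in model_names if not _TIMESTAMPED.search(name)]
```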

Model roles

Onyx uses models in three distinct roles. Each role can be assigned to any configured provider and model.
Chat model

Used for all user-facing conversations, document Q&A, and assistant responses. Choose the most capable model you are comfortable paying for. Recommended: gpt-5.4, claude-opus-4-6, or gemini-3-pro-preview.

Fast model

Used for latency-sensitive background steps: query rewriting, intent classification, and determining whether a document is relevant to a question. Should be a smaller, faster, cheaper model. Recommended: gpt-5.2, claude-sonnet-4-6, or gemini-3-flash-preview.

Embedding model

Used at indexing time to convert document chunks into vector representations, and at search time to embed the user’s query. Changing this model requires re-indexing all documents. Embeddings can be generated by a locally hosted model server (the default for self-hosted Onyx) or by an API-based model. Configure the embedding model in Settings → Model Configuration.
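The index-then-search flow behind the embedding role can be pictured with a toy example. The hash-based `toy_embed` below is a stand-in for a real embedding model (real embeddings capture semantics; this one does not), which also shows why query and documents must use the same model:

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding model: a deterministic vector
    derived from a hash of the text. Illustrative only."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Indexing time: embed every chunk once and store the vectors.
chunks = ["Reset your password in Settings", "Invoices are emailed monthly"]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Search time: embed the query with the SAME model and rank by similarity.
# Vectors from different models are not comparable, which is why changing
# the embedding model forces a full re-index.
query_vec = toy_embed("How do I reset my password?")
ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
```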

Using any LiteLLM-compatible provider

Onyx routes all LLM calls through LiteLLM, which supports over 100 providers. If the provider you need is not listed in the Onyx admin UI, use the LiteLLM Proxy provider type and point it at a self-managed LiteLLM proxy that you have configured for your target provider.
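A minimal LiteLLM proxy `config.yaml` for an unlisted provider might look like the following. This is a sketch; the model name and environment variable are placeholders you would replace with your target provider's values:

```yaml
model_list:
  - model_name: my-custom-model            # the name Onyx will see
    litellm_params:
      model: provider/model-id             # LiteLLM "provider/model" string for your target
      api_key: os.environ/MY_PROVIDER_API_KEY  # read the key from the environment
```

You would then start the proxy (typically `litellm --config config.yaml`) and point Onyx's LiteLLM Proxy provider at the proxy's base URL.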
You can also set LITELLM_CUSTOM_ERROR_MESSAGE_MAPPINGS in your .env file to translate provider-specific error strings into user-friendly messages shown in the chat UI.

Contextual RAG and fast model usage

When Contextual RAG is enabled (ENABLE_CONTEXTUAL_RAG=true), Onyx uses an additional LLM call per document chunk during indexing to generate context summaries. This uses the fast model by default (gpt-4o-mini). Enable this feature only if you are comfortable with the additional API cost and indexing latency.
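The per-chunk call can be pictured as building a short summarization prompt for each chunk and prepending the result before embedding. The function below is a hypothetical sketch of the technique, not Onyx's actual prompt:

```python
def build_context_prompt(document_text: str, chunk: str) -> str:
    """Assemble the prompt sent to the fast model for one chunk.
    The model's answer (a short context summary) is prepended to the
    chunk before it is embedded. One call per chunk, so cost and
    indexing latency scale with the number of chunks."""
    return (
        "<document>\n"
        f"{document_text}\n"
        "</document>\n"
        "Here is a chunk from the document above:\n"
        f"<chunk>\n{chunk}\n</chunk>\n"
        "Write a short context situating this chunk within the document."
    )
```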
