TrinaxAI is configured entirely through environment variables loaded from a .env file in the repo root. Every variable is optional — the system auto-detects your hardware and applies profile-based defaults. Copy .env.example to .env and override only what you need.
cp .env.example .env
All defaults shown below are for the 16gb profile (the factory default). Actual runtime defaults vary by the active TRINAXAI_PROFILE. Run trinaxai doctor to inspect what values are in effect.
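The precedence rule is that an explicit `.env` value wins and anything unset falls back to the active profile. The sketch below is a minimal illustration of that rule, not TrinaxAI's actual loader. `PROFILE_DEFAULTS` and `resolve` are hypothetical names. The table holds only two variables, with values taken from this page, and it assumes that "low-resource" means the 4gb and 8gb profiles.

```python
import os
from typing import Mapping, Optional

# Hypothetical excerpt of per-profile defaults, using values quoted on this page.
# Assumption: "low-resource" refers to the 4gb and 8gb profiles.
PROFILE_DEFAULTS = {
    "4gb": {"TRINAXAI_NUM_CTX": "2048", "TRINAXAI_EMBED_BATCH": "1"},
    "8gb": {"TRINAXAI_NUM_CTX": "2048", "TRINAXAI_EMBED_BATCH": "1"},
    "16gb": {"TRINAXAI_NUM_CTX": "4096", "TRINAXAI_EMBED_BATCH": "8"},
    "max": {"TRINAXAI_NUM_CTX": "8192", "TRINAXAI_EMBED_BATCH": "8"},
    "ultra": {"TRINAXAI_NUM_CTX": "16384", "TRINAXAI_EMBED_BATCH": "16"},
}


def resolve(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the override for `name` if set, else the active profile's default."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if value != "":
        return value  # an explicit .env / environment value always wins
    profile = env.get("TRINAXAI_PROFILE", "16gb")  # factory default profile
    return PROFILE_DEFAULTS[profile][name]
```

For example, with an empty environment `resolve("TRINAXAI_NUM_CTX", {})` gives `"4096"`. Setting only `TRINAXAI_PROFILE=ultra` gives `"16384"`, and adding `TRINAXAI_NUM_CTX=8192` overrides the profile.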
These two variables are the highest-level controls. They cascade into dozens of downstream defaults — model sizes, context windows, embed workers, chunk sizes, and retrieval depth.
| Variable | Type | Default | Description |
|---|---|---|---|
| TRINAXAI_PROFILE | string | 16gb | Hardware profile. Valid values: 4gb, 8gb, 16gb, max, ultra. Auto-detected by the installer; override manually when needed. |
| TRINAXAI_PERFORMANCE_MODE | string | fast | Trade-off between speed and retrieval quality. fast: smaller chunks, lower TOP_K, snappier responses. balanced: middle ground. quality: larger context, more retrieval candidates, slower. |
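Both values come from small fixed sets, so a typo is easy to catch before startup. The sketch below is a hypothetical check, not TrinaxAI's own validation. `read_top_level` is an illustrative name. The sketch accepts only the values listed in the table and falls back to 16gb / fast when a variable is unset. The real installer auto-detects the profile instead.

```python
from typing import Mapping, Tuple

VALID_PROFILES = ("4gb", "8gb", "16gb", "max", "ultra")
VALID_MODES = ("fast", "balanced", "quality")


def read_top_level(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (profile, mode), falling back to the factory defaults when unset."""
    profile = env.get("TRINAXAI_PROFILE") or "16gb"
    mode = env.get("TRINAXAI_PERFORMANCE_MODE") or "fast"
    if profile not in VALID_PROFILES:
        raise ValueError(f"unknown TRINAXAI_PROFILE {profile!r}; expected one of {VALID_PROFILES}")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown TRINAXAI_PERFORMANCE_MODE {mode!r}; expected one of {VALID_MODES}")
    return profile, mode
```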
To switch profiles, pass --profile at install time (for example, --profile ultra). You can also set TRINAXAI_PROFILE in .env (for example, TRINAXAI_PROFILE=max), which changes the profile without reinstalling. See Hardware Profiles for the full per-profile breakdown.
Leaving TRINAXAI_HOST=0.0.0.0 exposes the RAG API on your local network. This is intentional for LAN PWA/phone access, but make sure TRINAXAI_ALLOW_LAN_SYSTEM is 0 (the default) unless you understand the implications.
TrinaxAI maintains a fleet of four models and automatically routes each query to the most appropriate one. You can override individual fleet slots or disable the router entirely.
| Variable | Type | Default (16gb) | Description |
|---|---|---|---|
| TRINAXAI_MODEL_GENERAL | string | llama3.2:3b | General-purpose chat model for non-code questions. Low-resource profiles default to llama3.2:1b. |
| TRINAXAI_MODEL_CODE | string | qwen2.5-coder:3b | Code-focused model for regular coding tasks. Low-resource profiles default to qwen2.5-coder:1.5b. |
| TRINAXAI_MODEL_DEEP | string | qwen2.5-coder:3b | Deep-analysis model for complex refactoring, architecture, and multi-file queries. max/32gb profiles default to qwen2.5-coder:7b; ultra defaults to qwen2.5-coder:14b. Low-resource profiles use qwen2.5-coder:1.5b. |
| TRINAXAI_MODEL_FAST | string | llama3.2:3b | Lightweight model for short, trivial queries (greetings, simple lookups). Defaults to TRINAXAI_MODEL_GENERAL. |
| TRINAXAI_LLM | string | qwen2.5-coder:3b | Fallback model used when TRINAXAI_AUTO_ROUTE=0. Defaults to TRINAXAI_MODEL_CODE. |
| TRINAXAI_LLM_HEAVY | string | qwen2.5-coder:3b | Heavy fallback model used when TRINAXAI_AUTO_ROUTE=0. Defaults to TRINAXAI_MODEL_DEEP (profile-dependent: 3b on 16gb, 7b on max, 14b on ultra). |
| TRINAXAI_AUTO_ROUTE | bool | 1 | Set to 0 to disable the heuristic model router and always use TRINAXAI_LLM. |
```bash
# Fleet overrides: only set what you want to change
TRINAXAI_MODEL_GENERAL=llama3.2:3b
TRINAXAI_MODEL_CODE=qwen2.5-coder:3b
TRINAXAI_MODEL_DEEP=qwen2.5-coder:7b
TRINAXAI_MODEL_FAST=llama3.2:3b

# Disable router and always use a single model
# TRINAXAI_AUTO_ROUTE=0
# TRINAXAI_LLM=qwen2.5-coder:3b
```
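Several fleet slots default to another slot rather than to a fixed model. The sketch below shows how those "defaults to" chains compose with overrides: FAST falls back to GENERAL, LLM to CODE, and LLM_HEAVY to DEEP. It is a hypothetical illustration that hard-codes the 16gb defaults from the fleet table. `resolve_fleet` is not a TrinaxAI function.

```python
from typing import Dict, Mapping, Union

# 16gb factory defaults for the three base slots, from the fleet table.
FLEET_16GB = {
    "TRINAXAI_MODEL_GENERAL": "llama3.2:3b",
    "TRINAXAI_MODEL_CODE": "qwen2.5-coder:3b",
    "TRINAXAI_MODEL_DEEP": "qwen2.5-coder:3b",
}


def resolve_fleet(env: Mapping[str, str]) -> Dict[str, Union[str, bool]]:
    """Apply .env overrides, then the documented fallback chains between slots."""

    def get(name: str, fallback: str) -> str:
        return env.get(name) or fallback  # unset or empty -> fallback

    general = get("TRINAXAI_MODEL_GENERAL", FLEET_16GB["TRINAXAI_MODEL_GENERAL"])
    code = get("TRINAXAI_MODEL_CODE", FLEET_16GB["TRINAXAI_MODEL_CODE"])
    deep = get("TRINAXAI_MODEL_DEEP", FLEET_16GB["TRINAXAI_MODEL_DEEP"])
    return {
        "general": general,
        "code": code,
        "deep": deep,
        "fast": get("TRINAXAI_MODEL_FAST", general),  # FAST -> MODEL_GENERAL
        "llm": get("TRINAXAI_LLM", code),  # LLM -> MODEL_CODE
        "llm_heavy": get("TRINAXAI_LLM_HEAVY", deep),  # LLM_HEAVY -> MODEL_DEEP
        "auto_route": get("TRINAXAI_AUTO_ROUTE", "1") != "0",
    }
```

One practical consequence is that overriding only TRINAXAI_MODEL_DEEP also changes the heavy fallback used when routing is off, unless TRINAXAI_LLM_HEAVY is set explicitly.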
How auto-routing works
The router classifies each query using an offline heuristic — no LLM call required. It checks for:
Deep hints (query length > 600 chars, words like refactor, architecture, debug, security) → routes to MODEL_DEEP
Code hints (backtick present, keywords like function, import, .py, .ts) → routes to MODEL_CODE
Trivial queries (< 25 chars, greetings) → routes to MODEL_FAST
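The three checks above can be sketched as a plain function. This is an approximation of the described heuristic, not the router's source. The keyword lists are only the examples named above, so the real lists are probably longer. The sketch also assumes that checks run in the listed order and that anything unmatched goes to MODEL_GENERAL. Words are matched whole, so "important" does not count as "import".

```python
import re

# Example keywords from the description above; the real router's lists may differ.
DEEP_WORDS = ("refactor", "architecture", "debug", "security")
CODE_WORDS = ("function", "import")
CODE_EXTENSIONS = (".py", ".ts")
GREETINGS = ("hi", "hello", "hey", "thanks")


def _has_word(text: str, words) -> bool:
    """True if any of `words` appears as a whole word in `text`."""
    return any(re.search(rf"\b{re.escape(w)}\b", text) for w in words)


def route(query: str) -> str:
    """Return the fleet slot a query would be sent to under these rules."""
    q = query.strip().lower()
    if len(q) > 600 or _has_word(q, DEEP_WORDS):
        return "MODEL_DEEP"
    if "`" in q or _has_word(q, CODE_WORDS) or any(ext in q for ext in CODE_EXTENSIONS):
        return "MODEL_CODE"
    if len(q) < 25 or q.rstrip("!.?") in GREETINGS:
        return "MODEL_FAST"
    return "MODEL_GENERAL"  # assumption: everything else uses the general model
```

Because no model is called, routing costs only a few string scans per query.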
Controls which embedding model is used for indexing and retrieval. Three presets are available; you can also override the model and dimensions individually.
| Variable | Type | Default (16gb) | Description |
|---|---|---|---|
| TRINAXAI_EMBED_PRESET | string | balanced | Embedding preset. balanced = bge-m3 (multilingual, 1024 dims). lite = nomic-embed-text (768 dims, English-leaning). fast = all-minilm (384 dims, smallest). Low-resource profiles default to lite. |
| TRINAXAI_EMBED | string | bge-m3 | Override the embedding model name directly, bypassing the preset. |
| TRINAXAI_EMBED_DIMS | int | 1024 | Embedding vector dimensions. Must match the model. Set automatically by preset; override only if using a custom model. |
```bash
# Use the balanced preset (default for 16gb+)
# TRINAXAI_EMBED_PRESET=balanced   # bge-m3, 1024 dims, multilingual
# TRINAXAI_EMBED_PRESET=lite       # nomic-embed-text, 768 dims, faster
# TRINAXAI_EMBED_PRESET=fast       # all-minilm, 384 dims, smallest

# Or override model + dims manually
# TRINAXAI_EMBED=bge-m3
# TRINAXAI_EMBED_DIMS=1024
```
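A preset is shorthand for a (model, dimensions) pair, and the two individual variables override each half separately. That is why TRINAXAI_EMBED_DIMS has to agree with the model. The sketch below is a hypothetical resolver plus a dimension check. In the sketch, setting TRINAXAI_EMBED alone keeps the preset's dimensions, which is exactly the mismatch to watch for. The function names and the fail-fast behavior are illustrative assumptions.

```python
from typing import Mapping, Sequence, Tuple

# Preset -> (model, dims), as listed for TRINAXAI_EMBED_PRESET on this page.
EMBED_PRESETS = {
    "balanced": ("bge-m3", 1024),
    "lite": ("nomic-embed-text", 768),
    "fast": ("all-minilm", 384),
}


def resolve_embedding(env: Mapping[str, str]) -> Tuple[str, int]:
    """Start from the preset, then apply TRINAXAI_EMBED / TRINAXAI_EMBED_DIMS overrides."""
    preset = env.get("TRINAXAI_EMBED_PRESET") or "balanced"  # 16gb default
    model, dims = EMBED_PRESETS[preset]
    model = env.get("TRINAXAI_EMBED") or model
    dims = int(env.get("TRINAXAI_EMBED_DIMS") or dims)
    return model, dims


def check_vector(vector: Sequence[float], expected_dims: int) -> None:
    """Fail fast if an embedding does not have the configured dimensionality."""
    if len(vector) != expected_dims:
        raise ValueError(
            f"embedding has {len(vector)} dims, TRINAXAI_EMBED_DIMS is {expected_dims}"
        )
```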
| Variable | Type | Default (16gb) | Description |
|---|---|---|---|
| TRINAXAI_NUM_CTX | int | 4096 | Token context window passed to Ollama. It must hold the system prompt, the retrieved chunks, and the response. Auto-scaled by profile: 2048 (low-resource), 4096 (16gb), 8192 (max), 16384 (ultra). |
| TRINAXAI_NUM_THREAD | int | 8 | CPU threads allocated per Ollama request. 8 threads avoid over-subscription when several embedding workers run concurrently. Raise on high-core-count machines. |
```bash
# TRINAXAI_NUM_CTX=4096
# TRINAXAI_NUM_THREAD=8
```
Setting TRINAXAI_NUM_THREAD too high (e.g., equal to your total core count) can make concurrent embeddings slower, not faster, due to thread contention. The default of 8 is intentionally conservative.
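The context-window requirement is simple arithmetic: system prompt + TOP_K × chunk size + response must not exceed NUM_CTX. The sketch below does that sum. The token counts in the examples (300 for the system prompt, 512 per chunk, 1024 for the response) are illustrative assumptions, since this page does not state chunk sizes in tokens.

```python
def context_budget(num_ctx: int, system_tokens: int, chunk_tokens: int,
                   top_k: int, response_tokens: int) -> int:
    """Tokens left in the window; a negative result means the prompt overflows."""
    used = system_tokens + top_k * chunk_tokens + response_tokens
    return num_ctx - used


# 16gb defaults (NUM_CTX=4096, TOP_K=4) with the assumed sizes: 724 tokens spare.
print(context_budget(4096, 300, 512, 4, 1024))
# Raising TOP_K to 8 without raising NUM_CTX overflows by 1324 tokens.
print(context_budget(4096, 300, 512, 8, 1024))
```

This is why TOP_K and NUM_CTX scale together across profiles. At 16384 tokens (ultra), the same TOP_K=8 prompt still leaves plenty of room.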
Number of concurrent workers sending embedding requests to Ollama. Profile defaults: 1 (low-resource), 2 (16gb), 4 (max), 6 (ultra). Maximum value: 16.
| Variable | Type | Default (16gb) | Description |
|---|---|---|---|
| TRINAXAI_EMBED_BATCH | int | 8 | Number of text chunks sent to Ollama per embedding call. Larger batches reduce per-request HTTP overhead. Profile defaults: 1 (low-resource), 8 (16gb/max), 16 (ultra). |
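Worker count and batch size interact. The chunks are cut into batches of EMBED_BATCH, so indexing N chunks takes ceil(N / EMBED_BATCH) embedding calls, and at most the worker count of those calls are in flight at once. The sketch below uses a thread pool to show this. `embed_batch` is a hypothetical stand-in for one Ollama embedding request, and the pool is only an assumption about how the concurrency works, not TrinaxAI's indexer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical stand-in for one embedding call: a batch of texts in,
# one vector per text out, in the same order.
EmbedBatchFn = Callable[[Sequence[str]], List[Vector]]

MAX_WORKERS = 16  # documented upper bound for the worker count


def batched(items: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def embed_all(chunks: Sequence[str], embed_batch: EmbedBatchFn,
              batch_size: int = 8, workers: int = 2) -> List[Vector]:
    """Embed every chunk with `workers` concurrent calls of up to `batch_size` texts."""
    batches = batched(chunks, max(1, batch_size))
    with ThreadPoolExecutor(max_workers=min(max(1, workers), MAX_WORKERS)) as pool:
        # pool.map yields results in submission order, so vectors line up with chunks.
        return [vec for batch in pool.map(embed_batch, batches) for vec in batch]
```

With the 16gb defaults (batch 8, 2 workers), 20 chunks become three calls (8 + 8 + 4 texts), and two of those can run at the same time.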
Controls how many chunks are retrieved and how they are ranked before being passed to the LLM.
| Variable | Type | Default (16gb) | Description |
|---|---|---|---|
| TRINAXAI_SIMILARITY_TOP_K | int | 4 | Final number of chunks injected into the LLM prompt. Profile/mode defaults: 3 (low-resource fast), 4 (16gb fast), 5 (16gb quality / max), 6 (ultra), 8 (ultra quality). Raise for broader answers; lower to reduce latency. |
| TRINAXAI_FUSION_CANDIDATES | int | 8 | Candidates each retriever (vector search and BM25 keyword search) contributes before fusion. Profile defaults: 6 (low-resource fast), 8 (16gb fast), 12 (max), 20 (ultra), 32 (ultra quality). |
| TRINAXAI_RETRIEVAL_CACHE_SECONDS | int | 20 | TTL (seconds) for cached RAG retrieval results. Identical repeated queries skip re-embedding and re-retrieval. Set to 0 to disable. |
| TRINAXAI_SOURCES_CACHE_SECONDS | int | 30 | TTL (seconds) for cached Knowledge Browser source/chunk listings. |
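This page does not say which fusion method merges the two result lists. The sketch below therefore uses reciprocal rank fusion (RRF), a common choice for combining vector and BM25 rankings, purely to illustrate how FUSION_CANDIDATES and TOP_K interact. Each retriever contributes its top FUSION_CANDIDATES ids. Every id scores 1/(k + rank) in each list it appears in, and the best TOP_K ids survive. The constant k=60 is the value conventionally used with RRF, not a TrinaxAI setting.

```python
from typing import Dict, List, Sequence


def rrf_fuse(vector_ids: Sequence[str], bm25_ids: Sequence[str],
             fusion_candidates: int = 8, top_k: int = 4, k: int = 60) -> List[str]:
    """Reciprocal rank fusion over the top `fusion_candidates` of each retriever."""
    scores: Dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking[:fusion_candidates], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


print(rrf_fuse(["a", "b", "c", "d"], ["c", "a", "e", "f"]))
```

Chunks that rank well in both lists ("a" and "c" here) rise to the top. Raising FUSION_CANDIDATES gives the fusion step more to choose from without changing how many chunks reach the prompt.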
Cross-encoder reranking can substantially improve retrieval precision. It re-scores each candidate by reading the query and the chunk together, rather than comparing precomputed embeddings. Disabled by default; requires additional Python dependencies.
| Variable | Type | Default | Description |
|---|---|---|---|
| TRINAXAI_RERANK | bool | 0 | Set to 1 to enable cross-encoder reranking. Requires pip install -r requirements-rerank.txt. Loads a ~2 GB model into RAM. |
| TRINAXAI_RERANK_MODEL | string | BAAI/bge-reranker-v2-m3 | HuggingFace model ID for the cross-encoder. bge-reranker-v2-m3 is multilingual and state-of-the-art. |
| TRINAXAI_RERANK_TOP_N | int | (matches TOP_K) | Number of candidates to keep after reranking. Defaults to TRINAXAI_SIMILARITY_TOP_K. |
Enable reranking on max or ultra profiles for the best precision. These profiles fetch a large pool of FUSION_CANDIDATES, and the reranker narrows that pool down to the RERANK_TOP_N most relevant chunks. On these profiles, the precision gain is usually worth the extra RAM.
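Reranking fits between fusion and the prompt. It scores every fused candidate against the query and keeps RERANK_TOP_N of them, falling back to TOP_K when RERANK_TOP_N is unset. The sketch below leaves the scorer pluggable (`score_fn`, a hypothetical parameter). With the `sentence-transformers` library, for example, it could wrap `CrossEncoder(model_id).predict`, which scores a list of (query, passage) pairs. Whether TrinaxAI uses that library is not stated on this page. The test uses a toy word-overlap scorer instead of a real model.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical scorer: returns one relevance score per candidate for the query.
ScoreFn = Callable[[str, Sequence[str]], Sequence[float]]


def rerank(query: str, candidates: Sequence[str], score_fn: ScoreFn,
           top_k: int = 4, top_n: Optional[int] = None) -> List[str]:
    """Re-score fused candidates and keep the best `top_n` (default: `top_k`)."""
    keep = top_k if top_n is None else top_n  # RERANK_TOP_N defaults to TOP_K
    scores = score_fn(query, candidates)  # score all pairs in one batch
    # sorted() is stable, so equal scores keep their fused order.
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:keep]]
```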
Number of model layers to offload to the GPU (a native Ollama setting). Leave empty for Ollama auto-detection, or set explicitly (e.g., 35) to control VRAM usage.
| Variable | Type | Default | Description |
|---|---|---|---|
| TRINAXAI_AGGRESSIVE_QUANT | bool | 0 | Set to 1 to enable an aggressive, Q4_K_M-style quantization profile. Reduces RAM usage at some cost in quality. The setting is reported by the /health endpoint, and Ollama hints are configured accordingly. |
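Context size, thread count, and GPU layers all end up as per-request options to Ollama. Ollama's generate/chat API accepts `num_ctx`, `num_thread`, and `num_gpu` in its `options` object. The sketch below assumes TrinaxAI forwards its settings that way. It takes the GPU-layer value as a raw string because the variable's name is not shown on this page. It also treats an empty value as "omit `num_gpu`", which matches "leave empty for Ollama auto-detection". The helper names are hypothetical.

```python
from typing import Any, Dict, Optional


def parse_gpu_layers(raw: str) -> Optional[int]:
    """An empty value means 'let Ollama decide'; otherwise it is an explicit layer count."""
    raw = raw.strip()
    return int(raw) if raw else None


def ollama_options(num_ctx: int = 4096, num_thread: int = 8,
                   num_gpu: Optional[int] = None) -> Dict[str, Any]:
    """Build the `options` object for an Ollama /api/generate or /api/chat request."""
    options: Dict[str, Any] = {"num_ctx": num_ctx, "num_thread": num_thread}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu  # omitted entirely when Ollama should auto-detect
    return options


# Example request body with 35 layers offloaded (model name from the 16gb fleet).
payload = {
    "model": "qwen2.5-coder:3b",
    "prompt": "Summarize the retrieved chunks.",
    "stream": False,
    "options": ollama_options(num_gpu=parse_gpu_layers("35")),
}
```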