TrinaxAI provides two distinct chat engines and an intelligent auto-router that silently selects the right Ollama model for every message. The router runs entirely offline — no LLM call, no latency — so you always get the best model without waiting.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/TrinaxCode/TrinaxAI/llms.txt
Use this file to discover all available pages before exploring further.
Two Chat Engines
RAG Engine
Retrieves relevant chunks from your indexed codebase before generating a response. Every answer includes source citations (file, project, snippet, score). Best for questions about your code or documents you’ve indexed.
Ollama Engine
Sends messages directly to Ollama with no retrieval step. Faster, more creative, and better for general knowledge questions that don’t need codebase context.
Switching Engines
- CLI
- PWA
Auto-Routing Heuristic
When auto-routing is active (TRINAXAI_AUTO_ROUTE=1, the default), TrinaxAI calls route_model() in config.py on every query. This function runs in microseconds with no LLM call:
function, def , class , import, const , react, python, typescript, api, endpoint, sql, bug, error, docker, git, .py, .ts, .js, and more.
Deep hints — keywords that indicate complexity requiring the larger model: refactor, architecture, debug, performance, security, explain in detail, step by step, analyze, review, and more. Messages longer than 600 characters are also routed to the deep model.
Fast route — messages shorter than 25 characters (greetings, one-word questions) use MODEL_FAST to minimise latency.
Model Fleet
The model assigned to each role depends on your hardware profile. All model names are real Ollama model identifiers.| Role | Variable | 8gb Profile | 16gb Profile | max Profile | ultra Profile |
|---|---|---|---|---|---|
| General | TRINAXAI_MODEL_GENERAL | llama3.2:1b | llama3.2:3b | llama3.2:3b | llama3.2:3b |
| Code | TRINAXAI_MODEL_CODE | qwen2.5-coder:1.5b | qwen2.5-coder:3b | qwen2.5-coder:3b | qwen2.5-coder:3b |
| Deep | TRINAXAI_MODEL_DEEP | qwen2.5-coder:1.5b | qwen2.5-coder:3b | qwen2.5-coder:7b | qwen2.5-coder:14b |
| Fast | TRINAXAI_MODEL_FAST | llama3.2:1b | llama3.2:3b | llama3.2:3b | llama3.2:3b |
On the
8gb profile, MODEL_DEEP falls back to MODEL_CODE since there isn’t enough RAM for larger models. On ultra, the deep model scales up to qwen2.5-coder:14b.Hardware Profiles and Context Windows
Each profile sets a defaultNUM_CTX (the Ollama context window in tokens) that fits within the available RAM alongside the model and embeddings.
| Profile | RAM Target | NUM_CTX | Embed Workers | Embed Batch |
|---|---|---|---|---|
8gb | ~8 GB | 2048 | 1 | 1 |
16gb | ~16 GB | 4096 | 2 | 8 |
max | 32 GB+ | 8192 | 4 | 8 |
ultra | 64 GB+ / GPU | 16384 | 6 | 16 |
TRINAXAI_NUM_CTX=<value>. The context window must fit: system prompt + retrieved chunks + conversation history + response.
Streaming SSE Chat
Both the RAG and Ollama engines stream responses to the PWA using Server-Sent Events (SSE). The RAG stream fromPOST /v1/chat/completions emits:
{"trinaxai": {"model": "...", "project": "..."}}— metadata header{"choices": [{"delta": {"content": "token"}}]}— one event per token{"trinaxai_sources": [...]}— source citations after the full responsedata: [DONE]— stream terminator
Conversation History and Context
Each chat session maintains a conversation history inlocalStorage. When you send a message, the last 4 assistant/user turns are included in the synthesis prompt under CONVERSACIÓN PREVIA. This lets the model understand follow-up questions without needing an explicit query rewriter.
The retrieval query is also enriched: it prepends the previous user turn to the current message, so “and what about the tests?” correctly retrieves test-related chunks even though the current message alone has no context.
Model Keep-Alive
TRINAXAI_KEEP_ALIVE controls how long Ollama keeps a model loaded in RAM after responding. Keeping the model warm avoids the reload cost (~1–5 seconds) on the next request.
| Profile | Default Keep-Alive |
|---|---|
8gb | 0s (unload immediately — RAM is tight) |
16gb | 10m (in fast mode) |
max | 30m |
ultra | 60m |
TRINAXAI_EMBED_KEEP_ALIVE, default 15m) because it’s called frequently during indexing and search — keeping it loaded prevents sawtooth RAM usage during batch operations.
Chat Export
From the PWA sidebar, any conversation can be exported:- Markdown — raw
.mdfile with the full exchange - PDF — formatted PDF via the browser’s print dialog
- Word —
.docxexport (where supported)