Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/TrinaxCode/TrinaxAI/llms.txt

Use this file to discover all available pages before exploring further.

TrinaxAI provides two distinct chat engines and an intelligent auto-router that silently selects the right Ollama model for every message. The router runs entirely offline — no LLM call, no latency — so you always get the best model without waiting.

Two Chat Engines

RAG Engine

Retrieves relevant chunks from your indexed codebase before generating a response. Every answer includes source citations (file, project, snippet, score). Best for questions about your code or documents you’ve indexed.

Ollama Engine

Sends messages directly to Ollama with no retrieval step. Faster, more creative, and better for general knowledge questions that don’t need codebase context.

Switching Engines

# Force RAG engine (retrieval + citations)
trinaxai --engine rag

# Force Ollama engine (direct chat, no retrieval)
trinaxai --engine ollama

# Default: auto-detect based on whether an index exists
trinaxai chat

Auto-Routing Heuristic

When auto-routing is active (TRINAXAI_AUTO_ROUTE=1, the default), TrinaxAI calls route_model() in config.py on every query. This function runs in microseconds with no LLM call:
def route_model(text: str) -> str:
    t = text.lower()
    is_code = ("`" in text) or any(h in t for h in _CODE_HINTS)
    is_deep = len(text) > 600 or any(h in t for h in _DEEP_HINTS)
    if is_deep:
        return MODEL_DEEP   # complex (code or not) → large model
    if is_code:
        return MODEL_CODE   # regular code → coder model
    if len(text.strip()) < 25:
        return MODEL_FAST   # greeting / trivial → ultra-fast
    return MODEL_GENERAL    # general chat → llama3.2
Code hints — keywords that indicate a coding question: function, def , class , import, const , react, python, typescript, api, endpoint, sql, bug, error, docker, git, .py, .ts, .js, and more. Deep hints — keywords that indicate complexity requiring the larger model: refactor, architecture, debug, performance, security, explain in detail, step by step, analyze, review, and more. Messages longer than 600 characters are also routed to the deep model. Fast route — messages shorter than 25 characters (greetings, one-word questions) use MODEL_FAST to minimise latency.

Model Fleet

The model assigned to each role depends on your hardware profile. All model names are real Ollama model identifiers.
RoleVariable8gb Profile16gb Profilemax Profileultra Profile
GeneralTRINAXAI_MODEL_GENERALllama3.2:1bllama3.2:3bllama3.2:3bllama3.2:3b
CodeTRINAXAI_MODEL_CODEqwen2.5-coder:1.5bqwen2.5-coder:3bqwen2.5-coder:3bqwen2.5-coder:3b
DeepTRINAXAI_MODEL_DEEPqwen2.5-coder:1.5bqwen2.5-coder:3bqwen2.5-coder:7bqwen2.5-coder:14b
FastTRINAXAI_MODEL_FASTllama3.2:1bllama3.2:3bllama3.2:3bllama3.2:3b
On the 8gb profile, MODEL_DEEP falls back to MODEL_CODE since there isn’t enough RAM for larger models. On ultra, the deep model scales up to qwen2.5-coder:14b.
Override any model for your specific setup by setting the corresponding env variable:
# .env
TRINAXAI_MODEL_GENERAL=llama3.2:3b
TRINAXAI_MODEL_CODE=qwen2.5-coder:3b
TRINAXAI_MODEL_DEEP=qwen2.5-coder:7b
TRINAXAI_MODEL_FAST=llama3.2:3b

Hardware Profiles and Context Windows

Each profile sets a default NUM_CTX (the Ollama context window in tokens) that fits within the available RAM alongside the model and embeddings.
ProfileRAM TargetNUM_CTXEmbed WorkersEmbed Batch
8gb~8 GB204811
16gb~16 GB409628
max32 GB+819248
ultra64 GB+ / GPU16384616
Override with TRINAXAI_NUM_CTX=<value>. The context window must fit: system prompt + retrieved chunks + conversation history + response.

Streaming SSE Chat

Both the RAG and Ollama engines stream responses to the PWA using Server-Sent Events (SSE). The RAG stream from POST /v1/chat/completions emits:
  1. {"trinaxai": {"model": "...", "project": "..."}} — metadata header
  2. {"choices": [{"delta": {"content": "token"}}]} — one event per token
  3. {"trinaxai_sources": [...]} — source citations after the full response
  4. data: [DONE] — stream terminator
The PWA renders tokens incrementally with Markdown support as they arrive.

Conversation History and Context

Each chat session maintains a conversation history in localStorage. When you send a message, the last 4 assistant/user turns are included in the synthesis prompt under CONVERSACIÓN PREVIA. This lets the model understand follow-up questions without needing an explicit query rewriter. The retrieval query is also enriched: it prepends the previous user turn to the current message, so “and what about the tests?” correctly retrieves test-related chunks even though the current message alone has no context.

Model Keep-Alive

TRINAXAI_KEEP_ALIVE controls how long Ollama keeps a model loaded in RAM after responding. Keeping the model warm avoids the reload cost (~1–5 seconds) on the next request.
ProfileDefault Keep-Alive
8gb0s (unload immediately — RAM is tight)
16gb10m (in fast mode)
max30m
ultra60m
# .env — keep models warm for 30 minutes
TRINAXAI_KEEP_ALIVE=30m

# Unload after every request to free RAM
TRINAXAI_KEEP_ALIVE=0s
The embedding model has a separate keep-alive (TRINAXAI_EMBED_KEEP_ALIVE, default 15m) because it’s called frequently during indexing and search — keeping it loaded prevents sawtooth RAM usage during batch operations.

Chat Export

From the PWA sidebar, any conversation can be exported:
  • Markdown — raw .md file with the full exchange
  • PDF — formatted PDF via the browser’s print dialog
  • Word.docx export (where supported)
Exports include message timestamps, the engine and model used, and any source citations returned by the RAG engine.

Build docs developers (and LLMs) love