TrinaxAI exposes two lightweight monitoring endpoints that do not require authorization. /health provides a comprehensive snapshot of the system state, including whether the index is loaded, which models are available, the current hardware profile, and which optional features are active. /resources reports local RAM telemetry for the PWA’s resource meter; its vram field is reserved for future use and is currently always null.
Both /health and /resources are open to all trusted CORS origins (localhost ports 3334/3335 and private LAN IPs) without an admin token. They are safe to poll frequently from the PWA.

GET /health

Returns a comprehensive system status snapshot. Intended to be polled by the PWA on startup and periodically to detect index readiness, Ollama availability, and active feature flags.

Response

ok
boolean
Always true when the RAG API is running. (The API returning any response at all means it is healthy.)
indexed
boolean
true if the hybrid retriever (vector + BM25) is loaded and ready to answer queries.
ollama
boolean
true if Ollama is reachable at the configured OLLAMA_BASE_URL. Cached for 5 seconds to avoid hammering the Ollama process.
projects
array
Sorted list of project names detected in the indexed content. Empty if the index is not loaded.
collections
array
Full list of collection objects (same shape as GET /collections).
models
array
The active model fleet, ordered by preference. Derived from the current profile and the TRINAXAI_MODEL_* environment variables. Typical 16gb fleet: ["qwen2.5-coder:3b", "qwen2.5-coder:7b", "llama3.2:3b"].
profile
string
Active hardware profile. One of: "8gb", "16gb", "max", "ultra" (or custom). Controls model selection, chunk sizes, context window, and retrieval depth.
num_ctx
integer
Context window size in tokens for the active profile. 2048 for low-resource, 4096 default, 8192 for max, 16384 for ultra.
embed_workers
integer
Number of concurrent embedding workers. Higher = faster indexing but more RAM pressure.
embed_batch_size
integer
Number of chunks per embedding batch request.
embed_keep_alive
string
How long Ollama keeps the embedding model loaded in memory between requests (e.g. "15m").
performance_mode
string
One of "fast", "balanced", "quality". Controls retrieval depth and cache TTLs.
fusion_candidates
integer
Number of candidates each retriever (vector + BM25) contributes before reciprocal rank fusion.
similarity_top_k
integer
Final number of chunks injected into the LLM context after fusion (and optional reranking).
retrieval_cache_seconds
integer
TTL for the retrieval result cache. 0 disables caching.
rerank
boolean
true if the cross-encoder reranker (BAAI/bge-reranker-v2-m3) is enabled via TRINAXAI_RERANK=1.
features
object
Active feature flags.
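The ollama field is described as cached for 5 seconds. As a rough illustration of that idea (not TrinaxAI's actual implementation), the sketch below wraps any boolean probe in a small time-to-live cache. The class name TTLCachedProbe and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional


class TTLCachedProbe:
    """Cache the boolean result of a probe for a fixed number of seconds.

    Hypothetical helper: illustrates the 5-second caching behaviour
    described for the `ollama` field, not the server's real code.
    """

    def __init__(
        self,
        probe: Callable[[], bool],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[bool] = None
        self._expires_at = 0.0

    def get(self) -> bool:
        """Return the cached value, re-running the probe once the TTL has elapsed."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._probe()
            self._expires_at = now + self._ttl
        return self._value
```

With a cache like this, polling /health many times per second would trigger at most one reachability check every five seconds.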

Example

curl http://localhost:3333/health
Response (16gb profile, index loaded)
{
  "ok": true,
  "indexed": true,
  "ollama": true,
  "projects": ["AdminPanel", "Insider", "MyApp"],
  "collections": [
    {
      "id": "default",
      "name": "General",
      "created_at": 1718000000.0,
      "updated_at": 1718000000.0
    },
    {
      "id": "my-project",
      "name": "My Project",
      "created_at": 1718100000.0,
      "updated_at": 1718200000.0
    }
  ],
  "models": ["qwen2.5-coder:3b", "qwen2.5-coder:7b", "llama3.2:3b"],
  "profile": "16gb",
  "num_ctx": 4096,
  "embed_workers": 2,
  "embed_batch_size": 8,
  "embed_keep_alive": "15m",
  "performance_mode": "fast",
  "fusion_candidates": 8,
  "similarity_top_k": 4,
  "retrieval_cache_seconds": 20,
  "rerank": false,
  "features": {
    "folder_upload_indexing": true,
    "hybrid_retrieval": true,
    "sources": true,
    "collections": true,
    "local_app_state": true,
    "resources": true,
    "lan_system_actions": true,
    "profiles": ["8gb", "16gb", "max", "ultra"]
  }
}
Response (no index yet)
{
  "ok": true,
  "indexed": false,
  "ollama": true,
  "projects": [],
  "collections": [
    {
      "id": "default",
      "name": "General",
      "created_at": 1718000000.0,
      "updated_at": 1718000000.0
    }
  ],
  "models": ["qwen2.5-coder:3b", "qwen2.5-coder:7b", "llama3.2:3b"],
  "profile": "16gb",
  "num_ctx": 4096,
  "embed_workers": 2,
  "embed_batch_size": 8,
  "embed_keep_alive": "15m",
  "performance_mode": "fast",
  "fusion_candidates": 8,
  "similarity_top_k": 4,
  "retrieval_cache_seconds": 20,
  "rerank": false,
  "features": {
    "folder_upload_indexing": true,
    "hybrid_retrieval": true,
    "sources": true,
    "collections": true,
    "local_app_state": true,
    "resources": true,
    "lan_system_actions": true,
    "profiles": ["8gb", "16gb", "max", "ultra"]
  }
}
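A client that polls /health on startup might reduce the payload to a single readiness state. The following is a minimal sketch under stated assumptions: the base URL comes from the curl example above, while the function names, state labels, and retry parameters are hypothetical and not part of the TrinaxAI API.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

# Base URL taken from the curl example; adjust for your deployment.
HEALTH_URL = "http://localhost:3333/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> Dict[str, Any]:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def readiness(health: Dict[str, Any]) -> str:
    """Map the documented ok/ollama/indexed booleans to one state label (labels are illustrative)."""
    if not health.get("ok"):
        return "api-down"
    if not health.get("ollama"):
        return "ollama-unreachable"
    if not health.get("indexed"):
        return "no-index"
    return "ready"


def wait_until_ready(
    fetch: Callable[[], Dict[str, Any]] = fetch_health,
    attempts: int = 10,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the API reports ready, or give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            if readiness(fetch()) == "ready":
                return True
        except OSError:
            pass  # connection refused or timed out: the API may still be starting
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

For the two sample responses above, readiness returns "ready" for the first and "no-index" for the second.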

GET /resources

Basic local RAM telemetry for the PWA’s resource usage panel. Uses psutil when available for precise used/available figures, with a graceful fallback to OS syscalls (sysconf) that only reports total RAM. VRAM reporting is reserved for a future release (vram: null).
Full RAM metrics (available, used, percent) require psutil. Without it, only total is reported and the remaining fields are null. Install with pip install psutil for complete telemetry.
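The psutil-with-fallback behaviour can be pictured roughly as follows. This is an assumption-laden sketch, not the server's code: read_ram is a hypothetical name, and the specific sysconf keys (SC_PAGE_SIZE, SC_PHYS_PAGES) are a common way to obtain total RAM on POSIX systems rather than something the documentation confirms.

```python
import os
from typing import Dict, Optional, Union

Number = Union[int, float]


def read_ram() -> Optional[Dict[str, Optional[Number]]]:
    """Return a dict shaped like the `ram` object, or None if nothing can be read."""
    try:
        import psutil  # optional dependency
    except ImportError:
        psutil = None

    if psutil is not None:
        vm = psutil.virtual_memory()
        return {
            "total": vm.total,
            "available": vm.available,
            "used": vm.used,
            "percent": vm.percent,
        }

    # Fallback (assumed approach): total RAM = page size * number of physical pages.
    try:
        total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, or a platform without these sysconf names
    if total <= 0:
        return None
    return {"total": total, "available": None, "used": None, "percent": None}
```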

Response

ok
boolean
Always true.
ram
object | null
RAM usage object with total, available, used, and percent fields. null if memory information could not be read from the OS.
vram
null
Reserved for future GPU VRAM telemetry. Always null in the current release.

Examples

curl http://localhost:3333/resources
With psutil installed
{
  "ok": true,
  "ram": {
    "total": 34359738368,
    "available": 22548578304,
    "used": 11811160064,
    "percent": 34.4
  },
  "vram": null
}
Without psutil (sysconf fallback)
{
  "ok": true,
  "ram": {
    "total": 34359738368,
    "available": null,
    "used": null,
    "percent": null
  },
  "vram": null
}
The total value is in bytes, as are available and used when present. To convert to GB: total / (1024 ** 3). For the example above: 34359738368 / 1073741824 = 32 GB.
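A resource meter has to handle three cases: full metrics, the total-only fallback, and ram being null. The sketch below shows one way a client might do this; the function names and the output format are hypothetical, and only the field names come from the responses above.

```python
from typing import Any, Dict, Optional

BYTES_PER_GB = 1024 ** 3


def bytes_to_gb(value: Optional[int]) -> Optional[float]:
    """Convert a byte count to GB (1024-based), passing None through."""
    if value is None:
        return None
    return value / BYTES_PER_GB


def format_ram(ram: Optional[Dict[str, Any]]) -> str:
    """Render the `ram` object from /resources as a short label."""
    if ram is None:
        return "RAM: unavailable"
    total_gb = bytes_to_gb(ram.get("total"))
    used_gb = bytes_to_gb(ram.get("used"))
    if total_gb is None:
        return "RAM: unavailable"
    if used_gb is None:
        # sysconf fallback: only the total is known
        return f"RAM: {total_gb:.1f} GB total"
    percent = ram.get("percent")
    if percent is None:
        percent = 100.0 * used_gb / total_gb
    return f"RAM: {used_gb:.1f} / {total_gb:.1f} GB ({percent:.1f}%)"
```

Applied to the two example responses, this yields "RAM: 11.0 / 32.0 GB (34.4%)" with psutil and "RAM: 32.0 GB total" for the sysconf fallback.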