The llama.cpp backend is the most portable inference option in Odysseus. On first launch it calls getLlamaCppAssets(hw) in src/downloader.js to query the GitHub Releases API, selects the binary variant that matches your hardware (CUDA, Vulkan, Metal, or CPU), downloads and extracts it to a local bin/ subdirectory, then starts llama-server and places a lightweight Node.js HTTP proxy in front of it. Every subsequent launch reuses the cached binary, so no download is needed after the first run.
The llama-server binary is downloaded once and cached in bin/llama-<os>-<arch>/ (or bin/llama/ on Windows). Subsequent launches skip the download entirely and go straight to server startup.
How It Works
Hardware detection
detectHardware() in src/system.js probes the platform, architecture, and GPU. It checks for an NVIDIA GPU via nvidia-smi, Apple Silicon via platform + arch, Vulkan via vulkan-1.dll on Windows or libvulkan.so.1 on Linux, and falls back to CPU if nothing is found. The detected GPU backend (cuda, vulkan, metal, or cpu) drives every subsequent decision.
Binary selection and download
getLlamaCppAssets(hw) queries the GitHub Releases API for the best-matching asset for your platform. On Linux CUDA systems it uses the ai-dock/llama.cpp-cuda release repository instead of the upstream ggml-org/llama.cpp repository to obtain CUDA-linked builds. The selected archive is downloaded and extracted, and the original archive file is then deleted.
llama-server startup
The orchestrator spawns llama-server with --port 10086 --models-dir models/ --models-max 1 --parallel <N> --ctx-size <N> --threads 4 -ngl <ngl>. GPU-accelerated systems receive -ngl 99 to offload all layers; CPU-only systems receive -ngl 0. llama-server binds to port 10086.
HTTP proxy on port 8080
A Node.js HTTP proxy listens on port 8080 and forwards requests to llama-server on port 10086. The proxy rewrites model names in request bodies so that friendly identifiers (e.g. mistral-7b) are transparently mapped to the actual GGUF filename that llama-server expects. It also intercepts GET /v1/models and returns the discovered model list directly.
Binary Selection Logic
The table below shows which release asset is downloaded for each platform and GPU combination. Windows CUDA builds also download a secondary cudart-* asset that provides the CUDA runtime DLLs.
| Platform | GPU | Asset pattern |
|---|---|---|
| Windows | CUDA | llama-*-bin-win-cuda-12.4*.zip + cudart-* secondary |
| Windows | Vulkan | llama-*-bin-win-vulkan*.zip |
| Windows | CPU | llama-*-bin-win-cpu-x64*.zip |
| macOS | ARM64 (Metal) | llama-*-bin-macos-arm64*.tar.gz |
| macOS | x64 | llama-*-bin-macos-x64*.tar.gz |
| Linux | CUDA | ai-dock/llama.cpp-cuda release, *-cuda-*amd64.tar.gz |
| Linux | Vulkan | llama-*-bin-ubuntu-vulkan-x64*.tar.gz |
| Linux | CPU ARM64 | llama-*-bin-ubuntu-arm64*.tar.gz |
| Linux | CPU x64 | llama-*-bin-ubuntu-x64*.tar.gz |
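
The table above can be read as a small lookup function. The following is a minimal sketch, not the actual getLlamaCppAssets implementation: the function names, the shape of the `hw` object, and the sample asset filename in the usage note are assumptions made for illustration. Only the repositories and asset patterns come from this page.

```javascript
// Hypothetical sketch of the selection table. `hw` is assumed to look like
// { platform: "win32" | "darwin" | "linux", arch: "x64" | "arm64",
//   gpu: "cuda" | "vulkan" | "metal" | "cpu" }.
const UPSTREAM_REPO = "ggml-org/llama.cpp";
const LINUX_CUDA_REPO = "ai-dock/llama.cpp-cuda";

function selectAssetPattern(hw) {
  const { platform, arch, gpu } = hw;

  if (platform === "win32") {
    if (gpu === "cuda") {
      // Windows CUDA also needs the secondary CUDA runtime asset.
      return { repo: UPSTREAM_REPO, pattern: "llama-*-bin-win-cuda-12.4*.zip", secondary: "cudart-*" };
    }
    if (gpu === "vulkan") return { repo: UPSTREAM_REPO, pattern: "llama-*-bin-win-vulkan*.zip" };
    return { repo: UPSTREAM_REPO, pattern: "llama-*-bin-win-cpu-x64*.zip" };
  }

  if (platform === "darwin") {
    const pattern = arch === "arm64" ? "llama-*-bin-macos-arm64*.tar.gz" : "llama-*-bin-macos-x64*.tar.gz";
    return { repo: UPSTREAM_REPO, pattern };
  }

  if (platform === "linux") {
    if (gpu === "cuda") return { repo: LINUX_CUDA_REPO, pattern: "*-cuda-*amd64.tar.gz" };
    if (gpu === "vulkan") return { repo: UPSTREAM_REPO, pattern: "llama-*-bin-ubuntu-vulkan-x64*.tar.gz" };
    const pattern = arch === "arm64" ? "llama-*-bin-ubuntu-arm64*.tar.gz" : "llama-*-bin-ubuntu-x64*.tar.gz";
    return { repo: UPSTREAM_REPO, pattern };
  }

  throw new Error(`Unsupported platform: ${platform}`);
}

// Converts a simple "*" glob into an anchored regular expression.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}
```

A release asset name can then be checked against the chosen pattern, for example `globToRegExp(selectAssetPattern(hw).pattern).test("llama-b1234-bin-win-vulkan-x64.zip")`, where the filename is a made-up example rather than a real release asset.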
Context Window Auto-Scaling
One of the most important features of the llama.cpp backend is its automatic context window management. Rather than crashing when VRAM is exhausted, it estimates the largest context that will fit in available memory, then steps down gracefully if a request still triggers an OOM error at runtime.
The context ladder
The backend tries context sizes in descending order until one fits.
Auto-calculation (chooseAutoContext)
The chooseAutoContext function in src/backends/llama/index.js estimates the right starting point on the ladder before llama-server is even started:
- Available memory: for GPU backends, uses gpuFreeMemoryGB from nvidia-smi if available, falls back to gpuMemoryGB, then to Math.max(4, floor(ramGB × 0.55)). For CPU, uses ramGB directly. Overhead (2.5 GB for GPU, 2 GB for CPU) and the largest model size are then subtracted to arrive at usableGB.
- Model size: scans models/ for .gguf files and finds the largest one.
- KV cache estimate: applies a GB-per-4096-tokens factor based on model size.
- Parallel slots: divides the estimated max context by the number of parallel slots.
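
The four inputs above combine into a single estimate. The sketch below is an illustration under stated assumptions, not the real chooseAutoContext: the ladder values and the KV cache factors are hypothetical placeholders (this page does not list them), and the function and field names other than gpuFreeMemoryGB, gpuMemoryGB, and ramGB are invented for the example.

```javascript
// Hypothetical ladder and KV factors, chosen only to make the sketch runnable.
const CONTEXT_LADDER = [32768, 16384, 8192, 4096, 2048];

// Assumed GB of KV cache per 4096 tokens; larger models are given a larger factor.
function kvGBPer4096Tokens(modelGB) {
  if (modelGB >= 20) return 1.0;
  if (modelGB >= 8) return 0.5;
  return 0.25;
}

// Memory source order as described in the list above.
function availableMemoryGB(hw) {
  if (hw.gpu === "cpu") return hw.ramGB;
  if (hw.gpuFreeMemoryGB != null) return hw.gpuFreeMemoryGB;
  if (hw.gpuMemoryGB != null) return hw.gpuMemoryGB;
  return Math.max(4, Math.floor(hw.ramGB * 0.55));
}

function chooseAutoContextSketch(hw, largestModelGB, parallel = 1) {
  const overheadGB = hw.gpu === "cpu" ? 2 : 2.5;
  const usableGB = availableMemoryGB(hw) - overheadGB - largestModelGB;
  const smallest = CONTEXT_LADDER[CONTEXT_LADDER.length - 1];
  if (usableGB <= 0) return smallest;

  // Tokens that fit in usableGB, then split across the parallel slots.
  const maxTokens = Math.floor((usableGB / kvGBPer4096Tokens(largestModelGB)) * 4096);
  const perSlot = Math.floor(maxTokens / parallel);

  // The ladder is in descending order, so the first rung that fits is the largest.
  return CONTEXT_LADDER.find((ctx) => ctx <= perSlot) ?? smallest;
}
```

With these placeholder numbers, a GPU reporting 10 GB free and a 4 GB model leaves 3.5 GB usable, which lands on the top rung with one slot and on a lower rung with four slots. The real thresholds may differ.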
usableGB:
Runtime OOM recovery
If llama-server returns an HTTP 5xx response containing any of the strings "out of memory", "cuda", "failed to allocate", "failed to create context", or "failed to initialize the context", the proxy calls router.retryLowerContext(). This terminates the running llama-server, restarts it with the next smaller context on the ladder, and retries the original request exactly once.
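
The detection and single retry described above might look roughly like the following. This is a sketch, not the proxy's actual code: `forward` is a hypothetical function that sends the request to llama-server and resolves to `{ status, body }`, `router` is any object exposing retryLowerContext(), and the case-insensitive matching is an assumption.

```javascript
// Marker strings taken from the description above.
const OOM_MARKERS = [
  "out of memory",
  "cuda",
  "failed to allocate",
  "failed to create context",
  "failed to initialize the context",
];

// True only for 5xx responses whose body contains one of the markers.
// Lowercasing the body is an assumption; the real comparison may be case-sensitive.
function isOomResponse(status, bodyText) {
  if (status < 500 || status > 599) return false;
  const text = String(bodyText).toLowerCase();
  return OOM_MARKERS.some((marker) => text.includes(marker));
}

// Forwards a request and, on an OOM-style failure, lowers the context and retries once.
async function forwardWithOomRetry(forward, router, request) {
  const first = await forward(request);
  if (!isOomResponse(first.status, first.body)) return first;

  await router.retryLowerContext(); // restarts llama-server on the next smaller rung
  return forward(request); // single retry; a second failure is returned to the caller
}
```

Returning the second response as-is, even when it fails, keeps the retry bounded to one attempt per request.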
Overriding context size
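
ODYSSEUS_LLAMA_CTX forces a specific context size instead of the auto-calculated one. A minimal sketch of how such an override could be resolved is shown here; the helper name is hypothetical, and treating an unset, empty, "auto", or non-numeric value as "use the automatic value" is an assumption rather than documented behavior.

```javascript
// Hypothetical helper: returns the forced context size if ODYSSEUS_LLAMA_CTX
// holds a positive integer, otherwise falls back to the auto-calculated value.
function resolveContextSize(env, autoContext) {
  const raw = env.ODYSSEUS_LLAMA_CTX;
  if (raw === undefined || raw === "" || raw === "auto") return autoContext;

  // Accept plain decimal digits only, so values like "8k" are rejected.
  if (!/^\d+$/.test(raw)) return autoContext;

  const parsed = Number.parseInt(raw, 10);
  return parsed > 0 ? parsed : autoContext;
}
```

In a Node.js process this would be called as `resolveContextSize(process.env, autoContext)`, with the variable set in the environment before Odysseus is launched.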
Parallel Request Slots
By default, llama-server is started with a single parallel slot (--parallel 1). Increasing parallel slots allows multiple in-flight requests to be processed simultaneously, at the cost of dividing the available context window across slots.
llama-server Arguments
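startRouter in src/backends/llama/index.js spawns llama-server with the arguments listed under llama-server startup: --port 10086 --models-dir models/ --models-max 1 --parallel <parallel> --ctx-size <ctxSize> --threads 4 -ngl <ngl>.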
ngl is 99 for GPU-accelerated systems (CUDA, Vulkan, Metal) and 0 for CPU-only. parallel and ctxSize are determined by ODYSSEUS_LLAMA_PARALLEL / ODYSSEUS_LLAMA_CTX or auto-calculated as described above.
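
As a rough illustration, the argument list could be assembled like this. The flags and fixed values are the ones given on this page; the function name and the shape of its input are assumptions, and this is not a copy of startRouter.

```javascript
// Hypothetical sketch: builds the llama-server argument array from the
// detected GPU backend and the resolved parallel/context values.
function buildLlamaServerArgs({ gpu, parallel, ctxSize }) {
  // Offload all layers on GPU backends, none on CPU-only systems.
  const ngl = gpu === "cpu" ? 0 : 99;

  return [
    "--port", "10086",
    "--models-dir", "models/",
    "--models-max", "1",
    "--parallel", String(parallel),
    "--ctx-size", String(ctxSize),
    "--threads", "4",
    "-ngl", String(ngl),
  ];
}
```

The resulting array is suitable for passing to `child_process.spawn(binaryPath, args)`, where `binaryPath` stands in for the cached llama-server binary.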
Platform-Specific Setup
macOS — Gatekeeper quarantine removal
Binaries downloaded from the internet receive a com.apple.quarantine extended attribute that prevents them from running. The orchestrator automatically strips this attribute immediately after extraction. No manual intervention is needed. If the removal command fails (e.g. on a read-only filesystem), a warning is logged but the launch continues.
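
One way to perform this step from Node.js is to call the macOS xattr tool on the extracted directory. The sketch below is an assumption about how it could be done, not the orchestrator's actual code: the function names are hypothetical, and the exact command and flags used by Odysseus are not shown on this page.

```javascript
const { execFile } = require("child_process");

// Builds the command that recursively deletes the quarantine attribute.
function quarantineRemovalCommand(binDir) {
  return { file: "xattr", args: ["-dr", "com.apple.quarantine", binDir] };
}

// Runs the command; on failure it logs a warning and resolves to false
// instead of rejecting, so the launch can continue.
function stripQuarantine(binDir, run = execFile) {
  return new Promise((resolve) => {
    const { file, args } = quarantineRemovalCommand(binDir);
    run(file, args, (err) => {
      if (err) console.warn(`Could not remove quarantine attribute: ${err.message}`);
      resolve(!err);
    });
  });
}
```

The `run` parameter exists only so the function can be exercised without macOS; in normal use the default `execFile` is enough.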
Linux CUDA — runtime library resolution
Some Linux systems do not have libcudart.so.12 installed globally. If the file is absent from the binary directory after extraction, the orchestrator automatically installs the required libraries via uv pip. The relevant .so files are then copied from the installed Python packages into the llama binary directory so llama-server can find them at runtime via LD_LIBRARY_PATH.
Environment Variable Reference
| Variable | Default | Description |
|---|---|---|
| ODYSSEUS_LLAMA_CTX | auto | Force a specific context window size (tokens). |
| ODYSSEUS_LLAMA_PARALLEL | 1 | Number of parallel inference slots. |
| ODYSSEUS_BACKEND | from config | Set to llama to select this backend. |