Odysseus Portable performs hardware detection on every launch and selects the best available inference backend automatically. There is nothing to configure: if a supported GPU is present, it will be used. The orchestrator then downloads the matching llama-server binary for your GPU backend and calculates an optimal context window size based on your available VRAM or RAM. The only time you need to intervene is when you want to override the automatic context size or force a specific behavior for testing.
How GPU Detection Works
Hardware detection runs inside src/system.js before any binary is downloaded or launched. The detection follows a strict priority order.
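As a rough overview, the priority order covered in this section can be sketched as a single function. This is a minimal illustration, not the actual contents of src/system.js: the function name detectGpuBackend, the probes object, and the 'cpu' fallback for machines where no check succeeds are assumptions made for the example (the 'cpu' backend name itself is taken from the backend table on this page).

```javascript
// Hypothetical sketch of the detection priority order; not the real src/system.js.
// `probes.cuda` and `probes.vulkan` are caller-supplied functions that return
// true when the corresponding check (nvidia-smi, Vulkan loader lookup) succeeds.
function detectGpuBackend({ platform, arch, probes }) {
  // 1. Apple Silicon: Metal is selected immediately, with no probing.
  if (platform === 'darwin' && arch === 'arm64') return 'metal';

  // 2 and 3. CUDA is tried first, then Vulkan (Windows and Linux only).
  if (platform === 'win32' || platform === 'linux') {
    if (probes.cuda()) return 'cuda';
    if (probes.vulkan()) return 'vulkan';
  }

  // Assumed fallback when nothing above matched.
  return 'cpu';
}

// Example call with stub probes standing in for the real checks.
const backend = detectGpuBackend({
  platform: process.platform,
  arch: process.arch,
  probes: { cuda: () => false, vulkan: () => false },
});
console.log(`Selected backend: ${backend}`);
```

Passing the probes in as functions keeps the ordering logic separate from the system calls, so the Vulkan probe is never run when the CUDA probe has already succeeded.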
Apple Silicon Metal (macOS ARM64)
If the current platform is darwin and the CPU architecture is arm64, Metal acceleration is selected immediately without any additional probing. All Apple Silicon Macs (M1, M2, M3, M4 and later) qualify.
NVIDIA CUDA (Windows and Linux)
The orchestrator runs nvidia-smi with the --query-gpu flag to retrieve GPU information. If the command succeeds, gpuBackend is set to 'cuda'. The GPU name, total VRAM (in MiB), and free VRAM are all captured and used for context window calculations.
Vulkan (Windows and Linux)
When CUDA is not detected, the orchestrator checks for Vulkan support:
- Windows: Looks for vulkan-1.dll in %windir%\System32. Present on virtually all modern Windows systems with a discrete or integrated GPU.
- Linux: Checks for vulkaninfo in PATH, then falls back to checking these library paths:
  - /usr/lib/x86_64-linux-gnu/libvulkan.so.1
  - /usr/lib/libvulkan.so.1
  - /usr/lib64/libvulkan.so.1
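The two checks above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's code: the function name hasVulkan, the injectable exists parameter, the C:\Windows default when windir is unset, and the use of a plain file-existence test for vulkaninfo are all hypothetical choices made for the example.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Library paths listed above for the Linux fallback check.
const LINUX_VULKAN_LIBS = [
  '/usr/lib/x86_64-linux-gnu/libvulkan.so.1',
  '/usr/lib/libvulkan.so.1',
  '/usr/lib64/libvulkan.so.1',
];

// Hypothetical helper: returns true when a Vulkan loader appears to be present.
// `exists` defaults to fs.existsSync and can be swapped out for testing.
function hasVulkan({ platform, env = process.env, exists = fs.existsSync }) {
  if (platform === 'win32') {
    // Assumption: fall back to C:\Windows if the windir variable is missing.
    const windir = env.windir || env.WINDIR || 'C:\\Windows';
    return exists(path.win32.join(windir, 'System32', 'vulkan-1.dll'));
  }

  if (platform === 'linux') {
    // First look for a vulkaninfo file in any PATH directory.
    const onPath = (env.PATH || '')
      .split(':')
      .filter(Boolean)
      .some((dir) => exists(path.posix.join(dir, 'vulkaninfo')));
    // Then fall back to the known loader library locations.
    return onPath || LINUX_VULKAN_LIBS.some((lib) => exists(lib));
  }

  return false;
}

console.log(hasVulkan({ platform: process.platform }));
```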
If either check succeeds, gpuBackend is set to 'vulkan'.
GPU Detection Effects on Inference
The detected GPU backend directly controls two things: which llama-server binary is downloaded, and how many model layers are offloaded to the GPU.
| GPU Backend | ngl (GPU layers) | Effect |
|---|---|---|
| metal, cuda, vulkan | 99 (all layers) | Full GPU acceleration, fastest inference |
| cpu | 0 (no offload) | All computation on CPU, slower but functional |
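The mapping in this table is small enough to express directly. The sketch below assumes hypothetical helper names (gpuLayersFor, gpuLayerArgs); how the orchestrator actually assembles its llama-server command line is not shown on this page. The -ngl flag itself is a standard llama-server option for the number of GPU layers.

```javascript
// Backends that offload layers to the GPU, per the table above.
const GPU_BACKENDS = new Set(['metal', 'cuda', 'vulkan']);

// Hypothetical helper: returns the ngl value for a detected backend.
function gpuLayersFor(backend) {
  return GPU_BACKENDS.has(backend) ? 99 : 0;
}

// Hypothetical helper: builds the matching llama-server argument pair.
function gpuLayerArgs(backend) {
  return ['-ngl', String(gpuLayersFor(backend))];
}

console.log(gpuLayerArgs('vulkan')); // GPU backend: all layers requested
console.log(gpuLayerArgs('cpu'));    // CPU backend: no offload
```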
The ngl 99 value does not mean exactly 99 layers are loaded; it is a sentinel value that tells llama-server to offload as many layers as possible. The actual number is capped by the model's layer count and available VRAM.
Automatic Context Window Sizing
The llama backend (src/backends/llama/index.js) calculates the best context window size from your available memory before starting llama-server. This avoids OOM crashes on first load and maximises the context you can actually use.
The core formula relies on a KV cache factor (kv_factor) that scales with model size to reflect real-world memory usage:
| Largest model on drive | KV factor (GB per 4096 tokens) |
|---|---|
| ≥ 5.5 GB | 1.05 |
| ≥ 4.0 GB | 0.85 |
| < 4.0 GB | 0.55 |
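The thresholds in this table translate into a simple lookup. The sketch below is illustrative only: the function names are hypothetical, and the linear scaling in estimateKvCacheGb is an assumption read off the table's unit (GB per 4096 tokens), not the backend's full context formula.

```javascript
// Hypothetical lookup mirroring the KV factor table above.
// `largestModelGb` is the size in GB of the largest model on the drive.
function kvFactorFor(largestModelGb) {
  if (largestModelGb >= 5.5) return 1.05;
  if (largestModelGb >= 4.0) return 0.85;
  return 0.55;
}

// Assumption: KV cache memory grows linearly with context length,
// at kv_factor GB for every 4096 tokens.
function estimateKvCacheGb(contextTokens, largestModelGb) {
  return (kvFactorFor(largestModelGb) * contextTokens) / 4096;
}

// Example: rough KV cache estimate for an 8192-token context and a 6 GB model.
console.log(estimateKvCacheGb(8192, 6).toFixed(2), 'GB');
```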
Platform-Specific Notes
- NVIDIA (CUDA)
- AMD / Intel (Vulkan)
- Apple Silicon (Metal)
- CPU Only
Supported on: Windows and Linux x64/ARM64
The orchestrator downloads a CUDA-accelerated llama-server binary from the official llama.cpp releases (targeting CUDA 12.4). On Linux, if the CUDA runtime libraries are not already installed, the orchestrator automatically fetches them via uv pip install:
- nvidia-cuda-runtime-cu12
- nvidia-cublas-cu12
- nvidia-nccl-cu12
- nvidia-cuda-nvrtc-cu12
The libraries are placed in the bin/llama-linux-x64/ directory so no system-level CUDA installation is required.
Your NVIDIA display driver must be installed and up to date on the host machine. The driver is not bundled with Odysseus Portable. Download it from nvidia.com/drivers.
Overriding the Context Window
If the automatic context calculation gives a suboptimal result, for example because you have recently freed VRAM by closing other applications, or if you simply want a specific context size, you can override it with the ODYSSEUS_LLAMA_CTX environment variable.
You can also set ODYSSEUS_LLAMA_PARALLEL to change the number of parallel inference slots. Each additional slot multiplies the KV cache memory requirement, so lower the context if you increase parallelism.
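One way such overrides could be resolved is sketched below. Only the two environment variable names come from this page; the function name resolveLlamaSettings, the shape of the defaults object, and the choice to ignore values that are not positive integers are assumptions for illustration, and the real backend may validate differently.

```javascript
// Hypothetical parser: returns a positive integer, or undefined if the
// value is missing or does not start with a positive whole number.
function parsePositiveInt(value) {
  const n = Number.parseInt(value ?? '', 10);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Hypothetical resolver: environment overrides win over automatic values.
// `defaults` holds the automatically calculated context size and slot count.
function resolveLlamaSettings(env, defaults) {
  return {
    ctx: parsePositiveInt(env.ODYSSEUS_LLAMA_CTX) ?? defaults.ctx,
    parallel: parsePositiveInt(env.ODYSSEUS_LLAMA_PARALLEL) ?? defaults.parallel,
  };
}

// Example: force an 8192-token context while keeping the default slot count.
const settings = resolveLlamaSettings(
  { ODYSSEUS_LLAMA_CTX: '8192' },
  { ctx: 4096, parallel: 1 },
);
console.log(settings);
```

In a real launch the first argument would be process.env, with the variables set in the shell before starting Odysseus Portable.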