Advanced Usage

llmfit provides advanced flags and modes for edge cases, automation, and cluster deployments.

GPU Memory Override

GPU VRAM autodetection can fail on some systems (broken nvidia-smi, VMs, passthrough setups, remote GPUs). Use --memory to manually specify your GPU’s VRAM.

# Override with 32 GB VRAM
llmfit --memory=32G

# Megabytes also work (32000 MB ≈ 31.25 GB)
llmfit --memory=32000M

# Terabytes for large systems
llmfit --memory=1.5T

Accepted suffixes (case-insensitive):

G / GB / GiB: Gigabytes
M / MB / MiB: Megabytes
T / TB / TiB: Terabytes
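To make the suffix rules concrete, here is a minimal Python sketch of how such size strings could be normalized to gigabytes. It is an illustration only, not llmfit's actual parser: the function name parse_memory_gb is hypothetical, the 1024-based conversion is an assumption taken from the "32000 MB ≈ 31.25 GB" example above, and whether llmfit accepts a bare number without a suffix is not covered here.

```python
import re

# Multipliers to gigabytes. Assumption: 1024-based steps, which matches the
# "32000 MB ~ 31.25 GB" example in the text.
_UNIT_TO_GB = {"m": 1 / 1024, "g": 1.0, "t": 1024.0}

# A number, then M/G/T, optionally followed by "B" or "iB" (any letter case).
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([mgt])(?:i?b)?\s*$", re.IGNORECASE)


def parse_memory_gb(text: str) -> float:
    """Convert a size string such as '32G', '32000MB' or '1.5TiB' to gigabytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized memory size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_GB[unit.lower()]


if __name__ == "__main__":
    for example in ("32G", "32000M", "1.5T"):
        print(example, "->", parse_memory_gb(example), "GB")
```

A sketch like this can be handy in wrapper scripts that validate a --memory value before passing it to llmfit.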

Behavior:

If no GPU was detected, --memory creates a synthetic GPU entry so models are scored for GPU inference
If a GPU was detected but VRAM is unknown or wrong, --memory overrides the detected value
Works with all modes: TUI, CLI, subcommands, and serve

Examples:

# TUI with override
llmfit --memory=24G

# CLI fit table
llmfit --memory=24G --cli

# Subcommands
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G system
llmfit --memory=24G info "Llama-3.1-70B"
llmfit --memory=24G recommend --json

# Serve mode
llmfit --memory=24G serve --host 0.0.0.0 --port 8787

Use cases:

VMs / passthrough: GPU is present but not directly visible to OS
Broken nvidia-smi: nvidia-smi reports incorrect VRAM or fails
Remote GPUs: Planning for a GPU you don’t have locally
Multi-GPU: Override with aggregate VRAM (e.g., 2x 24GB = 48GB)

--memory overrides VRAM only. It does not affect system RAM or CPU detection.

Context Length Cap

Use --max-context to cap the context length used for memory estimation. This does not change each model’s advertised maximum context — it only affects how much memory llmfit assumes the model will use.

# Cap context at 4K tokens
llmfit --max-context 4096 --cli

# Cap at 8K (good for most chat workloads)
llmfit --max-context 8192

# Cap at 16K (long documents, code analysis)
llmfit --max-context 16384

Why cap context?

Reduce memory usage: Longer context = more memory for KV cache
Realistic workloads: You may not need a model’s full 128k context window
Fit more models: Capping context can promote a model from “Marginal” to “Good” fit

Memory impact: KV cache size grows linearly with context length. As a rough rule of thumb, consistent with the example figures below:

KV cache memory ≈ (context_length / 1000) * 0.0025 GB per 1B params

Example for Llama-3.1-70B:

4K context: ~0.7 GB KV cache
8K context: ~1.4 GB KV cache
128K context: ~22.4 GB KV cache
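The three example figures above scale at about 0.175 GB per 1K tokens for a 70B model, which is roughly 0.0025 GB per 1K tokens per 1B parameters. The sketch below reproduces them under that assumption. The coefficient is derived from the example figures only, 1K is taken as 1,024 tokens, and the function name kv_cache_gb is hypothetical. This is not llmfit's actual estimator.

```python
def kv_cache_gb(context_length: int, params_b: float,
                gb_per_1k_per_b: float = 0.0025) -> float:
    """Rough KV cache estimate in GB, linear in context length and model size.

    gb_per_1k_per_b is an assumed coefficient back-derived from the
    Llama-3.1-70B example figures; real usage depends on the architecture.
    """
    return (context_length / 1024) * gb_per_1k_per_b * params_b


if __name__ == "__main__":
    for ctx in (4096, 8192, 131072):
        print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx, 70):.1f} GB")
```

Because the estimate is linear, halving the context cap halves the estimated KV cache, which is why a cap can move a model into a better fit category.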

Fallback: If --max-context is not set, llmfit checks the OLLAMA_CONTEXT_LENGTH environment variable:

OLLAMA_CONTEXT_LENGTH=8192 llmfit

This is convenient if you use Ollama and have already configured your context length via OLLAMA_CONTEXT_LENGTH.

Examples:

# TUI with 8K context cap
llmfit --max-context 8192

# CLI fit table
llmfit --max-context 8192 fit --perfect -n 5

# Recommendations
llmfit --max-context 4096 recommend --json --limit 5

# Serve mode (all API responses use capped context)
llmfit --max-context 8192 serve --host 0.0.0.0 --port 8787

API per-request override: In serve mode, you can override the context cap on a per-request basis with the max_context query parameter:

curl "http://localhost:8787/api/v1/models?max_context=16384&limit=10"

Remote Ollama

By default, llmfit connects to Ollama at http://localhost:11434. To connect to a remote Ollama instance, set the OLLAMA_HOST environment variable.

# Connect to Ollama on a specific IP and port
OLLAMA_HOST="http://192.168.1.100:11434" llmfit

# Connect via hostname
OLLAMA_HOST="http://ollama-server:666" llmfit

# Works with all TUI and CLI commands
OLLAMA_HOST="http://192.168.1.100:11434" llmfit --cli
OLLAMA_HOST="http://192.168.1.100:11434" llmfit fit --perfect -n 5

Use cases:

GPU server + laptop client: Run llmfit on your laptop while Ollama serves from a GPU server
Docker containers: Connect to Ollama running in a Docker container with custom ports
Reverse proxies: Use Ollama behind a reverse proxy or load balancer

How it works: llmfit makes HTTP requests to:

GET $OLLAMA_HOST/api/tags — List installed models
POST $OLLAMA_HOST/api/pull — Download models
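To check what the first request returns for a given host, a small Python sketch using only the standard library can query /api/tags directly. It assumes OLLAMA_HOST holds a full URL (scheme, host, port), as in the examples above, and that the response body has a top-level "models" array whose entries carry a "name" field. The function names are hypothetical and this is not llmfit's own client code.

```python
import json
import os
import urllib.request

DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def tags_url(host=None) -> str:
    """Build the /api/tags URL from an explicit host, OLLAMA_HOST, or the default."""
    base = host or os.environ.get("OLLAMA_HOST") or DEFAULT_OLLAMA_HOST
    return base.rstrip("/") + "/api/tags"


def installed_model_names(payload: dict) -> list:
    """Extract model names from a decoded /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_installed_models(host=None, timeout=5.0) -> list:
    """Fetch and parse the installed-model list (performs a network request)."""
    with urllib.request.urlopen(tags_url(host), timeout=timeout) as resp:
        return installed_model_names(json.load(resp))
```

Calling list_installed_models("http://192.168.1.100:11434") is a quick way to confirm that the remote instance is reachable before launching the TUI.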

The TUI shows install status and download progress for the remote Ollama instance.

Example workflow:

# SSH tunnel to GPU server
ssh -L 11434:localhost:11434 gpu-server

# In another terminal, run llmfit locally (connects via tunnel)
llmfit

This allows you to use llmfit’s TUI on your local machine while managing models on a remote GPU server.

Combine OLLAMA_HOST with --memory to plan models for a remote GPU:

OLLAMA_HOST="http://gpu-server:11434" llmfit --memory 80G

Serve Mode for Cluster Scheduling

The serve subcommand starts an HTTP API that exposes node-local model fit analysis. This is designed for cluster schedulers, aggregators, and remote clients that need to query hardware compatibility across multiple nodes.

# Start on default port (8787)
llmfit serve

# Bind to all interfaces
llmfit serve --host 0.0.0.0 --port 8787

# With global flags (applied to all API responses)
llmfit --memory 24G --max-context 8192 serve --host 0.0.0.0 --port 8787

Key endpoints:

GET /health: Liveness probe. Returns {"status": "ok", "node": {...}}
GET /api/v1/system: Node hardware info (CPU, RAM, GPU, backend)
GET /api/v1/models: Full fit list with filters (limit, min_fit, runtime, use_case, etc.)
GET /api/v1/models/top: Top runnable models for scheduling (conservative defaults: limit=5, min_fit=good)

See REST API Guide for full endpoint documentation, query parameters, and response schemas.

Cluster scheduling workflow:

1. Run llmfit serve on each node in your cluster.

2. From your scheduler, poll each node. Quote the URL so the shell does not treat & as the background operator:

curl "http://node1:8787/api/v1/models/top?limit=5&min_fit=good"
curl "http://node2:8787/api/v1/models/top?limit=5&min_fit=good"
curl "http://node3:8787/api/v1/models/top?limit=5&min_fit=good"

3. Aggregate the results and decide which node to schedule a model on.

4. Send the deploy command to the chosen node.

Example aggregator (Python):

import requests

nodes = ["http://node1:8787", "http://node2:8787", "http://node3:8787"]

for node_url in nodes:
    # Node hardware summary
    system = requests.get(f"{node_url}/api/v1/system", timeout=5).json()
    # Top candidates, filtered to models that fit with headroom
    top_models = requests.get(
        f"{node_url}/api/v1/models/top",
        params={"limit": 5, "min_fit": "good"},
        timeout=5,
    ).json()

    print(f"\nNode: {system['node']['name']}")
    print(f"GPU: {system['system']['gpu_name']} ({system['system']['gpu_vram_gb']} GB)")
    print("Top models:")
    for model in top_models["models"][:3]:
        print(f"  - {model['name']} (score: {model['score']:.1f}, fit: {model['fit_level']})")
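The "aggregate and decide" step of the workflow could look something like the following sketch, which picks the node reporting the highest score for a given model. It assumes each node's response has the shape used by the aggregator above (a "models" list whose entries carry "name" and "score"). The function name best_node_for and the sample data are hypothetical.

```python
def best_node_for(model_name, node_results):
    """Return (node_url, score) for the node that scores model_name highest.

    node_results maps a node URL to its decoded /api/v1/models/top response.
    Returns None when no node lists the model.
    """
    best = None
    for node_url, payload in node_results.items():
        for model in payload.get("models", []):
            if model.get("name") != model_name:
                continue
            score = model.get("score", 0.0)
            if best is None or score > best[1]:
                best = (node_url, score)
    return best


if __name__ == "__main__":
    # Made-up sample responses, for illustration only.
    sample = {
        "http://node1:8787": {"models": [{"name": "model-a", "score": 71.5}]},
        "http://node2:8787": {"models": [{"name": "model-a", "score": 88.0}]},
    }
    print(best_node_for("model-a", sample))
```

A real scheduler would likely also weigh current node load and running workloads, which this sketch leaves out.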

Conservative placement defaults: For production placement, prefer:

min_fit=good
include_too_tight=false
sort=score
limit=5..20

This ensures only models that fit with headroom are considered.
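As an illustration, the conservative parameters above can be bundled into a small URL builder so every scheduler query uses the same defaults. This sketch assumes the /api/v1/models/top endpoint accepts all four parameters as query strings (the REST API Guide is the place to confirm that). The helper name top_models_url is hypothetical, and limit=10 is just one value from the suggested 5..20 range.

```python
from urllib.parse import urlencode

# Conservative placement defaults taken from the list above.
CONSERVATIVE_PARAMS = {
    "min_fit": "good",
    "include_too_tight": "false",
    "sort": "score",
    "limit": 10,  # any value in the suggested 5..20 range
}


def top_models_url(node_url, **overrides):
    """Build a /api/v1/models/top URL with conservative defaults applied."""
    params = {**CONSERVATIVE_PARAMS, **overrides}
    return f"{node_url.rstrip('/')}/api/v1/models/top?{urlencode(params)}"


if __name__ == "__main__":
    print(top_models_url("http://node1:8787"))
    print(top_models_url("http://node1:8787", limit=5))
```

Keeping the defaults in one place makes it harder for an individual query to drop a filter by accident.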

Environment Variables

llmfit respects the following environment variables:

OLLAMA_HOST (string, default: "http://localhost:11434")

Ollama API URL. Set to connect to remote Ollama instances. Example:

OLLAMA_HOST="http://192.168.1.100:11434" llmfit

OLLAMA_CONTEXT_LENGTH (integer)

Context length fallback for memory estimation when --max-context is not set. Example:

OLLAMA_CONTEXT_LENGTH=8192 llmfit

This is useful if you use Ollama and have already configured your context length via OLLAMA_CONTEXT_LENGTH.

Priority:

--max-context flag (highest priority)
OLLAMA_CONTEXT_LENGTH environment variable
Model’s full advertised context (default, lowest priority)
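The priority order above can be sketched as a small resolver. Treat this as an illustration under stated assumptions rather than llmfit's implementation: the function name effective_context is hypothetical, a non-numeric OLLAMA_CONTEXT_LENGTH is assumed to be ignored, and the cap is assumed never to raise the context above the model's advertised maximum.

```python
import os


def effective_context(model_max_context, flag_value=None, env=None):
    """Resolve the context length used for memory estimation.

    Priority: explicit --max-context value, then OLLAMA_CONTEXT_LENGTH,
    then the model's full advertised context.
    """
    env = os.environ if env is None else env
    cap = flag_value
    if cap is None:
        raw = env.get("OLLAMA_CONTEXT_LENGTH")
        # Assumption: values that are not plain integers are ignored.
        if raw is not None and raw.strip().isdigit():
            cap = int(raw)
    if cap is None:
        return model_max_context
    # Assumption: a cap can lower the context but never exceed the model's maximum.
    return min(cap, model_max_context)


if __name__ == "__main__":
    print(effective_context(131072, flag_value=4096))
    print(effective_context(131072, env={"OLLAMA_CONTEXT_LENGTH": "8192"}))
    print(effective_context(131072, env={}))
```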

Combining Flags

All global flags can be combined:

# TUI with memory override, context cap, and remote Ollama
OLLAMA_HOST="http://gpu-server:11434" llmfit --memory 80G --max-context 16384

# CLI fit table
llmfit --memory 24G --max-context 8192 fit --perfect -n 5

# Serve mode with overrides
llmfit --memory 32G --max-context 8192 serve --host 0.0.0.0 --port 8787

# Plan for a remote GPU
OLLAMA_HOST="http://gpu-server:11434" llmfit --memory 80G plan "Llama-3.1-70B" --context 32768

Advanced Workflows

1. Multi-GPU Aggregate VRAM

If you have multiple GPUs with a shared VRAM pool (e.g., NVLink), override with the total VRAM:

# 4x A100 80GB = 320GB aggregate
llmfit --memory 320G

llmfit will score models as if you have a single 320GB GPU.

2. Planning for Future Hardware

Use --memory to plan models for a GPU you don’t have yet:

# Plan for RTX 5090 (32GB VRAM, hypothetical)
llmfit --memory 32G fit --perfect -n 10

3. Workload-Specific Context Caps

Chat workload (short conversations):

llmfit --max-context 4096 recommend --use-case chat --limit 5

Code analysis (medium context):

llmfit --max-context 16384 recommend --use-case coding --limit 5

Long documents (full context):

llmfit --max-context 131072 recommend --use-case reasoning --limit 5

4. Remote Hardware Inspection

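SSH into a remote node and check its hardware without a system-wide install. The commands below install llmfit for the remote user only (the second command runs it from ~/.local/bin):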

# On local machine
ssh gpu-server 'curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local'

# Run fit analysis remotely
ssh gpu-server '~/.local/bin/llmfit --json fit -n 5' | jq '.models[] | {name, score, fit_level}'

5. Kubernetes Cluster Scheduling

Deploy llmfit as a DaemonSet on all GPU nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: llmfit-serve
spec:
  selector:
    matchLabels:
      app: llmfit
  template:
    metadata:
      labels:
        app: llmfit
    spec:
      hostNetwork: true
      containers:
      - name: llmfit
        image: ghcr.io/alexsjones/llmfit:latest
        command: ["/usr/local/bin/llmfit"]
        args: ["serve", "--host", "0.0.0.0", "--port", "8787"]
        ports:
        - containerPort: 8787
          name: http
        resources:
          # Extended resources such as GPUs must be set under limits
          limits:
            nvidia.com/gpu: 1

Then query each node’s API from your scheduler:

kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | \
  xargs -I {} curl -s "http://{}:8787/api/v1/models/top?limit=5" | jq '.models[].name'

Performance Considerations

TUI Startup Time

The TUI probes all providers (Ollama, MLX, llama.cpp) on startup. On slow networks or with many installed models, this can take 1-2 seconds. To skip provider detection, use CLI mode:

llmfit --cli  # No provider probing

API Response Time

The REST API computes fit analysis on each request. For large model databases (200+ models), this takes ~50-100ms. To reduce latency:

Use the limit parameter to reduce the result set
Use min_fit=good to exclude unrunnable models
Cache results on the client side if the hardware doesn’t change
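For the client-side caching suggestion, one possible shape is a small time-based cache wrapped around whatever function fetches the API response. The class name TTLCache and the 60-second default are hypothetical choices, and the fetch callable is supplied by the caller, so nothing here depends on a particular HTTP library.

```python
import time


class TTLCache:
    """Cache fetch(key) results for ttl_seconds to avoid repeated API calls."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: reuse the cached value
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    # Stand-in fetcher; a real one would perform the HTTP request for the URL.
    cache = TTLCache(lambda url: {"fetched": url}, ttl_seconds=300)
    print(cache.get("http://node1:8787/api/v1/models/top?limit=5"))
```

Since fit results only change when the hardware or the global flags change, a generous TTL is usually safe for a fixed node.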

Download Speed

Ollama: Controlled by Ollama daemon (typically saturates bandwidth)
llama.cpp: Direct HuggingFace download (typically faster than Ollama)
MLX: Direct HuggingFace download via mlx_lm (similar to llama.cpp)

To maximize download speed, use llama.cpp or MLX instead of Ollama.

Troubleshooting

GPU Not Detected

Symptom: TUI shows “GPU: none” even though you have a GPU.

Causes:

nvidia-smi not in PATH or not working
VM/passthrough setup where GPU is not visible to OS
AMD GPU without rocm-smi
Intel Arc without proper drivers

Solution: Use --memory to override:

llmfit --memory 24G

Wrong VRAM Amount

Symptom: TUI shows incorrect VRAM (e.g., 16GB instead of 24GB).

Causes:

nvidia-smi reporting bug
Shared memory incorrectly reported
Multi-GPU with incorrect aggregation

Solution: Use --memory to override:

llmfit --memory 24G

Models Don’t Fit as Expected

Symptom: Models you think should fit are marked “Too Tight”.

Causes:

Context length too high (KV cache uses a lot of memory)
Available RAM lower than you think (OS overhead, other processes)
Model requires more memory than you expect (MoE inactive experts, etc.)

Solution: Cap context length:

llmfit --max-context 8192

Or check actual available RAM:

llmfit system

Ollama Not Detected

Symptom: TUI shows “Ollama: ✗” even though Ollama is running.

Causes:

Ollama running on non-default port
Firewall blocking localhost:11434
Ollama not fully started yet

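Solution: If Ollama listens on a non-default host or port, point OLLAMA_HOST at it (replace <port> with the actual port):

OLLAMA_HOST="http://localhost:<port>" llmfit

If Ollama is still starting up, wait a few seconds and restart llmfit.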

Download Fails

Symptom: Download starts but fails with an error.

Causes:

Network error (HuggingFace unreachable)
Disk full
Ollama daemon stopped mid-download
GGUF repo not found

Solution:

Check network: curl -I https://huggingface.co
Check disk space: df -h
Restart Ollama: ollama serve
Try a different provider (Ollama vs llama.cpp)

Use llmfit --memory <size> system to verify that the override is applied correctly before running fit analysis.

All advanced flags (--memory, --max-context, OLLAMA_HOST) work in TUI, CLI, and serve modes. In serve mode, they affect all API responses.
