llmfit provides advanced flags and modes for edge cases, automation, and cluster deployments.

GPU Memory Override

GPU VRAM autodetection can fail on some systems (broken nvidia-smi, VMs, passthrough setups, remote GPUs). Use --memory to manually specify your GPU’s VRAM.
# Override with 32 GB VRAM
llmfit --memory=32G

# Megabytes also work (32000 MB ≈ 31.25 GB)
llmfit --memory=32000M

# Terabytes for large systems
llmfit --memory=1.5T
Accepted suffixes:
  • G / GB / GiB: Gigabytes (case-insensitive)
  • M / MB / MiB: Megabytes (case-insensitive)
  • T / TB / TiB: Terabytes (case-insensitive)
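To make the suffix rules concrete, here is a minimal Python sketch of how such size strings could be normalized to gigabytes. It is an illustration only, not llmfit's actual parser: the function name `parse_memory_gb` is hypothetical, and the 1024-based scaling is an assumption chosen because it reproduces the "32000 MB ≈ 31.25 GB" figure above.

```python
import re

# Multipliers that convert each unit into gigabytes. Binary (1024-based)
# scaling is an assumption; it matches "32000 MB ≈ 31.25 GB" from the docs.
_UNIT_TO_GB = {"m": 1 / 1024, "g": 1.0, "t": 1024.0}

# A number, then M/G/T, optionally followed by "B" or "iB" (any case).
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([mgt])(?:i?b)?\s*$", re.IGNORECASE)


def parse_memory_gb(text: str) -> float:
    """Parse strings such as '32G', '32000MB', or '1.5TiB' into gigabytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized memory size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_GB[unit.lower()]
```

A value without a suffix (for example "32") is rejected in this sketch; whether llmfit itself accepts bare numbers is not stated here.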
Behavior:
  • If no GPU was detected, --memory creates a synthetic GPU entry so models are scored for GPU inference
  • If a GPU was detected but VRAM is unknown or wrong, --memory overrides the detected value
  • Works with all modes: TUI, CLI, subcommands, and serve
Examples:
# TUI with override
llmfit --memory=24G

# CLI fit table
llmfit --memory=24G --cli

# Subcommands
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G system
llmfit --memory=24G info "Llama-3.1-70B"
llmfit --memory=24G recommend --json

# Serve mode
llmfit --memory=24G serve --host 0.0.0.0 --port 8787
Use cases:
  • VMs / passthrough: GPU is present but not directly visible to OS
  • Broken nvidia-smi: nvidia-smi reports incorrect VRAM or fails
  • Remote GPUs: Planning for a GPU you don’t have locally
  • Multi-GPU: Override with aggregate VRAM (e.g., 2x 24GB = 48GB)
--memory overrides VRAM only. It does not affect system RAM or CPU detection.

Context Length Cap

Use --max-context to cap the context length used for memory estimation. This does not change each model’s advertised maximum context — it only affects how much memory llmfit assumes the model will use.
# Cap context at 4K tokens
llmfit --max-context 4096 --cli

# Cap at 8K (good for most chat workloads)
llmfit --max-context 8192

# Cap at 16K (long documents, code analysis)
llmfit --max-context 16384
Why cap context?
  • Reduce memory usage: Longer context = more memory for KV cache
  • Realistic workloads: You may not need a model’s full 128k context window
  • Fit more models: Capping context can promote a model from “Marginal” to “Good” fit
Memory impact: KV cache size grows linearly with context length. The example figures below work out to roughly:
KV cache memory ≈ (context_length / 1000) * 0.0025 GB per 1B params
Example for Llama-3.1-70B:
  • 4K context: ~0.7 GB KV cache
  • 8K context: ~1.4 GB KV cache
  • 128K context: ~22.4 GB KV cache
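The linear scaling can be sketched in a few lines of Python. The coefficient below is inferred from the Llama-3.1-70B figures listed above (0.7 GB at 4K context) and is an assumption for illustration, not a constant taken from llmfit's source. The function names are hypothetical.

```python
# Coefficient inferred from the Llama-3.1-70B example (0.7 GB at 4K context):
# 0.7 GB / (4 thousand tokens * 70 billion params) = 0.0025.
GB_PER_KTOKEN_PER_BPARAM = 0.0025


def kv_cache_gb(context_k: float, params_b: float) -> float:
    """Estimate KV cache size in GB for `context_k` thousand tokens."""
    return context_k * params_b * GB_PER_KTOKEN_PER_BPARAM


def kv_savings_gb(full_context_k: float, capped_context_k: float, params_b: float) -> float:
    """Memory freed by capping the context, under the same linear estimate."""
    return kv_cache_gb(full_context_k, params_b) - kv_cache_gb(capped_context_k, params_b)
```

Under this estimate, capping a 70B model from 128K to 8K context frees about 21 GB, which is the kind of difference that can move a model between fit levels.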
Fallback: If --max-context is not set, llmfit checks the OLLAMA_CONTEXT_LENGTH environment variable:
OLLAMA_CONTEXT_LENGTH=8192 llmfit
This is convenient if you use Ollama and have already configured your context length via OLLAMA_CONTEXT_LENGTH.
Examples:
# TUI with 8K context cap
llmfit --max-context 8192

# CLI fit table
llmfit --max-context 8192 fit --perfect -n 5

# Recommendations
llmfit --max-context 4096 recommend --json --limit 5

# Serve mode (all API responses use capped context)
llmfit --max-context 8192 serve --host 0.0.0.0 --port 8787
API per-request override: In serve mode, you can override the context cap on a per-request basis with the max_context query parameter:
curl "http://localhost:8787/api/v1/models?max_context=16384&limit=10"
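A client that issues many such requests may prefer to build the URL programmatically. The following Python sketch assembles the same query string as the curl example; the helper name `models_url` is hypothetical, and only the `max_context` and `limit` parameters shown above are used.

```python
from typing import Optional
from urllib.parse import urlencode


def models_url(base_url: str,
               max_context: Optional[int] = None,
               limit: Optional[int] = None) -> str:
    """Build the /api/v1/models URL, adding only the parameters that are set."""
    params = {}
    if max_context is not None:
        params["max_context"] = max_context
    if limit is not None:
        params["limit"] = limit
    path = f"{base_url.rstrip('/')}/api/v1/models"
    query = urlencode(params)
    return f"{path}?{query}" if query else path
```

Omitting `max_context` leaves the server-wide cap (if any) in effect, so the same helper works for both capped and uncapped requests.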

Remote Ollama

By default, llmfit connects to Ollama at http://localhost:11434. To connect to a remote Ollama instance, set the OLLAMA_HOST environment variable.
# Connect to Ollama on a specific IP and port
OLLAMA_HOST="http://192.168.1.100:11434" llmfit

# Connect via hostname on a non-default port
OLLAMA_HOST="http://ollama-server:666" llmfit

# Works with all TUI and CLI commands
OLLAMA_HOST="http://192.168.1.100:11434" llmfit --cli
OLLAMA_HOST="http://192.168.1.100:11434" llmfit fit --perfect -n 5
Use cases:
  • GPU server + laptop client: Run llmfit on your laptop while Ollama serves from a GPU server
  • Docker containers: Connect to Ollama running in a Docker container with custom ports
  • Reverse proxies: Use Ollama behind a reverse proxy or load balancer
How it works: llmfit makes HTTP requests to:
  • GET $OLLAMA_HOST/api/tags — List installed models
  • POST $OLLAMA_HOST/api/pull — Download models
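The first of these requests can be reproduced with a short Python sketch, which is handy for checking that a remote instance is reachable before launching the TUI. It assumes OLLAMA_HOST holds a full URL including the scheme, as in the examples above, and that /api/tags returns a JSON object with a "models" array whose entries carry a "name" field. The function names are hypothetical and this is not llmfit's own client code.

```python
import json
import os
from urllib.request import urlopen

DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def ollama_base_url(env=None) -> str:
    """Resolve the Ollama base URL from OLLAMA_HOST, falling back to the default."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST).rstrip("/")


def installed_model_names(tags_payload: dict) -> list:
    """Extract sorted model names from a parsed /api/tags response."""
    return sorted(m["name"] for m in tags_payload.get("models", []))


def fetch_installed_models(timeout: float = 5.0) -> list:
    """GET $OLLAMA_HOST/api/tags and return the installed model names."""
    with urlopen(f"{ollama_base_url()}/api/tags", timeout=timeout) as resp:
        return installed_model_names(json.load(resp))
```

Calling `fetch_installed_models()` with OLLAMA_HOST set raises a `URLError` when the host is unreachable, which distinguishes a connectivity problem from an empty model list.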
The TUI shows install status and download progress for the remote Ollama instance.
Example workflow:
# SSH tunnel to GPU server
ssh -L 11434:localhost:11434 gpu-server

# In another terminal, run llmfit locally (connects via tunnel)
llmfit
This allows you to use llmfit’s TUI on your local machine while managing models on a remote GPU server.
Combine OLLAMA_HOST with --memory to plan models for a remote GPU:
OLLAMA_HOST="http://gpu-server:11434" llmfit --memory 80G

Serve Mode for Cluster Scheduling

The serve subcommand starts an HTTP API that exposes node-local model fit analysis. This is designed for cluster schedulers, aggregators, and remote clients that need to query hardware compatibility across multiple nodes.
# Start on default port (8787)
llmfit serve

# Bind to all interfaces
llmfit serve --host 0.0.0.0 --port 8787

# With global flags (applied to all API responses)
llmfit --memory 24G --max-context 8192 serve --host 0.0.0.0 --port 8787
Key endpoints:
  • GET /health: Liveness probe. Returns {"status": "ok", "node": {...}}
  • GET /api/v1/system: Node hardware info (CPU, RAM, GPU, backend)
  • GET /api/v1/models: Full fit list with filters (limit, min_fit, runtime, use_case, etc.)
  • GET /api/v1/models/top: Top runnable models for scheduling (conservative defaults: limit=5, min_fit=good)
See REST API Guide for full endpoint documentation, query parameters, and response schemas.
Cluster scheduling workflow:
  1. Run llmfit serve on each node in your cluster
  2. From your scheduler, poll each node:
    curl "http://node1:8787/api/v1/models/top?limit=5&min_fit=good"
    curl "http://node2:8787/api/v1/models/top?limit=5&min_fit=good"
    curl "http://node3:8787/api/v1/models/top?limit=5&min_fit=good"
  3. Aggregate results and decide which node to schedule a model on
  4. Send deploy command to chosen node
Example aggregator (Python):
import requests

# Base URLs of the nodes running `llmfit serve`
nodes = ["http://node1:8787", "http://node2:8787", "http://node3:8787"]

for node_url in nodes:
    # A timeout keeps one unreachable node from stalling the whole loop
    system = requests.get(f"{node_url}/api/v1/system", timeout=5).json()
    top_models = requests.get(
        f"{node_url}/api/v1/models/top",
        params={"limit": 5, "min_fit": "good"},
        timeout=5,
    ).json()

    print(f"\nNode: {system['node']['name']}")
    print(f"GPU: {system['system']['gpu_name']} ({system['system']['gpu_vram_gb']} GB)")
    print("Top models:")
    # Show the three highest-ranked models reported by this node
    for model in top_models["models"][:3]:
        print(f"  - {model['name']} (score: {model['score']:.1f}, fit: {model['fit_level']})")
Conservative placement defaults: For production placement, prefer:
  • min_fit=good
  • include_too_tight=false
  • sort=score
  • limit between 5 and 20
This ensures only models that fit with headroom are considered.
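The "aggregate and decide" step of the workflow can be sketched as a pure function over the responses collected from each node. This is one possible placement rule (highest score wins), not a policy prescribed by llmfit. The function name `best_node_for` is hypothetical, and it assumes each response has the "models" list with "name" and "score" fields used in the aggregator example, already filtered server-side with the conservative parameters above.

```python
def best_node_for(model_name, node_results):
    """Return (node_url, score) for the node that scores `model_name` highest.

    `node_results` maps a node URL to the parsed JSON body returned by that
    node's /api/v1/models/top endpoint. Returns None if no node lists the model.
    """
    best = None
    for node_url, payload in node_results.items():
        for model in payload.get("models", []):
            if model["name"] != model_name:
                continue
            # Keep the highest-scoring placement seen so far
            if best is None or model["score"] > best[1]:
                best = (node_url, model["score"])
    return best
```

Keeping the decision separate from the HTTP calls makes the rule easy to unit test and to swap for a different policy, such as preferring the least-loaded node among those that fit.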

Environment Variables

llmfit respects the following environment variables:
OLLAMA_HOST (string, default: "http://localhost:11434")
Ollama API URL. Set it to connect to a remote Ollama instance. Example:
OLLAMA_HOST="http://192.168.1.100:11434" llmfit
OLLAMA_CONTEXT_LENGTH (integer)
Context length fallback for memory estimation when --max-context is not set. Example:
OLLAMA_CONTEXT_LENGTH=8192 llmfit
This is useful if you use Ollama and have already configured your context length via OLLAMA_CONTEXT_LENGTH.
Priority:
  1. --max-context flag (highest priority)
  2. OLLAMA_CONTEXT_LENGTH environment variable
  3. Model’s full advertised context (default, lowest priority)
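The priority order can be expressed as a small Python sketch. It is illustrative rather than llmfit's implementation: the function name `effective_context` is hypothetical, and treating the resolved value as an upper bound (never exceeding the model's advertised context) is an assumption based on the word "cap".

```python
import os
from typing import Mapping, Optional


def effective_context(model_max: int,
                      flag_value: Optional[int] = None,
                      env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the context length used for memory estimation."""
    env = os.environ if env is None else env
    cap = flag_value  # 1. --max-context wins when given
    if cap is None:
        # 2. Otherwise fall back to OLLAMA_CONTEXT_LENGTH if it is a positive integer
        raw = env.get("OLLAMA_CONTEXT_LENGTH", "").strip()
        if raw.isdigit() and int(raw) > 0:
            cap = int(raw)
    if cap is None:
        return model_max  # 3. No cap: use the model's full advertised context
    # Assumption: a cap can only lower the context, never raise it
    return min(cap, model_max)
```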

Combining Flags

All global flags can be combined:
# TUI with memory override, context cap, and remote Ollama
OLLAMA_HOST="http://gpu-server:11434" llmfit --memory 80G --max-context 16384

# CLI fit table
llmfit --memory 24G --max-context 8192 fit --perfect -n 5

# Serve mode with overrides
llmfit --memory 32G --max-context 8192 serve --host 0.0.0.0 --port 8787

# Plan for a remote GPU
OLLAMA_HOST="http://gpu-server:11434" llmfit --memory 80G plan "Llama-3.1-70B" --context 32768

Advanced Workflows

1. Multi-GPU Aggregate VRAM

If you have multiple GPUs with a shared VRAM pool (e.g., NVLink), override with the total VRAM:
# 4x A100 80GB = 320GB aggregate
llmfit --memory 320G
llmfit will score models as if you have a single 320GB GPU.

2. Planning for Future Hardware

Use --memory to plan models for a GPU you don’t have yet:
# Plan for RTX 5090 (32GB VRAM, hypothetical)
llmfit --memory 32G fit --perfect -n 10

3. Workload-Specific Context Caps

Chat workload (short conversations):
llmfit --max-context 4096 recommend --use-case chat --limit 5
Code analysis (medium context):
llmfit --max-context 16384 recommend --use-case coding --limit 5
Long documents (full context):
llmfit --max-context 131072 recommend --use-case reasoning --limit 5

4. Remote Hardware Inspection

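Check a remote node’s hardware over SSH. The install script’s --local option places llmfit in the remote user’s ~/.local/bin rather than installing it system-wide:
# From the local machine: install llmfit into ~/.local/bin on the remote node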
ssh gpu-server 'curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local'

# Run fit analysis remotely
ssh gpu-server '~/.local/bin/llmfit --json fit -n 5' | jq '.models[] | {name, score, fit_level}'

5. Kubernetes Cluster Scheduling

Deploy llmfit as a DaemonSet on all GPU nodes:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: llmfit-serve
spec:
  selector:
    matchLabels:
      app: llmfit
  template:
    metadata:
      labels:
        app: llmfit
    spec:
      hostNetwork: true
      containers:
      - name: llmfit
        image: ghcr.io/alexsjones/llmfit:latest
        command: ["/usr/local/bin/llmfit"]
        args: ["serve", "--host", "0.0.0.0", "--port", "8787"]
        ports:
        - containerPort: 8787
          name: http
        resources:
          # Extended resources such as GPUs must be declared under limits
          limits:
            nvidia.com/gpu: 1
Then query each node’s API from your scheduler:
kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | \
  xargs -I {} curl -s "http://{}:8787/api/v1/models/top?limit=5" | jq '.models[].name'

Performance Considerations

TUI Startup Time

The TUI probes all providers (Ollama, MLX, llama.cpp) on startup. On slow networks or with many installed models, this can take 1-2 seconds. To skip provider detection, use CLI mode:
llmfit --cli  # No provider probing

API Response Time

The REST API computes fit analysis on each request. For large model databases (200+ models), this takes ~50-100ms. To reduce latency:
  • Use the limit parameter to reduce the result set
  • Use min_fit=good to exclude unrunnable models
  • Cache results on the client side if hardware doesn’t change
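Client-side caching can be as simple as a time-to-live wrapper around whatever function fetches the API response. The Python sketch below is one way to do it. The class name `TTLCache` and the 60-second default are arbitrary choices for illustration, and the fetch function is supplied by the caller.

```python
import time


class TTLCache:
    """Cache fetch results per URL and refetch once an entry is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # callable: url -> parsed response
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # url -> (expiry_time, value)

    def get(self, url):
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]            # still fresh: skip the network call
        value = self._fetch(url)
        self._entries[url] = (now + self._ttl, value)
        return value
```

Because fit results depend only on the node's hardware and the query parameters, keying the cache on the full URL is sufficient. Choose a short TTL if nodes may be restarted with different --memory or --max-context values.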

Download Speed

  • Ollama: Controlled by Ollama daemon (typically saturates bandwidth)
  • llama.cpp: Direct HuggingFace download (typically faster than Ollama)
  • MLX: Direct HuggingFace download via mlx_lm (similar to llama.cpp)
To maximize download speed, use llama.cpp or MLX instead of Ollama.

Troubleshooting

GPU Not Detected

Symptom: TUI shows “GPU: none” even though you have a GPU.
Causes:
  • nvidia-smi not in PATH or not working
  • VM/passthrough setup where GPU is not visible to OS
  • AMD GPU without rocm-smi
  • Intel Arc without proper drivers
Solution: Use --memory to override:
llmfit --memory 24G

Wrong VRAM Amount

Symptom: TUI shows incorrect VRAM (e.g., 16GB instead of 24GB).
Causes:
  • nvidia-smi reporting bug
  • Shared memory incorrectly reported
  • Multi-GPU with incorrect aggregation
Solution: Use --memory to override:
llmfit --memory 24G

Models Don’t Fit as Expected

Symptom: Models you think should fit are marked “Too Tight”.
Causes:
  • Context length too high (KV cache uses a lot of memory)
  • Available RAM lower than you think (OS overhead, other processes)
  • Model requires more memory than you expect (MoE inactive experts, etc.)
Solution: Cap context length:
llmfit --max-context 8192
Or check actual available RAM:
llmfit system

Ollama Not Detected

Symptom: TUI shows “Ollama: ✗” even though Ollama is running.
Causes:
  • Ollama running on non-default port
  • Firewall blocking localhost:11434
  • Ollama not fully started yet
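Solution: Set OLLAMA_HOST to the address and port Ollama is actually listening on (the value below is the default; substitute your own):
OLLAMA_HOST="http://localhost:11434" llmfit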
Or wait a few seconds and restart llmfit.

Download Fails

Symptom: Download starts but fails with an error.
Causes:
  • Network error (HuggingFace unreachable)
  • Disk full
  • Ollama daemon stopped mid-download
  • GGUF repo not found
Solution:
  1. Check network: curl -I https://huggingface.co
  2. Check disk space: df -h
  3. Restart Ollama: ollama serve
  4. Try a different provider (Ollama vs llama.cpp)
Use llmfit --memory <size> system to verify that the override is applied correctly before running fit analysis.
All advanced flags (--memory, --max-context, OLLAMA_HOST) work in TUI, CLI, and serve modes. In serve mode, they affect all API responses.
