This tutorial covers two ways to run local inference with OpenShell: using Ollama or using LM Studio. Both approaches expose a local model backend through inference.local so that agents inside a sandbox can make inference requests without reaching external APIs.
Ollama offers two approaches: a self-contained community sandbox with Ollama pre-installed, or routing sandbox inference to a host-level Ollama instance shared across multiple sandboxes.

Prerequisites

  • A working OpenShell installation. Complete the Quickstart before proceeding.
Option A: Ollama community sandbox

The Ollama community sandbox bundles Ollama, Claude Code, OpenCode, and Codex into a single image. Ollama starts automatically when the sandbox launches.
1. Create the sandbox

$ openshell sandbox create --from ollama
This pulls the community sandbox image, applies the bundled policy, and drops you into a shell with Ollama running.
2. Run a model

Chat with a local model:
$ ollama run qwen3.5
Or run a cloud-hosted model (no local GPU required):
$ ollama run kimi-k2.5:cloud
To start a coding agent with Ollama as the model backend, use ollama launch:
$ ollama launch claude
$ ollama launch codex
$ ollama launch opencode
For CI/CD and automated workflows, ollama launch supports a headless mode:
$ ollama launch claude --yes --model qwen3.5
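In a CI job, the headless invocation above can be wrapped in a small guard. The sketch below is illustrative only: launch_headless and the DRY_RUN variable are hypothetical names, and it assumes that --yes --model behaves the same for codex and opencode as in the claude example, which this page does not confirm.

```shell
# Hypothetical CI helper (not part of Ollama or OpenShell).
# Usage: launch_headless <agent> <model>
launch_headless() {
  agent="$1"
  model="$2"

  # Only the agents named on this page are accepted.
  case "$agent" in
    claude|codex|opencode) ;;
    *) echo "unsupported agent: $agent" >&2; return 2 ;;
  esac

  if [ -z "$model" ]; then
    echo "model name is required" >&2
    return 2
  fi

  if [ -n "${DRY_RUN:-}" ]; then
    # Print the command instead of running it (handy when debugging a pipeline).
    echo "ollama launch $agent --yes --model $model"
  else
    ollama launch "$agent" --yes --model "$model"
  fi
}
```

Setting DRY_RUN=1 before the call prints the command without starting an agent, which makes the wrapper safe to exercise on a machine without Ollama.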

Model recommendations

Use case             | Model            | Notes
Smoke test           | qwen3.5:0.8b     | Fast and lightweight, good for verifying setup
Coding and reasoning | qwen3.5          | Strong tool calling support for agentic workflows
Complex tasks        | nemotron-3-super | 122B-parameter model, requires 48 GB+ VRAM
No local GPU         | qwen3.5:cloud    | Runs on Ollama’s cloud infrastructure, no ollama pull required
Cloud models use the :cloud tag suffix and do not require local hardware.
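Because cloud models are identified only by the :cloud tag suffix, a setup script can decide whether a download step is needed by inspecting the model name. A minimal sketch; needs_local_pull is a hypothetical helper name, not an Ollama command:

```shell
# Hypothetical helper: succeeds (exit 0) when the model must be downloaded
# locally, fails (exit 1) when the name carries the :cloud tag suffix.
needs_local_pull() {
  case "$1" in
    *:cloud) return 1 ;;
    *)       return 0 ;;
  esac
}

# Example: pull only when the model runs on local hardware.
# needs_local_pull "qwen3.5:0.8b" && ollama pull "qwen3.5:0.8b"
```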

Tool calling

Agentic workflows (Claude Code, Codex, OpenCode) rely on tool calling. The following models have reliable tool calling support: Qwen 3.5, Nemotron-3-Super, GLM-5, and Kimi-K2.5. Check the Ollama model library for the latest additions.

Updating Ollama

To update Ollama inside a running sandbox:
$ update-ollama
To auto-update on every sandbox start:
$ openshell sandbox create --from ollama -e OLLAMA_UPDATE=1

Option B: Host-level Ollama

Use this approach when you want a single Ollama instance on the gateway host, shared across multiple sandboxes through inference.local.
This approach uses Ollama because it is easy to install and run locally, but you can substitute other inference engines such as vLLM, SGLang, TRT-LLM, and NVIDIA NIM by changing the startup command, base URL, and model name.
1. Install and start Ollama

Install Ollama on the gateway host:
$ curl -fsSL https://ollama.com/install.sh | sh
Start Ollama on all interfaces so it is reachable from sandboxes:
$ OLLAMA_HOST=0.0.0.0:11434 ollama serve
If you see Error: listen tcp 0.0.0.0:11434: bind: address already in use, Ollama is already running as a system service. Stop it first:
$ systemctl stop ollama
$ OLLAMA_HOST=0.0.0.0:11434 ollama serve
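Before moving on, it can help to confirm that Ollama is actually answering on the port. The sketch below assumes curl is installed and probes Ollama's /api/version endpoint; ollama_is_up is a hypothetical helper, and the port default is taken from the command above.

```shell
# Hypothetical helper: exits 0 if an Ollama server answers at host:port.
# Usage: ollama_is_up [host] [port]
ollama_is_up() {
  host="${1:-127.0.0.1}"
  port="${2:-11434}"
  # -f: treat HTTP errors as failure, -s: no progress output,
  # --max-time: give up quickly if nothing is listening.
  curl -fs --max-time 3 "http://${host}:${port}/api/version" >/dev/null 2>&1
}

# Example:
# ollama_is_up && echo "Ollama is running" || echo "Ollama is not reachable"
```

Passing the host's LAN IP instead of the loopback default also checks that the server is bound beyond 127.0.0.1.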
2. Pull a model

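In a second terminal, download and load a model. ollama run pulls the model if it is not already present, then opens an interactive session:
$ ollama run qwen3.5:0.8b
Type /bye to exit the interactive session. The model stays loaded.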
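To check from a script that the model is present locally, the output of ollama list can be filtered for the model name. This sketch assumes the first whitespace-separated column of ollama list is the model name including its tag, and that the first row is a header; has_model is a hypothetical helper that reads that listing on standard input.

```shell
# Hypothetical helper: reads an `ollama list`-style table on stdin and
# exits 0 if the first column of any non-header row equals the given name.
has_model() {
  awk -v want="$1" '
    NR > 1 && $1 == want { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Example:
# ollama list | has_model "qwen3.5:0.8b" && echo "model is available"
```

The match is exact, so the tag must be included in the name passed to the helper.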
3. Create a provider

Create an OpenAI-compatible provider pointing at the host Ollama instance:
$ openshell provider create \
    --name ollama \
    --type openai \
    --credential OPENAI_API_KEY=empty \
    --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1
OpenShell injects host.openshell.internal so sandboxes and the gateway can reach the host machine. You can also use the host’s LAN IP.
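The troubleshooting guidance on this page rules out localhost and 127.0.0.1 for the base URL, so a quick sanity check before creating the provider can save a debugging round trip. In this sketch, check_base_url is a hypothetical helper, and the requirement that the URL end in /v1 is an assumption based on the Ollama example above; other engines may use a different path.

```shell
# Hypothetical helper: exits 0 if the URL looks usable as OPENAI_BASE_URL
# for a host-level Ollama, non-zero (with a message on stderr) otherwise.
check_base_url() {
  url="$1"
  rest="${url#*://}"        # drop the scheme
  hostport="${rest%%/*}"    # keep host[:port]
  host="${hostport%%:*}"    # drop the port

  case "$host" in
    localhost|127.0.0.1)
      echo "use host.openshell.internal or the host's LAN IP, not $host" >&2
      return 1 ;;
  esac

  # Assumption: the OpenAI-compatible API is served under /v1.
  case "$url" in
    */v1|*/v1/) return 0 ;;
    *) echo "base URL should end in /v1" >&2; return 1 ;;
  esac
}

# Example:
# check_base_url "http://host.openshell.internal:11434/v1" && echo "looks fine"
```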
4. Set inference routing

$ openshell inference set --provider ollama --model qwen3.5:0.8b
Confirm the saved config:
$ openshell inference get
5. Verify from a sandbox

$ openshell sandbox create -- \
    curl https://inference.local/v1/chat/completions \
    --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'
The response should be JSON from the model.
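For repeated checks, the request body can be built by a small function rather than typed inline. The sketch below mirrors the curl call above; chat_payload and ask_local are hypothetical names, the escaping covers only backslashes and double quotes (not newlines or control characters), and ask_local is meant to run inside a sandbox, where inference.local is available.

```shell
# Hypothetical helper: prints a minimal chat-completions request body.
# Usage: chat_payload <prompt> [max_tokens]
chat_payload() {
  # Escape backslashes first, then double quotes, so the prompt stays valid JSON.
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d}\n' \
    "$escaped" "${2:-10}"
}

# Hypothetical helper: sends the prompt to the sandbox-local endpoint.
# Needs a curl version that supports --json, as the command above does.
ask_local() {
  curl -sS https://inference.local/v1/chat/completions \
    --json "$(chat_payload "$1" "${2:-10}")"
}

# Example (inside a sandbox):
# ask_local "hello" 10
```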

Troubleshooting

Problem                           | Fix
Ollama not reachable from sandbox | Ollama must be bound to 0.0.0.0, not 127.0.0.1. The community sandbox handles this automatically.
Wrong OPENAI_BASE_URL             | Use http://host.openshell.internal:11434/v1, not localhost or 127.0.0.1.
Model not found                   | Run ollama ps to confirm the model is loaded. Run ollama pull <model> if needed.
HTTPS vs HTTP                     | Code inside sandboxes must call https://inference.local, not http://.
AMD GPU driver issues             | Ollama v0.18+ requires ROCm 7 drivers for AMD GPUs. Update your drivers if you see GPU detection failures.
Useful commands for inspecting the current configuration:
$ openshell status
$ openshell inference get
$ openshell provider get ollama

GPU support for local inference

Both Ollama and LM Studio can use local GPU resources:
  • NVIDIA GPUs: Both tools support CUDA automatically when the appropriate drivers are installed. No additional configuration is required in OpenShell.
  • AMD GPUs: Ollama v0.18+ requires ROCm 7 drivers. LM Studio uses ROCm automatically on supported hardware.
  • Apple Silicon: Both tools use Metal for hardware acceleration on M-series Macs.
  • CPU fallback: If no GPU is detected, inference runs on CPU. For most coding assistant workloads, a small quantized model (such as qwen3.5:0.8b) runs acceptably on CPU.
GPU resources are available to Ollama and LM Studio running on the gateway host. Sandboxes themselves do not have direct GPU access; inference is routed from the sandbox through inference.local to the host-side backend.
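To see which of the cases above applies to a given host, a rough check can look for vendor tools on the PATH. This is only a heuristic sketch: detect_accelerator is a hypothetical helper, the presence of nvidia-smi or rocminfo is assumed to indicate installed NVIDIA or ROCm drivers, and Ollama and LM Studio perform their own detection independently of it.

```shell
# Hypothetical helper: prints one of nvidia, amd, apple, cpu.
detect_accelerator() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia"
  elif command -v rocminfo >/dev/null 2>&1; then
    echo "amd"
  elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    # Apple Silicon Macs report Darwin/arm64.
    echo "apple"
  else
    echo "cpu"
  fi
}

# Example:
# echo "Detected accelerator: $(detect_accelerator)"
```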

What’s next

  • Managed inference: Learn how OpenShell routes inference requests and manages provider configuration.
  • Configure inference backends: Configure vLLM, SGLang, TRT-LLM, NVIDIA NIM, or any other OpenAI-compatible backend.
  • Community sandboxes: Explore pre-built sandbox images for common development workflows.
  • LM Studio CLI docs: Learn more about the lms CLI for headless LM Studio usage.
