Ollama is a lightweight runtime that downloads quantised open-weight models and serves them through a local REST API on portDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/NirDiamant/agents-towards-production/llms.txt
Use this file to discover all available pages before exploring further.
11434. Because everything runs on your hardware, no data leaves the machine — making it ideal for sensitive workloads, air-gapped environments, and cost-conscious deployments where you want to avoid per-token charges.
Data sovereignty
Model weights and all inference stay on your hardware. Nothing is sent to external services.
Predictable costs
No per-token fees. You pay for hardware utilisation only, regardless of request volume.
Drop-in replacement
Swap
ChatOpenAI for ChatOllama in LangChain with one line. The rest of your agent code stays unchanged.Prerequisites
Before you start, confirm your machine meets these minimum requirements:| Resource | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB+ |
| Free storage | 10 GB | 20 GB+ |
| CPU | Any modern x64 / ARM64 | — |
| GPU | Optional | NVIDIA, AMD, or Apple Silicon |
Install Ollama
- macOS / Linux
- Windows
- Docker
Pull a model and start the server
Pull model weights
ollama pull fetches the quantised .gguf weights and caches them locally. Browse all available models at ollama.com/library.Start the Ollama daemon
On Windows, Ollama starts automatically after installation. If you see “only one usage of each socket address is permitted”, the daemon is already running — skip this step.
Call the API from Python
Ollama exposes a standard REST API that you can call with plainrequests or use through the LangChain ChatOllama wrapper.
Replace OpenAI API calls
Replace LangChain models
Tune model behaviour with API parameters
Every request accepts a set of optional parameters that control how the model generates text.Essential parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
model | string | — | Required. Model identifier, e.g. "llama3.1:8b". |
messages | array | — | Chat history as {role, content} objects. |
stream | boolean | true | Stream tokens as they are generated. Use false to wait for the full response. |
temperature | float | 0.8 | Controls randomness. 0.0 is deterministic; 2.0 is highly random. |
top_p | float | 0.9 | Nucleus sampling threshold. Lower values produce more conservative outputs. |
num_predict | int | 128 | Maximum tokens to generate. -1 means unlimited. |
repeat_penalty | float | 1.1 | Penalises repeated phrases. Increase to 1.2–1.5 if the model loops. |
system | string | — | System prompt that sets the assistant’s persona or task. |
stop | array | — | Stop generation when any of these strings are encountered. |
Performance parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
num_ctx | int | 2048 | Context window size in tokens. |
num_gpu | int | -1 | GPU layers to offload. -1 is auto; 0 forces CPU-only. |
keep_alive | string | — | Keep model loaded after a request (e.g. "5m", "-1" for forever). |
Example: tuned API call
Example: LangChain with parameters
Build a LangChain analysis agent
The following agent classifies text, extracts key points, and summarises it — using only a local Ollama model.Choose the right model
| Model | RAM needed | Best for | Speed |
|---|---|---|---|
llama3.1:8b | 8 GB | General use, agents | Fast |
qwen2.5:14b | 14 GB | Code, reasoning | Medium |
phi3:14b | 14 GB | Efficient tasks | Fast |
mistral:7b | 7 GB | Simple tasks | Very fast |
Troubleshoot common issues
Model not found
Model not found
Pull the model before running:
Connection refused
Connection refused
Start the Ollama daemon:
Out of memory
Out of memory
Switch to a smaller model such as
mistral:7b, or set num_gpu 0 to run on CPU and reduce VRAM pressure.Next steps
Deploy on RunPod GPU
Package Ollama and your agent into a Docker image and deploy it to RunPod’s serverless GPU infrastructure for scalable cloud inference.
Containerize with Docker
Mount local model weights into a container so your Ollama-backed agent runs identically on any host.