Run LLMs locally on your own hardware using Ollama

Ollama is a lightweight runtime that downloads quantised open-weight models and serves them through a local REST API on port 11434. Because everything runs on your hardware, no data leaves the machine — making it ideal for sensitive workloads, air-gapped environments, and cost-conscious deployments where you want to avoid per-token charges.

Data sovereignty

Model weights and all inference stay on your hardware. Nothing is sent to external services.

Predictable costs

No per-token fees. You pay for hardware utilisation only, regardless of request volume.

Drop-in replacement

Swap ChatOpenAI for ChatOllama in LangChain with one line. The rest of your agent code stays unchanged.

Prerequisites

Before you start, confirm your machine meets these minimum requirements:

Resource	Minimum	Recommended
RAM	8 GB	16 GB+
Free storage	10 GB	20 GB+
CPU	Any modern x64 / ARM64	—
GPU	Optional	NVIDIA, AMD, or Apple Silicon

Install Ollama

macOS / Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download and run the .exe installer from ollama.com/download. Ollama starts automatically after installation, so you can skip ollama serve.

Docker

docker run -d -p 11434:11434 --name ollama ollama/ollama

See the official Docker image guide for GPU passthrough options.

Pull a model and start the server

Pull model weights

ollama pull fetches the quantised .gguf weights and caches them locally. Browse all available models at ollama.com/library.

ollama pull llama3.1:8b

Start the Ollama daemon

ollama serve

On Windows, Ollama starts automatically after installation. If you see “only one usage of each socket address is permitted”, the daemon is already running — skip this step.

Verify the server is ready

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

You should receive a JSON response with the model’s reply in message.content.
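The same readiness check can be sketched in Python using only the standard library. The helper names (build_chat_request, extract_reply, server_is_ready) are hypothetical and chosen for illustration; the endpoint, request body, and message.content field are the ones shown in the curl call above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming /api/chat request (hypothetical helper)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a non-streaming response body."""
    return json.loads(response_body)["message"]["content"]


def server_is_ready(model: str = "llama3.1:8b", timeout: float = 60.0) -> bool:
    """Return True if the server answers a short prompt, False otherwise."""
    try:
        request = build_chat_request(model, "Hello!")
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return bool(extract_reply(resp.read()))
    except (OSError, KeyError, ValueError):
        # OSError covers refused connections and HTTP errors;
        # KeyError/ValueError cover unexpected or non-JSON bodies.
        return False
```

Calling server_is_ready() from a startup script is one way to fail fast before an agent begins sending real prompts. The first call can be slow because the model has to be loaded into memory, which is why the timeout here is generous.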

Call the API from Python

Ollama exposes a standard REST API that you can call with plain requests or use through the LangChain ChatOllama wrapper.

Replace OpenAI API calls

from openai import OpenAI

prompt = "Hello!"
client = OpenAI(api_key="your-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
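The snippet above is the hosted OpenAI call. A local equivalent might look like the sketch below, which assumes Ollama's OpenAI-compatible endpoint under /v1 and the openai package installed; the helper names are hypothetical. The api_key value is a placeholder: the SDK requires one, but Ollama does not check it.

```python
def local_client_settings(host: str = "http://localhost:11434") -> dict:
    """Connection settings for Ollama's OpenAI-compatible endpoint."""
    return {
        "base_url": f"{host}/v1",
        "api_key": "ollama",  # placeholder; required by the SDK, ignored by Ollama
    }


def ask_local(prompt: str, model: str = "llama3.1:8b") -> str:
    """Same call shape as the hosted example, served by the local daemon."""
    # Imported lazily so local_client_settings() works without the SDK installed.
    from openai import OpenAI

    client = OpenAI(**local_client_settings())
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Only the client construction and the model name change; the chat.completions.create call and the response handling stay the same as in the hosted version.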

Replace LangChain models

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")
response = llm.invoke("Hello!")
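The snippet above uses the hosted model. As a rough sketch of the swap, the hypothetical build_llm helper below returns either backend; it assumes langchain-community (for ChatOllama) or langchain-openai (for ChatOpenAI) is installed, and imports each one only when it is requested.

```python
def build_llm(use_local: bool = True, local_model: str = "llama3.1:8b"):
    """Return a chat model; only the constructor differs between backends."""
    if use_local:
        # Local model served by the Ollama daemon on localhost:11434.
        from langchain_community.chat_models import ChatOllama
        return ChatOllama(model=local_model)

    # Hosted model, as in the original snippet.
    from langchain_openai import ChatOpenAI
    return ChatOpenAI(model="gpt-4")
```

Because both classes implement the same chat-model interface, code such as build_llm().invoke("Hello!") should not need to change when the backend does.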

Tune model behaviour with API parameters

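Every request accepts optional parameters that control how the model generates text. For /api/chat, model, messages, and stream are top-level fields of the request body, while the sampling settings (temperature, top_p, num_predict, repeat_penalty, stop) must be nested inside an options object. Sampling keys placed at the top level are not applied.

Essential parameters

Parameter	Type	Default	Description
`model`	string	(none)	Required. Model identifier, e.g. `"llama3.1:8b"`.
`messages`	array	(none)	Chat history as `{role, content}` objects.
`stream`	boolean	`true`	Stream tokens as they are generated. Use `false` to wait for the full response.
`temperature`	float	`0.8`	Set in `options`. Controls randomness. `0.0` is close to deterministic; `2.0` is highly random.
`top_p`	float	`0.9`	Set in `options`. Nucleus sampling threshold. Lower values produce more conservative outputs.
`num_predict`	int	`-1`	Set in `options`. Maximum tokens to generate. `-1` means no limit.
`repeat_penalty`	float	`1.1`	Set in `options`. Penalises repeated phrases. Increase to `1.2–1.5` if the model loops.
`system`	string	(none)	System prompt that sets the assistant’s persona or task. This field belongs to `/api/generate`; with `/api/chat`, send a message with role `system` instead.
`stop`	array	(none)	Set in `options`. Stop generation when any of these strings is encountered.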
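Because stream defaults to true, a client that does not set it to false receives one JSON object per line rather than a single response. The sketch below shows one way to reassemble the text; collect_stream is a hypothetical helper, and it assumes each chunk carries a message.content fragment plus a done flag on the final chunk.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the content fragments of a streamed /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
    return "".join(parts)
```

With the requests library, the lines can come from requests.post(url, json=payload, stream=True).iter_lines(). To display tokens as they arrive, print each fragment inside the loop instead of joining at the end.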

Performance parameters

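Parameter	Type	Default	Description
`num_ctx`	int	`2048`	Context window size in tokens. Set inside `options`.
`num_gpu`	int	`-1`	GPU layers to offload. `-1` is auto; `0` forces CPU-only. Set inside `options`.
`keep_alive`	string or number	`"5m"`	Top-level field. How long the model stays loaded after a request, e.g. `"10m"`. A negative value such as `-1` or `"-1m"` keeps it loaded indefinitely; `0` unloads it immediately.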

Example: tuned API call

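import requests

# Sampling settings are nested under "options"; /api/chat does not apply
# keys such as "temperature" when they sit at the top level of the body.
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": False,
    "options": {
        "temperature": 0.3,      # Lower for factual responses
        "top_p": 0.9,            # Nucleus sampling
        "num_predict": 500,      # Limit response length
        "repeat_penalty": 1.2,   # Reduce repetition
        "stop": ["```", "---"],  # Stop at code blocks or separators
    },
}, timeout=120)
response.raise_for_status()      # Surface HTTP errors such as an unknown model
data = response.json()
print(data["message"]["content"])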

Example: LangChain with parameters

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(
    model="llama3.1:8b",
    temperature=0.7,
    top_p=0.9,
    num_predict=256,
    repeat_penalty=1.1
)
response = llm.invoke("Hello!")
print(response.content)

Set keep_alive to avoid reloading model weights between requests. Use stream: false to simplify response handling, and keep num_predict small to reduce latency in agent loops.

Large num_ctx values require proportionally more RAM or VRAM. Start with the default 2048 and increase only if your use case needs longer context.
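To make the scaling concrete, here is a back-of-the-envelope estimate of the key/value cache, which is the part of memory that grows with num_ctx. The architecture numbers are assumptions for Llama 3.1 8B (32 layers, 8 key/value heads, head dimension 128) with a 16-bit cache; other models and quantised caches will differ, and the model weights come on top of this figure.

```python
def kv_cache_bytes(
    num_ctx: int,
    n_layers: int = 32,        # assumed: Llama 3.1 8B
    n_kv_heads: int = 8,       # assumed: grouped-query attention with 8 KV heads
    head_dim: int = 128,       # assumed: 4096 hidden size / 32 attention heads
    bytes_per_value: int = 2,  # assumed: 16-bit cache entries
) -> int:
    """Estimate KV-cache size: keys and values for every layer and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


for ctx in (2048, 8192, 32768):
    print(f"num_ctx={ctx:>6}: {kv_cache_bytes(ctx) / 2**20:,.0f} MiB")
```

Under these assumptions the cache needs about 256 MiB at 2048 tokens, 1 GiB at 8192, and 4 GiB at 32768, so the cost grows linearly with the context window.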

Build a LangChain analysis agent

The following agent classifies text, extracts key points, and summarises it — using only a local Ollama model.

from typing import Dict, List
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

class SimpleAnalysisAgent:
    """A simple agent that analyzes text and provides insights."""

    def __init__(self, model_name: str = "llama3.1:8b"):
        self.llm = ChatOllama(model=model_name, temperature=0.1)

    def classify_text(self, text: str) -> str:
        """Classify the type of text."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Classify this text as one of: news, blog, email, code, academic, or other. Respond with just the category."),
            ("human", "{text}")
        ])
        chain = prompt | self.llm
        result = chain.invoke({"text": text[:500]})
        return result.content.strip().lower()

    def extract_key_points(self, text: str) -> List[str]:
        """Extract key points from text."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Extract 3-5 key points from this text. Return as a simple numbered list."),
            ("human", "{text}")
        ])
        chain = prompt | self.llm
        result = chain.invoke({"text": text})
        lines = result.content.strip().split('\n')
        return [l.strip() for l in lines if l.strip() and any(c.isdigit() for c in l[:3])][:5]

    def summarize(self, text: str) -> str:
        """Create a summary of the text."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Summarize this text in 2-3 sentences. Be concise and clear."),
            ("human", "{text}")
        ])
        chain = prompt | self.llm
        result = chain.invoke({"text": text})
        return result.content.strip()

    def analyze_text(self, text: str) -> Dict:
        """Complete analysis of text."""
        return {
            "category": self.classify_text(text),
            "key_points": self.extract_key_points(text),
            "summary": self.summarize(text),
            "length": len(text)
        }

Install dependencies before running:

pip install langchain langchain-community requests
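Local models do not always follow the "numbered list" instruction exactly, so the parsing step in extract_key_points can miss bullet-style replies or keep the numbering in the output. The sketch below is one possible alternative, not part of the agent above; parse_key_points and its regular expression are hypothetical.

```python
import re
from typing import List

# Matches "1. text", "2) text", "- text", "* text", or "• text".
_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def parse_key_points(reply: str, limit: int = 5) -> List[str]:
    """Return list items from a model reply with their markers stripped."""
    points = []
    for line in reply.splitlines():
        match = _ITEM.match(line)
        if match:
            points.append(match.group(1))
    return points[:limit]
```

For example, a reply that opens with "Here are the key points:" followed by "1. First point" and "- Second point" yields ["First point", "Second point"]; the introductory sentence is dropped because it does not start with a list marker.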

Choose the right model

Model	RAM needed	Best for	Speed
`llama3.1:8b`	8 GB	General use, agents	Fast
`qwen2.5:14b`	14 GB	Code, reasoning	Medium
`phi3:14b`	14 GB	Efficient tasks	Fast
`mistral:7b`	7 GB	Simple tasks	Very fast
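As a small illustration of using the table, the sketch below filters the listed models by available memory. The RAM figures are copied from the table; the models_that_fit helper and the 2 GB headroom for the operating system and other processes are assumptions to adjust for your machine.

```python
from typing import List

# RAM requirements in GB, taken from the table above.
MODEL_RAM_GB = {
    "llama3.1:8b": 8,
    "qwen2.5:14b": 14,
    "phi3:14b": 14,
    "mistral:7b": 7,
}


def models_that_fit(available_ram_gb: float, headroom_gb: float = 2.0) -> List[str]:
    """List models whose RAM requirement fits the budget, largest first."""
    budget = available_ram_gb - headroom_gb
    fitting = [name for name, need in MODEL_RAM_GB.items() if need <= budget]
    return sorted(fitting, key=MODEL_RAM_GB.get, reverse=True)
```

With these numbers, models_that_fit(10) returns ["llama3.1:8b", "mistral:7b"], while models_that_fit(8) returns an empty list because the headroom leaves only 6 GB.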

Troubleshoot common issues

Model not found

Pull the model before running:

ollama pull <model-name>

Connection refused

Start the Ollama daemon:

ollama serve

Out of memory

Switch to a smaller model such as mistral:7b, or set num_gpu to 0 in the request options to run on CPU, which moves the weights out of VRAM and into system RAM.
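The first two problems can be told apart programmatically. The sketch below queries the /api/tags endpoint, which lists locally pulled models, using only the standard library; the helper names and the returned messages are hypothetical.

```python
import json
import urllib.request


def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Check an /api/tags response for an exact model name such as 'llama3.1:8b'."""
    names = {entry.get("name") for entry in tags_payload.get("models", [])}
    return model in names


def diagnose(model: str = "llama3.1:8b", host: str = "http://localhost:11434") -> str:
    """Return a short hint describing why a request might fail."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
            tags = json.load(resp)
    except OSError:
        # URLError (including refused connections) is a subclass of OSError.
        return "server unreachable: start the daemon with `ollama serve`"
    if not model_is_pulled(tags, model):
        return f"model not found: run `ollama pull {model}`"
    return "ok"
```

The name comparison is exact, so a model pulled without a tag is listed with the :latest suffix and will not match a name such as llama3.1:8b.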

Next steps

Deploy on RunPod GPU

Package Ollama and your agent into a Docker image and deploy it to RunPod’s serverless GPU infrastructure for scalable cloud inference.

Containerize with Docker

Mount local model weights into a container so your Ollama-backed agent runs identically on any host.
