Ollama is a lightweight runtime that downloads quantised open-weight models and serves them through a local REST API on port 11434. Because everything runs on your hardware, no data leaves the machine — making it ideal for sensitive workloads, air-gapped environments, and cost-conscious deployments where you want to avoid per-token charges.

Data sovereignty

Model weights and all inference stay on your hardware. Nothing is sent to external services.

Predictable costs

No per-token fees. You pay for hardware utilisation only, regardless of request volume.

Drop-in replacement

Swap ChatOpenAI for ChatOllama in LangChain with one line. The rest of your agent code stays unchanged.

Prerequisites

Before you start, confirm your machine meets these minimum requirements:

| Resource | Minimum | Recommended |
| --- | --- | --- |
| RAM | 8 GB | 16 GB+ |
| Free storage | 10 GB | 20 GB+ |
| CPU | Any modern x64 / ARM64 | |
| GPU | Optional | NVIDIA, AMD, or Apple Silicon |
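
As a rough pre-flight check, the thresholds in the table above can be compared against the current machine. The sketch below is a hypothetical helper, not part of Ollama: `MINIMUM`, `free_storage_gb`, and `meets_minimum` are illustrative names, and the RAM figure is passed in by hand because the Python standard library has no portable way to read it.

```python
import shutil

# Minimum values copied from the prerequisites table (in GB).
MINIMUM = {"ram_gb": 8, "free_storage_gb": 10}

def free_storage_gb(path: str = ".") -> float:
    """Return the free disk space at `path` in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def meets_minimum(ram_gb: float, storage_gb: float) -> bool:
    """True when both values reach the documented minimums."""
    return (ram_gb >= MINIMUM["ram_gb"]
            and storage_gb >= MINIMUM["free_storage_gb"])

# ram_gb is supplied manually; adjust it to match your machine.
print(meets_minimum(ram_gb=16, storage_gb=free_storage_gb()))
```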

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Pull a model and start the server

1. Pull model weights

ollama pull fetches the quantised .gguf weights and caches them locally. Browse all available models at ollama.com/library.

ollama pull llama3.1:8b

2. Start the Ollama daemon

ollama serve

On Windows, Ollama starts automatically after installation. If you see “only one usage of each socket address is permitted”, the daemon is already running, so skip this step.

3. Verify the server is ready

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

You should receive a JSON response with the model’s reply in message.content.
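
The same readiness check can be scripted, for example before an agent starts sending requests. The sketch below polls Ollama's GET /api/tags endpoint, which lists locally available models, and treats any connection failure as "not ready". `is_ollama_ready` is a hypothetical helper name, and the two-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

def is_ollama_ready(base_url: str = "http://localhost:11434",
                    timeout: float = 2.0) -> bool:
    """Return True if the server answers GET /api/tags with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError both subclass it).
        return False
```

Calling `is_ollama_ready()` in a short retry loop is one way to wait for `ollama serve` to finish starting.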

Call the API from Python

Ollama exposes a standard REST API that you can call with plain requests or use through the LangChain ChatOllama wrapper.

Replace OpenAI API calls

from openai import OpenAI

prompt = "Hello!"
client = OpenAI(api_key="your-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
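
The snippet above is the hosted OpenAI call to be replaced. A local equivalent against Ollama's /api/chat endpoint might look like the sketch below. It uses the standard library's urllib rather than requests to stay dependency-free; `build_chat_payload` and `chat` are hypothetical helper names, and the URL assumes the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local address

def build_chat_payload(prompt: str, model: str = "llama3.1:8b") -> dict:
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a token stream
    }

def chat(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["message"]["content"]
```

With the daemon running, `print(chat("Hello!"))` plays the role of the final `print` in the OpenAI version, and no API key is involved.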

Replace LangChain models

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")
response = llm.invoke("Hello!")
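
The snippet above is the ChatOpenAI starting point. Swapping in a local model could be sketched as follows, assuming the langchain-community package is installed and Ollama listens on its default address. `local_llm_kwargs` and `make_local_llm` are hypothetical helper names; ChatOllama is imported lazily so the settings can be inspected without LangChain present.

```python
def local_llm_kwargs(model: str = "llama3.1:8b", temperature: float = 0.0) -> dict:
    """Constructor arguments for a ChatOllama instance on the local server."""
    return {
        "model": model,
        "temperature": temperature,
        "base_url": "http://localhost:11434",  # assumed default Ollama address
    }

def make_local_llm(**overrides):
    """Create a ChatOllama model, letting callers override any default."""
    # Lazy import: requires `pip install langchain-community`.
    from langchain_community.chat_models import ChatOllama
    return ChatOllama(**{**local_llm_kwargs(), **overrides})
```

After `llm = make_local_llm()`, the call `llm.invoke("Hello!")` is unchanged from the ChatOpenAI version.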

Tune model behaviour with API parameters

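Every request accepts optional parameters that control how the model generates text. In Ollama's native API, model, messages, stream, and keep_alive are top-level fields of the request body; sampling and runtime settings such as temperature, top_p, num_predict, and num_ctx go inside a nested options object.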

Essential parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | | Required. Model identifier, e.g. "llama3.1:8b". |
| messages | array | | Chat history as {role, content} objects. |
| stream | boolean | true | Stream tokens as they are generated. Use false to wait for the full response. |
| temperature | float | 0.8 | Controls randomness. 0.0 is deterministic; 2.0 is highly random. |
| top_p | float | 0.9 | Nucleus sampling threshold. Lower values produce more conservative outputs. |
| num_predict | int | 128 | Maximum tokens to generate. -1 means unlimited. |
| repeat_penalty | float | 1.1 | Penalises repeated phrases. Increase to 1.2–1.5 if the model loops. |
| system | string | | System prompt that sets the assistant’s persona or task. This field belongs to /api/generate; with /api/chat, send the system prompt as a message with role "system". |
| stop | array | | Stop generation when any of these strings is encountered. |
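
With stream left at its default of true, /api/chat returns newline-delimited JSON: each line carries a fragment in message.content, and the final line has done set to true. The sketch below joins such lines into the full reply. `collect_stream` is a hypothetical helper, and the sample lines are hand-written for illustration rather than captured from a real server.

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message.content fragments until a chunk reports done."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Hand-written sample chunks in the shape described above.
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    b'{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))  # prints: Hello!
```

When calling the API with requests, passing `stream=True` to `requests.post` and feeding `response.iter_lines()` into this function is one way to consume the stream.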

Performance parameters

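| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_ctx | int | 2048 | Context window size in tokens. |
| num_gpu | int | -1 | GPU layers to offload. -1 is auto; 0 forces CPU-only. |
| keep_alive | string or number | 5m | How long to keep the model loaded after a request, e.g. "5m". A negative value such as -1 keeps it loaded indefinitely. |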
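
In Ollama's native API, model, messages, stream, and keep_alive sit at the top level of the request body, while sampling and runtime settings such as temperature or num_ctx travel inside an options object. The sketch below splits flat keyword arguments accordingly. `TOP_LEVEL_FIELDS` and `build_request` are hypothetical names, and the field set covers only the parameters discussed on this page.

```python
# Fields that /api/chat reads from the top level of the body (per this page).
TOP_LEVEL_FIELDS = {"model", "messages", "stream", "keep_alive"}

def build_request(**params) -> dict:
    """Split flat keyword arguments into top-level fields and nested options."""
    body = {k: v for k, v in params.items() if k in TOP_LEVEL_FIELDS}
    options = {k: v for k, v in params.items() if k not in TOP_LEVEL_FIELDS}
    if options:
        body["options"] = options
    return body

request_body = build_request(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False,
    keep_alive="5m",
    temperature=0.3,
    num_ctx=4096,
)
print(request_body["options"])  # prints: {'temperature': 0.3, 'num_ctx': 4096}
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`.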

Example: tuned API call

import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": False,
    # Sampling settings belong under "options" in Ollama's native API.
    "options": {
        "temperature": 0.3,      # Lower for factual responses
        "top_p": 0.9,            # Nucleus sampling
        "num_predict": 500,      # Limit response length
        "repeat_penalty": 1.2,   # Reduce repetition
        "stop": ["```", "---"],  # Stop at code blocks or separators
    },
})
response.raise_for_status()      # Fail loudly on HTTP errors
data = response.json()
print(data["message"]["content"])

Example: LangChain with parameters

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(
    model="llama3.1:8b",
    temperature=0.7,
    top_p=0.9,
    num_predict=256,
    repeat_penalty=1.1
)
response = llm.invoke("Hello!")
print(response.content)

Set keep_alive to avoid reloading model weights between requests. Use stream: false to simplify response handling, and keep num_predict small to reduce latency in agent loops.

Large num_ctx values require proportionally more RAM or VRAM. Start with the default 2048 and increase only if your use case needs longer context.
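
To make the scaling concrete, the key-value cache grows linearly with num_ctx. The sketch below estimates its size as 2 × layers × KV heads × head dimension × context length × bytes per value. The defaults (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are assumptions based on the published Llama 3.1 8B architecture; actual memory use depends on the runtime and on any cache quantisation, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size; the leading 2 counts keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value

for ctx in (2048, 8192, 32768):
    print(f"num_ctx={ctx}: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
```

Under these assumptions the cache takes about 256 MiB at 2048 tokens, 1 GiB at 8192, and 4 GiB at 32768, in addition to the model weights.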

Build a LangChain analysis agent

The following agent classifies text, extracts key points, and summarises it — using only a local Ollama model.

from typing import Dict, List
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

class SimpleAnalysisAgent:
    """A simple agent that analyzes text and provides insights."""

    def __init__(self, model_name: str = "llama3.1:8b"):
        self.llm = ChatOllama(model=model_name, temperature=0.1)

    def classify_text(self, text: str) -> str:
        """Classify the type of text."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Classify this text as one of: news, blog, email, code, academic, or other. Respond with just the category."),
            ("human", "{text}")
        ])
        chain = prompt | self.llm
        result = chain.invoke({"text": text[:500]})
        return result.content.strip().lower()

    def extract_key_points(self, text: str) -> List[str]:
        """Extract key points from text."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Extract 3-5 key points from this text. Return as a simple numbered list."),
            ("human", "{text}")
        ])
        chain = prompt | self.llm
        result = chain.invoke({"text": text})
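        lines = [line.strip() for line in result.content.splitlines()]
        # Keep only numbered items such as "1. ..." (a digit in the first three characters), capped at five.
        return [line for line in lines if line and any(c.isdigit() for c in line[:3])][:5]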

    def summarize(self, text: str) -> str:
        """Create a summary of the text."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Summarize this text in 2-3 sentences. Be concise and clear."),
            ("human", "{text}")
        ])
        chain = prompt | self.llm
        result = chain.invoke({"text": text})
        return result.content.strip()

    def analyze_text(self, text: str) -> Dict:
        """Complete analysis of text."""
        return {
            "category": self.classify_text(text),
            "key_points": self.extract_key_points(text),
            "summary": self.summarize(text),
            "length": len(text)
        }

Install the dependencies before running:

pip install langchain langchain-community requests
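
The list filtering inside extract_key_points can be checked without a running model. The sketch below mirrors that filter as a standalone function. `parse_numbered_list` is a hypothetical name, and the reply string is invented for illustration.

```python
from typing import List

def parse_numbered_list(text: str, limit: int = 5) -> List[str]:
    """Keep lines with a digit in their first three characters, e.g. '1.' or '2)'."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if line and any(c.isdigit() for c in line[:3]):
            items.append(line)
    return items[:limit]

# Invented model reply with a preamble and a sign-off around the list.
reply = "Here are the key points:\n1. Runs locally\n2. No per-token fees\n\nThanks!"
print(parse_numbered_list(reply))  # prints: ['1. Runs locally', '2. No per-token fees']
```

The filter is deliberately loose: any line whose first three characters contain a digit passes, so a stricter pattern may be preferable if the model tends to start prose lines with numbers.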

Choose the right model

| Model | RAM needed | Best for | Speed |
| --- | --- | --- | --- |
| llama3.1:8b | 8 GB | General use, agents | Fast |
| qwen2.5:14b | 14 GB | Code, reasoning | Medium |
| phi3:14b | 14 GB | Efficient tasks | Fast |
| mistral:7b | 7 GB | Simple tasks | Very fast |
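
The table can also be applied programmatically, for instance to pick a default model at startup. The sketch below filters the listed models by available RAM. `MODELS` and `models_that_fit` are hypothetical names, the RAM figures are copied from the table, and the largest-first ordering is an arbitrary choice.

```python
# (model, RAM needed in GB) copied from the table above, largest first.
MODELS = [
    ("qwen2.5:14b", 14),
    ("phi3:14b", 14),
    ("llama3.1:8b", 8),
    ("mistral:7b", 7),
]

def models_that_fit(ram_gb: float) -> list:
    """Return the listed models whose RAM requirement is within ram_gb."""
    return [name for name, needed in MODELS if needed <= ram_gb]

print(models_that_fit(8))  # prints: ['llama3.1:8b', 'mistral:7b']
```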

Troubleshoot common issues

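If a request fails because the model has not been downloaded, pull it before running:

ollama pull <model-name>

If the client cannot connect to port 11434, start the Ollama daemon:

ollama serve

If the machine runs out of memory, switch to a smaller model such as mistral:7b, or set num_gpu to 0 to run on CPU and reduce VRAM pressure.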
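
Before sending a request, it can help to confirm that the model is already pulled. Ollama's GET /api/tags endpoint returns a models array whose entries include a name field. The sketch below checks a parsed response of that shape. `model_is_pulled` is a hypothetical helper, and the sample response is hand-written and trimmed to the name field only.

```python
def model_is_pulled(tags_response: dict, name: str) -> bool:
    """True if `name` appears in a parsed GET /api/tags response."""
    return any(m.get("name") == name for m in tags_response.get("models", []))

# Hand-written sample, reduced to the one field this check needs.
sample_tags = {"models": [{"name": "llama3.1:8b"}, {"name": "mistral:7b"}]}
print(model_is_pulled(sample_tags, "llama3.1:8b"))  # prints: True
print(model_is_pulled(sample_tags, "qwen2.5:14b"))  # prints: False
```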

Next steps

Deploy on RunPod GPU

Package Ollama and your agent into a Docker image and deploy it to RunPod’s serverless GPU infrastructure for scalable cloud inference.

Containerize with Docker

Mount local model weights into a container so your Ollama-backed agent runs identically on any host.
