Deploy AI agents on RunPod serverless GPU infrastructure

RunPod is a GPU-optimised cloud platform that lets you deploy containerised AI agents as serverless endpoints. Instead of provisioning servers or configuring load balancers, you package your agent into a Docker image, push it to RunPod, and the platform handles scaling, container lifecycle, and resource allocation automatically. You only pay for the compute time your agent actually uses.

Automatic scaling

Containers spin up and down with request volume, and RunPod can keep workers pre-warmed to shorten the wait when traffic arrives.

Cost efficiency

Pay only for active compute time. Idle workers do not incur charges.

Zero infrastructure management

No servers to configure, no load balancers to set up. RunPod manages all of it.

GPU-ready base images

Start from a RunPod PyTorch image with CUDA already installed, then add your application on top.

How serverless endpoints work

RunPod executes a Python handler function whenever your endpoint receives a request. You register the function with the runpod SDK, and RunPod calls it once per job, passing a dictionary whose "input" key holds the caller's JSON payload.
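
As a rough illustration of that contract, the sketch below calls a handler directly with a hand-built job dictionary. The `echo_handler` name and the `id` value are made up for the example; in a real worker the function is registered with `runpod.serverless.start`, as Step 1 shows.

```python
def echo_handler(job):
    """Hypothetical handler: echoes the requested topic back to the caller."""
    job_input = job.get("input", {})  # the caller's payload lives under "input"
    topic = job_input.get("topic", "technology")
    return {"status": "success", "topic": topic}


# A job roughly as the handler would see it; the "id" value is invented.
sample_job = {"id": "local-test-1", "input": {"topic": "serverless GPUs"}}
result = echo_handler(sample_job)
print(result)
```

Calling the handler this way is a convenient local check before building an image, since it needs neither a GPU nor the RunPod SDK.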

Prerequisites

A RunPod account
Docker installed locally
A Docker Hub account (or another container registry)

Step 1: Define the handler

Create handler.py. The handler function reads the job input, runs your agent logic, and returns a result or error dictionary. Register it with runpod.serverless.start.

handler.py

"""RunPod handler for CrewAI blog generation."""

import runpod
from crewai import Agent, Task, Crew, LLM
from crewai.tools import tool

# Configure Ollama LLM — loaded once at container startup
llm = LLM(model="ollama/openhermes", base_url="http://localhost:11434")

@tool("Research Tool")
def fake_research(topic: str) -> str:
    """Pretends to search for information about a topic."""
    return f"Key facts about {topic}: adoption is growing, costs are falling, and expert interest is high."

blog_writer = Agent(
    role="Blog Writer",
    goal="Write engaging and informative blog posts on various topics",
    backstory="You are a professional blog writer known for well-researched, clear articles.",
    tools=[fake_research],
    verbose=True,
    llm=llm
)

def create_blog_post(topic):
    """Creates a blog post on the given topic using CrewAI."""
    blog_task = Task(
        description=f"""
        Write a blog post about {topic}.
        Your blog should have an attention-grabbing title, a brief introduction,
        3-4 main points supported by research, and a conclusion.
        Use the Research Tool to gather facts about {topic}.
        """,
        expected_output="A well-structured blog post of approximately 500 words",
        agent=blog_writer
    )
    # The LLM is configured on the agent, so the crew does not need its own llm argument.
    crew = Crew(agents=[blog_writer], tasks=[blog_task], verbose=True)
    result = crew.kickoff()
    return result.raw

def handler(job):
    """Handler function that will be used to process jobs."""
    job_input = job["input"]
    topic = job_input.get("topic", "technology")

    try:
        blog_post = create_blog_post(topic)
        return {"status": "success", "blog_post": blog_post}
    except Exception as e:
        return {"status": "error", "message": str(e)}


runpod.serverless.start({"handler": handler})

Initialise the LLM and agent at module level, outside the handler function. RunPod reuses the same container process across requests, so global setup runs only once and speeds up subsequent calls.
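
The effect of module-level setup can be seen in a small stand-alone sketch. `load_model` here is a hypothetical stand-in for any slow initialisation (it is not a CrewAI or RunPod function); it runs once when the module is loaded, and every later handler call reuses the result.

```python
import time

LOAD_COUNT = 0  # counts how many times the expensive setup runs


def load_model():
    """Hypothetical stand-in for slow setup such as creating an LLM client."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # simulate slow initialisation
    return {"name": "demo-model"}


MODEL = load_model()  # module level: executed once per worker process


def handler(job):
    """Reuses the already-loaded MODEL instead of loading it per request."""
    topic = job["input"].get("topic", "technology")
    return {"model": MODEL["name"], "topic": topic}


# Simulate three requests handled by the same worker process.
for n in range(3):
    handler({"input": {"topic": f"post {n}"}})

print(LOAD_COUNT)  # prints 1: setup ran once, not once per request
```

Had `load_model()` been called inside `handler`, the counter would grow with every request, which is the cost the module-level pattern avoids.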

Step 2: Write the Dockerfile

The Dockerfile builds the complete runtime — base image, Python dependencies, Ollama, and the model weights — into a single portable image. Baking the model into the image means RunPod does not need to download it when a new container starts, which reduces cold-start latency.


Dockerfile

FROM runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04

ENV PYTHONUNBUFFERED=1

# Install system dependencies
RUN apt-get update --yes --quiet && \
    DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
    software-properties-common gpg-agent build-essential \
    apt-utils ca-certificates curl && \
    add-apt-repository --yes ppa:deadsnakes/ppa && \
    apt-get update --yes --quiet && \
    DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
    python3.11 python3.11-venv python3.11-dev

# Create and activate a Python virtual environment
RUN python3.11 -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"

RUN ln -sf $(which python3.11) /usr/local/bin/python && \
    ln -sf $(which python3.11) /usr/local/bin/python3

# Install Python dependencies
COPY requirements.txt /requirements.txt
RUN pip install --upgrade pip && \
    pip install uv && \
    uv pip install --upgrade -r /requirements.txt --no-cache-dir && \
    uv pip install "langchain-community>=0.0.34" --no-cache-dir

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh

# Download model during build — baked into the image for faster cold starts
RUN ollama serve > /dev/null 2>&1 & \
    sleep 25 && \
    ollama pull openhermes && \
    sleep 10 && \
    pkill ollama

# Add application files
ADD handler.py .
ADD start.sh /start.sh
RUN chmod +x /start.sh

CMD ["/start.sh"]

requirements.txt

crewai>=0.12
crewai-tools>=0.1
runpod~=1.7.9

start.sh

#!/bin/bash
set -e

start_ollama() {
    echo "Starting Ollama service..."
    nohup ollama serve > /ollama.log 2>&1 &

    echo "Waiting for Ollama to initialize..."
    until curl -s http://localhost:11434/api/version >/dev/null; do
        sleep 1
    done

    echo "Loading openhermes model..."
    ollama run openhermes > /openhermes.log 2>&1 &

    sleep 3
    echo "Available models:"
    ollama list
}

start_ollama

echo "Starting serverless handler..."
python handler.py

sleep infinity

RunPod provides a range of pre-built base images. runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04 ships with CUDA, cuDNN, and common ML libraries already installed, giving you a solid starting point without extra setup.
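
The readiness loop in start.sh polls Ollama's /api/version route until it responds. If you would rather perform that wait from Python (for example at the top of handler.py), a minimal sketch could look like the following. The `ollama_is_up` and `wait_until_ready` helpers, the timeout, and the polling interval are assumptions for illustration, not part of the RunPod or Ollama tooling.

```python
import time
import urllib.error
import urllib.request


def ollama_is_up(url="http://localhost:11434/api/version"):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```

Calling `wait_until_ready(ollama_is_up)` before creating the LLM client would give the same guarantee as the shell loop, with the added benefit of an upper bound on the wait. Passing the probe in as an argument keeps the waiting logic testable without a running Ollama server.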

Step 3: Build and push the image

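Build the image for the linux/amd64 platform (required by RunPod’s infrastructure) and push it to Docker Hub in one command. Run docker login first; the --push flag relies on Buildx, which is the default builder in recent Docker releases.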

docker build -t your-dockerhub-username/agents:1.0 . --push --platform linux/amd64

Step 4: Deploy to RunPod

Open the Serverless tab

Select your image source

Choose Docker Image and enter your image name, for example your-dockerhub-username/agents:1.0. Alternatively, connect a GitHub repo and RunPod will build and deploy new commits automatically.

Select GPU hardware

RunPod presents a prioritised list of GPU types. Select multiple GPU tiers — RunPod rotates through them based on availability to minimise wait times.

Configure workers

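Set the minimum and maximum number of workers. Workers that scale down to zero when idle do not incur charges; you pay only for compute while they are running jobs. Workers kept running through a non-zero minimum are billed for the time they stay up, so check RunPod's pricing before raising it.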

RunPod may allocate a few extra workers beyond your configured maximum so that the full worker count stays available even when some workers are busy.

Enable FlashBoot (optional)

For high-traffic endpoints, enable FlashBoot to reduce cold-start times. FlashBoot caches container state so workers initialise significantly faster on scale-out.

Step 5: Test your endpoint

Once deployed, RunPod displays your endpoint ID in the dashboard. Send a request using the built-in test UI or programmatically.

curl

curl --request POST \
     --url https://api.runpod.ai/v2/[ENDPOINT_ID]/run \
     --header "accept: application/json" \
     --header "authorization: [YOUR_API_KEY]" \
     --header "content-type: application/json" \
     --data '{
       "input": {
         "topic": "Technology"
       }
     }'

Python

import requests

response = requests.post(
    "https://api.runpod.ai/v2/[ENDPOINT_ID]/run",
    headers={
        "accept": "application/json",
        "authorization": "[YOUR_API_KEY]",
        "content-type": "application/json"
    },
    json={"input": {"topic": "Technology"}}
)
print(response.json())

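Expected response:

The /run route is asynchronous: it returns a job ID and a queued status immediately rather than the handler's result. Fetch the result from the /status route once the job completes, or call /runsync instead of /run to wait for it in a single request. When the job completes, the output field of the response holds the handler's return value:

{
  "status": "success",
  "blog_post": "..."
}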
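
Because /run queues the job rather than waiting for it, a client typically polls the job's status until it finishes. The sketch below is one way to do that with only the standard library. It assumes the status route is https://api.runpod.ai/v2/[ENDPOINT_ID]/status/[JOB_ID] and that terminal states are reported as COMPLETED, FAILED, CANCELLED, or TIMED_OUT; check both against the current RunPod API reference. The helper names are hypothetical.

```python
import json
import time
import urllib.request

# Assumed names of the states after which a job no longer changes.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def fetch_status(endpoint_id, job_id, api_key):
    """GET the job status document from the (assumed) /status route."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}",
        headers={"authorization": api_key, "accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def poll_job(get_status, interval=2.0, max_polls=150, sleep=time.sleep):
    """Call get_status() until the job reaches a terminal state, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")
```

A caller would pass the id returned by /run, for example `poll_job(lambda: fetch_status(endpoint_id, job_id, api_key))`, and then read the handler's result from the returned dictionary. Blog generation can take a while on a cold worker, so the polling budget is deliberately generous.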

Generate and store your RunPod API key securely. Do not embed it directly in source code — use environment variables or a secrets manager.
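
One way to keep the key out of source code is to read it from the environment at run time. In this sketch the variable name RUNPOD_API_KEY and the helper `build_headers` are assumptions chosen for the example; use whatever name your deployment tooling provides.

```python
import os


def build_headers(env=os.environ):
    """Build request headers, reading the API key from the environment."""
    api_key = env.get("RUNPOD_API_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError("RUNPOD_API_KEY is not set")
    return {
        "accept": "application/json",
        "authorization": api_key,
        "content-type": "application/json",
    }
```

The returned dictionary can be passed as the headers argument in the Python request shown earlier, replacing the hard-coded placeholder. Failing fast when the variable is missing avoids sending unauthenticated requests by accident.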

Update your deployment

Docker image update

Push a new image tag. In the RunPod dashboard, update the endpoint to point to the new tag. RunPod performs a rolling update, swapping workers to the new image without downtime.

GitHub integration

Connect your repository to RunPod. Every new commit triggers an automatic build and rolling deployment — no manual steps required.

Next steps

Run LLMs locally with Ollama

Learn how Ollama works before packaging it for RunPod — pull models, call the REST API, and integrate with LangChain.

Containerize with Docker

Deepen your understanding of Dockerfiles, layer caching, and environment variable injection before deploying to the cloud.
