RunPod is a GPU-optimised cloud platform that lets you deploy containerised AI agents as serverless endpoints. Instead of provisioning servers or configuring load balancers, you package your agent into a Docker image, push it to RunPod, and the platform handles scaling, container lifecycle, and resource allocation automatically. You only pay for the compute time your agent actually uses.
Automatic scaling
Containers spin up and down based on request volume. RunPod pre-warms workers to keep your endpoint responsive when requests arrive.
Cost efficiency
Pay only for active compute time. Idle workers do not incur charges.
Zero infrastructure management
No servers to configure, no load balancers to set up. RunPod manages all of it.
GPU-ready base images
Start from a RunPod PyTorch image with CUDA already installed, then add your application on top.
How serverless endpoints work
RunPod executes a Python handler function whenever your endpoint receives a request. You bind the function using the runpod SDK, and RunPod calls it with a JSON payload for every incoming job.
The handler receives every job as a dictionary with an "input" key containing the caller’s payload.
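To make the payload shape concrete, the sketch below calls a handler directly with a hand-built job dictionary. Only the "input" key comes from the description above; the `id` value and the `prompt` field are hypothetical placeholders chosen for illustration.

```python
# A minimal sketch of the job dictionary a handler receives.
# Only the "input" key is described in the text; the "id" value and
# the "prompt" field are assumptions made for this example.


def handler(job: dict) -> dict:
    """Echo the caller's prompt back, or report a missing field."""
    payload = job.get("input") or {}
    prompt = payload.get("prompt")
    if not prompt:
        return {"error": "Missing 'prompt' in input"}
    return {"echo": prompt}


# Simulate what the platform would pass in for one request.
sample_job = {"id": "local-test", "input": {"prompt": "Hello"}}
print(handler(sample_job))  # {'echo': 'Hello'}
```

Calling the handler this way needs no RunPod account, which makes it a convenient first check before building an image.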
Prerequisites
- A RunPod account
- Docker installed locally
- A Docker Hub account (or another container registry)
Step 1: Define the handler
Create handler.py. The handler function reads the job input, runs your agent logic, and returns a result or error dictionary. Register it with runpod.serverless.start.
handler.py
Initialise the LLM and agent at module level, outside the handler function. RunPod reuses the same container process across requests, so global setup runs only once and speeds up subsequent calls.
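A possible shape for handler.py, following the advice above, is sketched here. It is not the original file: `load_agent` is a hypothetical stand-in for whatever creates the LLM and agent, the `prompt` field name is an assumption, and the SDK import is deferred into `main()` so the handler logic can be exercised without the runpod package installed.

```python
# handler.py (illustrative sketch, not the original file)


def load_agent():
    """Hypothetical stand-in for building the LLM client and agent."""
    def agent(prompt: str) -> str:
        return f"Agent reply to: {prompt}"
    return agent


# Module-level setup: runs once when the worker process starts and is
# reused by every later request handled by the same container.
AGENT = load_agent()


def handler(job: dict) -> dict:
    """Run the agent on the job input; return a result or error dictionary."""
    prompt = (job.get("input") or {}).get("prompt")  # "prompt" is an assumed field name
    if not prompt:
        return {"error": "No prompt provided"}
    try:
        return {"result": AGENT(prompt)}
    except Exception as exc:  # report failures to the caller as an error dictionary
        return {"error": str(exc)}


def main() -> None:
    # Imported here so the handler can be tested without the SDK installed.
    import runpod  # assumes `pip install runpod`

    runpod.serverless.start({"handler": handler})
```

In the image, the container entry point would call `main()`. Locally, `handler({"input": {"prompt": "hi"}})` can be called directly to check the logic.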
Step 2: Write the Dockerfile
The Dockerfile builds the complete runtime (base image, Python dependencies, Ollama, and the model weights) into a single portable image. Baking the model into the image means RunPod does not need to download it when a new container starts, which reduces cold-start latency.
- Full Dockerfile
- requirements.txt
- start.sh
Dockerfile
Step 3: Build and push the image
Build the image for the linux/amd64 platform (required by RunPod’s infrastructure) and push it to Docker Hub in one command:
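One way to build and push in a single step is `docker buildx build` with the `--push` flag. The sketch below wraps that invocation in Python so it can be scripted. It assumes Docker with buildx is installed and that you are already logged in to your registry; the image tag is the placeholder used elsewhere in this guide, and the helper names are hypothetical.

```python
import shlex
import subprocess


def build_and_push_command(image: str, context: str = ".") -> list[str]:
    """Return the argument list for a single build-and-push invocation."""
    return [
        "docker", "buildx", "build",
        "--platform", "linux/amd64",  # target platform named in the text above
        "--tag", image,
        "--push",                     # push to the registry after a successful build
        context,
    ]


def build_and_push(image: str, context: str = ".") -> None:
    """Run the command; raises CalledProcessError if Docker reports a failure."""
    subprocess.run(build_and_push_command(image, context), check=True)


# Print the equivalent shell command without running it.
print(shlex.join(build_and_push_command("your-dockerhub-username/agents:1.0")))
```

Call `build_and_push("your-dockerhub-username/agents:1.0")` from the directory containing the Dockerfile to execute the build.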
Step 4: Deploy to RunPod
Open the Serverless tab
Log in to RunPod and navigate to Serverless in the left sidebar. Click New Endpoint.
Select your image source
Choose Docker Image and enter your image name, for example your-dockerhub-username/agents:1.0. Alternatively, connect a GitHub repo and RunPod will build and deploy new commits automatically.
Select GPU hardware
RunPod presents a prioritised list of GPU types. Select multiple GPU tiers — RunPod rotates through them based on availability to minimise wait times.
Configure workers
Set the minimum and maximum number of workers. Workers in an idle state do not incur charges; you only pay for active compute during job execution.
RunPod may allocate a few extra workers beyond your maximum to ensure your max workers count remains available even when some are handling requests.
Step 5: Test your endpoint
Once deployed, RunPod displays your endpoint ID in the dashboard. Send a request using the built-in test UI or programmatically.
- curl
- Python
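A programmatic request might look like the sketch below, which uses only the Python standard library. It assumes the synchronous `runsync` route under `https://api.runpod.ai/v2/` and bearer-token authentication with a RunPod API key; the endpoint ID, the key, and the `prompt` field are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # assumed REST base URL


def build_request(endpoint_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a synchronous-run request that wraps the payload under "input"."""
    body = json.dumps({"input": payload}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint_id}/runsync",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_endpoint(endpoint_id: str, api_key: str, payload: dict,
                  timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(endpoint_id, api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `call_endpoint("your-endpoint-id", "your-api-key", {"prompt": "Hello"})` sends one job and returns the decoded response. Reading the API key from an environment variable rather than hard-coding it keeps it out of source control.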
Update your deployment
Docker image update
Push a new image tag. In the RunPod dashboard, update the endpoint to point to the new tag. RunPod performs a rolling update, swapping workers to the new image without downtime.
GitHub integration
Connect your repository to RunPod. Every new commit triggers an automatic build and rolling deployment — no manual steps required.
Next steps
Run LLMs locally with Ollama
Learn how Ollama works before packaging it for RunPod — pull models, call the REST API, and integrate with LangChain.
Containerize with Docker
Deepen your understanding of Dockerfiles, layer caching, and environment variable injection before deploying to the cloud.