The @mariozechner/pi-pods CLI simplifies running large language models on remote GPU pods with automatic vLLM configuration for agentic workloads.
Key Features
Automatic Setup: Sets up vLLM on fresh Ubuntu pods automatically
Tool Calling: Configures tool calling for agentic models
Smart GPU Allocation: Manages multiple models with automatic GPU assignment
OpenAI Compatible: Provides OpenAI-compatible API endpoints
Installation
npm install -g @mariozechner/pi
Quick Start
Set Environment Variables
export HF_TOKEN=your_huggingface_token
export PI_API_KEY=your_api_key
Setup Pod
pi pods setup dc1 "ssh root@1.2.3.4" \
--mount "sudo mount -t nfs nfs.server:/path /mnt/models"
Start Model
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
Test with Agent
# Single message
pi agent qwen "What is the Fibonacci sequence?"
# Interactive mode
pi agent qwen -i
Supported Providers
DataCrunch (Recommended)
NFS volumes shareable across pods
Models download once, use everywhere
Best for teams or multiple experiments
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/pseudo /mnt/models"
RunPod
Network volumes with good persistence
Cannot share between running pods
Good for single-pod workflows
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume
Also Works With
Vast.ai
Prime Intellect
AWS EC2 with EFS
Any Ubuntu machine with NVIDIA GPUs
Pod Management
Setup New Pod
pi pods setup <name> "<ssh>" [options]
  --mount "<mount_command>"        # Run mount command during setup
  --models-path <path>             # Override extracted path
  --vllm release|nightly|gpt-oss   # vLLM version
List and Manage Pods
pi pods # List all configured pods
pi pods active <name> # Switch active pod
pi pods remove <name> # Remove pod from config
pi shell [<name>] # SSH into pod
pi ssh [<name>] "<cmd>" # Run command on pod
Model Management
Start Models
pi start <model> --name <name> [options]
  --memory <percent>   # GPU memory: 30%, 50%, 90%
  --context <size>     # Context: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>       # Number of GPUs
  --pod <name>         # Target specific pod
  --vllm <args...>     # Custom vLLM args
Manage Running Models
pi stop [<name>] # Stop model (or all)
pi list # List running models
pi logs <name> # Stream model logs
Predefined Models
Qwen Models
# Qwen2.5-Coder-32B - Excellent coding model
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
# Qwen3-Coder-30B - Advanced reasoning
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3
# Qwen3-Coder-480B - 8xH200 required
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
GPT-OSS Models
# Requires special vLLM build
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss
pi start openai/gpt-oss-20b --name gpt20
pi start openai/gpt-oss-120b --name gpt120
GLM Models
pi start zai-org/GLM-4.5 --name glm
pi start zai-org/GLM-4.5-Air --name glm-air
Custom Models
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
--tensor-parallel-size 4 --trust-remote-code
# Any model with specific parser
pi start some/model --name mymodel --vllm \
--tool-call-parser hermes --enable-auto-tool-choice
Multi-GPU Support
Automatic Assignment
pi start model1 --name m1 # Auto-assigns GPU 0
pi start model2 --name m2 # Auto-assigns GPU 1
pi start model3 --name m3 # Auto-assigns GPU 2
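One way to picture the automatic assignment above is "place each new model on the GPU that currently hosts the fewest models". The sketch below works under that assumption only; it is not the CLI's actual implementation, and `assign_gpus` and its parameters are hypothetical names.

```python
from collections import Counter


def assign_gpus(total_gpus: int, in_use: list[int], count: int = 1) -> list[int]:
    """Pick `count` GPU indices, preferring GPUs that host the fewest models.

    Hypothetical helper: illustrates the idea of automatic assignment only.
    """
    if count < 1 or count > total_gpus:
        raise ValueError("count must be between 1 and total_gpus")
    load = Counter(in_use)  # GPU index -> number of models already placed there
    # Sort by (current load, index) so ties go to the lowest-numbered GPU.
    ranked = sorted(range(total_gpus), key=lambda gpu: (load[gpu], gpu))
    return sorted(ranked[:count])


running: list[int] = []
for name in ("m1", "m2", "m3"):
    gpus = assign_gpus(total_gpus=4, in_use=running)
    running.extend(gpus)
    print(name, "->", gpus)  # m1 -> [0], m2 -> [1], m3 -> [2]
```

On a hypothetical 4-GPU pod this reproduces the GPU 0, 1, 2 order shown in the commands above.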
Specify GPU Count
# Run on 1 GPU instead of all
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1
# Run on 8 GPUs
pi start zai-org/GLM-4.5 --name glm --gpus 8
Tensor Parallelism
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
--tensor-parallel-size 4
Agent Interface
Single Messages
pi agent <name> "<message>"
pi agent <name> "<msg1>" "<msg2>" # Multiple messages
Interactive Mode
pi agent <name> -i # Interactive chat
pi agent <name> -i -c # Continue previous session
Standalone Agent
# Works with any OpenAI-compatible API
pi-agent --base-url http://localhost:8000/v1 --model model-name "Hello"
pi-agent --api-key sk-... "What is 2+2?"
pi-agent --json "What is 2+2?" # JSONL output
pi-agent -i # Interactive mode
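Because `--json` produces JSONL (one JSON object per line), the output can be consumed line by line from a script. The following is a minimal sketch; the event fields in the sample (`type`, `text`) are invented for illustration and are not the documented pi-agent schema.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded object per non-empty line of JSONL input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample output; real events may use different field names.
sample = '{"type": "message", "text": "4"}\n\n{"type": "done"}\n'
events = list(parse_jsonl(sample.splitlines()))
print(len(events), events[0]["text"])
```

In practice the same function could read `sys.stdin` when the agent's output is piped into the script.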
API Integration
All models expose OpenAI-compatible endpoints:
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
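Where the `openai` package is not available, the same call can be made over plain HTTP. This sketch assumes the standard OpenAI-compatible `/chat/completions` path under the base URL and bearer-token authentication; `build_chat_request` is a hypothetical helper, and the host, port, and key are placeholders.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(
    "http://your-pod-ip:8001/v1",
    "your-pi-api-key",
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    "Hello!",
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then json.load() the response body.
```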
Memory and Context
GPU Memory Allocation
--memory 30% - High concurrency, limited context
--memory 50% - Balanced (default)
--memory 90% - Maximum context, low concurrency
Context Window
--context 4k - 4,096 tokens
--context 32k - 32,768 tokens
--context 128k - 131,072 tokens
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
--context 64k --memory 70%
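The token counts listed above imply that the `k` suffix means 1,024 tokens. A small sketch of that arithmetic follows; `context_to_tokens` is a hypothetical helper, and accepting a plain integer without a suffix is an assumption rather than documented CLI behavior.

```python
def context_to_tokens(size: str) -> int:
    """Convert a context size such as "32k" into a token count (k = 1024)."""
    text = size.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)  # assumption: a bare number is already a token count


for size in ("4k", "32k", "128k"):
    print(size, "=", context_to_tokens(size))  # 4096, 32768, 131072
```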
Tool calling is configured automatically for known models:
Qwen: hermes parser
GLM: glm4_moe parser with reasoning
GPT-OSS: Uses /v1/responses endpoint
Custom: Specify with --vllm --tool-call-parser <parser>
Disable tool calling:
pi start model --name mymodel --vllm --disable-tool-call-parser
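The per-family defaults can be thought of as a lookup from model ID to parser name. The sketch below encodes only the two parser names given in this section; matching by ID prefix is an assumption, and `pick_tool_parser` is a hypothetical helper rather than part of the CLI.

```python
from typing import Optional

# Parser names come from this section; prefix matching is an assumption.
KNOWN_PARSERS = {
    "Qwen/": "hermes",
    "zai-org/GLM": "glm4_moe",
}


def pick_tool_parser(model_id: str) -> Optional[str]:
    """Return the default parser for a known model family, else None."""
    for prefix, parser in KNOWN_PARSERS.items():
        if model_id.startswith(prefix):
            return parser
    return None  # unknown model: pass --vllm --tool-call-parser <parser> yourself


print(pick_tool_parser("Qwen/Qwen2.5-Coder-32B-Instruct"))
print(pick_tool_parser("zai-org/GLM-4.5-Air"))
print(pick_tool_parser("some/model"))
```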
Troubleshooting
OOM Errors
Reduce --memory percentage
Use a quantized version (FP8)
Reduce --context size
Model Won’t Start
pi ssh "nvidia-smi" # Check GPU usage
pi list # Check port conflicts
pi stop # Force stop all
Try a different parser: --vllm --tool-call-parser mistral
Or disable tool calling: --vllm --disable-tool-call-parser
Environment Variables
Variable          Description
HF_TOKEN          HuggingFace token for downloads
PI_API_KEY        API key for vLLM endpoints
PI_CONFIG_DIR     Config directory (default: ~/.pi)
OPENAI_API_KEY    Used by pi-agent
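A script that wraps the CLI may want to read the same variables. This is a minimal sketch, assuming only the names and the `~/.pi` default listed above; `load_pi_env` and the keys of the returned dictionary are hypothetical.

```python
import os
from pathlib import Path


def load_pi_env(environ=os.environ) -> dict:
    """Collect pi-related settings, applying the documented ~/.pi default."""
    config_dir = Path(environ.get("PI_CONFIG_DIR", "~/.pi")).expanduser()
    return {
        "hf_token": environ.get("HF_TOKEN"),  # None when unset
        "api_key": environ.get("PI_API_KEY"),  # None when unset
        "config_dir": config_dir,
    }


settings = load_pi_env({"HF_TOKEN": "your_huggingface_token"})
print(settings["config_dir"])
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.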
Next Steps
DataCrunch Setup: Detailed DataCrunch configuration
RunPod Setup: RunPod configuration guide
GitHub Repository: View source code and examples