The Kubernetes integration provides native orchestration capabilities for running ZenML pipelines on Kubernetes clusters, with full control over pod configuration and resources.
Installation
pip install "zenml[kubernetes]"
This installs:
kubernetes>=21.7,<26 - Kubernetes Python client
Jinja2 - Template engine for Kubernetes manifests
Available Components
The Kubernetes integration provides these stack components:
Kubernetes Orchestrator - Execute complete pipelines as Kubernetes Jobs
Kubernetes Step Operator - Run individual steps as Kubernetes Pods
Kubernetes Orchestrator
The Kubernetes orchestrator runs your complete pipeline by creating a Kubernetes Job for each step.
Configuration
zenml orchestrator register k8s-orch \
--flavor=kubernetes \
--kubernetes_context=my-cluster-context \
--kubernetes_namespace=zenml
Optional Parameters:
kubernetes_context - kubectl context name (defaults to current context)
kubernetes_namespace - Namespace for pipeline pods (default: zenml)
synchronous - Wait for pipeline completion (default: True)
skip_local_validations - Skip local kubectl checks (default: False)
Prerequisites
Before using the Kubernetes orchestrator:
Running Kubernetes cluster with kubectl access
Container registry accessible from the cluster
kubectl configured with correct context
Namespace created (if not using default)
# Create namespace
kubectl create namespace zenml
# Verify connectivity
kubectl get nodes
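The prerequisite checks above can also be scripted. The following is a minimal sketch and not part of ZenML: `preflight_commands` and `check_cluster_access` are hypothetical helper names, and the `runner` and `kubectl_available` parameters exist only so the logic can be exercised without a real cluster.

```python
import shutil
import subprocess
from typing import Callable, List, Sequence


def preflight_commands(namespace: str = "zenml") -> List[List[str]]:
    """Return kubectl commands that mirror the prerequisites above."""
    return [
        ["kubectl", "config", "current-context"],  # a context is configured
        ["kubectl", "get", "nodes"],  # the cluster is reachable
        ["kubectl", "get", "namespace", namespace],  # the namespace exists
    ]


def _run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(cmd, capture_output=True).returncode


def _kubectl_on_path() -> bool:
    """Report whether a kubectl binary can be found on PATH."""
    return shutil.which("kubectl") is not None


def check_cluster_access(
    namespace: str = "zenml",
    runner: Callable[[Sequence[str]], int] = _run,
    kubectl_available: Callable[[], bool] = _kubectl_on_path,
) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    if not kubectl_available():
        return ["kubectl not found on PATH"]
    problems = []
    for cmd in preflight_commands(namespace):
        if runner(cmd) != 0:
            problems.append("failed: " + " ".join(cmd))
    return problems
```

Calling `check_cluster_access()` with the defaults runs the real commands against whichever context is currently active.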
Step-Level Pod Configuration
Customize Kubernetes Pods for individual steps using KubernetesPodSettings:
import pandas as pd

from zenml import step, pipeline
from zenml.integrations.kubernetes.pod_settings import KubernetesPodSettings

# Note: Model stands for whatever model type your training code returns.

@step(
    settings={
        "orchestrator.kubernetes": KubernetesPodSettings(
            node_selectors={"kubernetes.io/hostname": "gpu-node-1"},
            resources={
                "requests": {"memory": "16Gi", "cpu": "4"},
                "limits": {"memory": "16Gi", "cpu": "4", "nvidia.com/gpu": "1"},
            },
            annotations={"prometheus.io/scrape": "true"},
            labels={"team": "ml-ops", "component": "training"},
            tolerations=[
                {
                    "key": "gpu",
                    "operator": "Equal",
                    "value": "true",
                    "effect": "NoSchedule",
                }
            ],
            affinity={
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": "accelerator",
                                        "operator": "In",
                                        "values": ["nvidia-tesla-v100"],
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            volumes=[
                {
                    "name": "data-volume",
                    "persistentVolumeClaim": {"claimName": "training-data-pvc"},
                }
            ],
            volume_mounts=[
                {"name": "data-volume", "mountPath": "/data"}
            ],
        )
    }
)
def train_on_gpu(data: pd.DataFrame) -> Model:
    # Training code with GPU access
    ...

@step
def preprocess_data() -> pd.DataFrame:
    # Preprocessing with default settings
    ...

@pipeline
def training_pipeline():
    data = preprocess_data()
    train_on_gpu(data)
Available Pod Settings:
node_selectors - Select nodes by labels
affinity - Advanced node selection rules
tolerations - Allow scheduling on tainted nodes
resources - CPU, memory, and GPU requests/limits
annotations - Pod annotations
labels - Pod labels
volumes - Volumes to attach
volume_mounts - Where to mount volumes
env - Environment variables
service_account_name - Kubernetes service account
host_ipc - Use host IPC namespace (for shared memory)
Resource Management
CPU and Memory:
KubernetesPodSettings(
    resources={
        "requests": {"cpu": "2", "memory": "8Gi"},
        "limits": {"cpu": "4", "memory": "16Gi"},
    }
)
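requests - Resources reserved for the container; the scheduler uses them to choose a node
limits - Maximum resources; CPU use above the limit is throttled, and a container that exceeds its memory limit is killed (OOMKilled)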
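A request should never exceed the matching limit, and the quantity strings make that easy to get wrong by eye. The sketch below is a hypothetical helper, not part of ZenML or the Kubernetes client: it parses the common suffixes (`m`, `k` to `E`, `Ki` to `Ei`) into exact numbers and reports any resource whose request is larger than its limit.

```python
from fractions import Fraction

# Suffix multipliers from the Kubernetes quantity notation.
_SUFFIXES = {
    "m": Fraction(1, 1000),
    "k": Fraction(10**3), "M": Fraction(10**6), "G": Fraction(10**9),
    "T": Fraction(10**12), "P": Fraction(10**15), "E": Fraction(10**18),
    "Ki": Fraction(2**10), "Mi": Fraction(2**20), "Gi": Fraction(2**30),
    "Ti": Fraction(2**40), "Pi": Fraction(2**50), "Ei": Fraction(2**60),
}


def parse_quantity(value: str) -> Fraction:
    """Convert strings such as "500m", "4" or "16Gi" to an exact number."""
    # Check the two-character suffixes before the one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            return Fraction(value[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(value)  # plain number without a suffix


def over_limit(resources: dict) -> list:
    """Return resource names whose request exceeds the matching limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return [
        name
        for name, requested in requests.items()
        if name in limits
        and parse_quantity(requested) > parse_quantity(limits[name])
    ]
```

For the settings shown above, `over_limit` returns an empty list because both requests sit below their limits.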
GPUs:
KubernetesPodSettings(
    resources={
        "limits": {
            "nvidia.com/gpu": "2",  # NVIDIA GPUs
            # or "amd.com/gpu": "1"  # AMD GPUs
        }
    },
    node_selectors={"accelerator": "nvidia-tesla-v100"},
)
Note: Specify GPUs under limits. Kubernetes then uses the limit as the request; if you also set a request, it must equal the limit.
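Kubernetes applies a few rules to GPU resources: a limit may be given on its own, a request without a limit is rejected, a request and limit that are both set must match, and counts must be whole numbers. The following sketch checks a `resources` dictionary against those rules before it is passed to the pod settings. `gpu_problems` and `GPU_KEYS` are hypothetical names used for illustration.

```python
GPU_KEYS = ("nvidia.com/gpu", "amd.com/gpu")


def gpu_problems(resources: dict) -> list:
    """Return violations of the Kubernetes rules for GPU requests and limits."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    problems = []
    for key in GPU_KEYS:
        if key in requests and key not in limits:
            problems.append(f"{key}: request set without a limit")
        elif key in requests and str(requests[key]) != str(limits[key]):
            problems.append(f"{key}: request and limit differ")
        # GPUs cannot be shared fractionally through this resource name.
        if key in limits and not str(limits[key]).isdigit():
            problems.append(f"{key}: GPU count must be a whole number")
    return problems
```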
Node Selection Strategies
Simple Node Selection:
KubernetesPodSettings(
    node_selectors={
        "kubernetes.io/hostname": "specific-node",
        "node.kubernetes.io/instance-type": "n1-standard-4",
    }
)
Advanced Affinity:
KubernetesPodSettings(
    affinity={
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 1,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "instance-type",
                                "operator": "In",
                                "values": ["gpu"],
                            }
                        ]
                    },
                }
            ]
        },
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchLabels": {"app": "training"},
                    },
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        },
    }
)
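To see how `matchExpressions` select nodes, the sketch below evaluates the hard `requiredDuringSchedulingIgnoredDuringExecution` form of node affinity against a node's labels. It is a simplified illustration of the scheduler's behavior with hypothetical function names; the `Gt` and `Lt` operators and `matchFields` are left out.

```python
def matches_expression(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def matches_term(labels: dict, term: dict) -> bool:
    """All expressions inside one term must match (logical AND)."""
    expressions = term.get("matchExpressions", [])
    # An empty term matches no nodes.
    return bool(expressions) and all(
        matches_expression(labels, e) for e in expressions
    )


def node_is_eligible(labels: dict, node_affinity: dict) -> bool:
    """Terms are ORed: a single matching term makes the node eligible."""
    required = node_affinity.get("requiredDuringSchedulingIgnoredDuringExecution")
    if not required:
        return True  # no hard requirement, every node qualifies
    return any(matches_term(labels, t) for t in required["nodeSelectorTerms"])
```

The `preferredDuringSchedulingIgnoredDuringExecution` form uses the same expressions, but a mismatch only lowers a node's score instead of excluding it.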
Tolerations (for tainted nodes):
KubernetesPodSettings(
    tolerations=[
        {
            "key": "dedicated",
            "operator": "Equal",
            "value": "ml-training",
            "effect": "NoSchedule",
        },
        {
            "key": "gpu",
            "operator": "Exists",
            "effect": "NoSchedule",
        },
    ]
)
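The difference between the `Equal` and `Exists` operators is easiest to see in code. This sketch reimplements the Kubernetes matching rules in plain Python for illustration only; `tolerates` and `can_schedule` are hypothetical names, and the real decision is made by the scheduler.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False
    # An empty key is only valid with Exists and matches every taint key.
    if key == "":
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True  # the taint's value is ignored
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations: list, taints: list) -> bool:
    """Every NoSchedule or NoExecute taint on the node must be tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
    )
```

With the tolerations shown above, a node tainted `gpu=anything:NoSchedule` is accepted because `Exists` ignores the value, while `dedicated=other-team:NoSchedule` is rejected because `Equal` compares it.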
Persistent Storage
Using Persistent Volume Claims:
KubernetesPodSettings(
    volumes=[
        {
            "name": "training-data",
            "persistentVolumeClaim": {"claimName": "ml-data-pvc"},
        }
    ],
    volume_mounts=[
        {
            "name": "training-data",
            "mountPath": "/mnt/data",
            "readOnly": False,
        }
    ],
)
Using ConfigMaps:
KubernetesPodSettings(
    volumes=[
        {
            "name": "config",
            "configMap": {"name": "training-config"},
        }
    ],
    volume_mounts=[
        {"name": "config", "mountPath": "/etc/config"}
    ],
)
Using Secrets:
KubernetesPodSettings(
    volumes=[
        {
            "name": "secrets",
            "secret": {"secretName": "ml-credentials"},
        }
    ],
    volume_mounts=[
        {"name": "secrets", "mountPath": "/etc/secrets", "readOnly": True}
    ],
)
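In each of the storage examples above, the `name` of a mount must match the `name` of a declared volume, and each mount path may be used only once per container. A mismatch is only reported when the pod is created, so it can help to check the two lists up front. The helpers below are a minimal sketch with hypothetical names.

```python
def unresolved_mounts(volumes: list, volume_mounts: list) -> list:
    """Return the names of mounts that reference no declared volume."""
    declared = {volume["name"] for volume in volumes}
    return [m["name"] for m in volume_mounts if m["name"] not in declared]


def duplicate_mount_paths(volume_mounts: list) -> list:
    """Return mount paths that are used more than once."""
    seen, duplicates = set(), []
    for mount in volume_mounts:
        path = mount["mountPath"]
        if path in seen and path not in duplicates:
            duplicates.append(path)
        seen.add(path)
    return duplicates
```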
Kubernetes Step Operator
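The step operator runs individual steps as Kubernetes Pods, allowing hybrid execution: selected steps run on the cluster while the remaining steps run wherever the stack's orchestrator executes them (for example, locally).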
Configuration
zenml step-operator register k8s-step-op \
--flavor=kubernetes \
--kubernetes_context=my-cluster-context \
--kubernetes_namespace=zenml
Usage
import pandas as pd
from zenml import step, pipeline

# Note: Model stands for whatever model type your training code returns.

@step(step_operator="k8s-step-op")
def train_on_k8s(data: pd.DataFrame) -> Model:
    # This step runs in Kubernetes
    ...

@step
def preprocess_locally(raw_data: pd.DataFrame) -> pd.DataFrame:
    # This step runs locally
    ...

@pipeline
def hybrid_pipeline():
    data = preprocess_locally(...)  # Local execution
    model = train_on_k8s(data)  # Kubernetes execution
Service Account Setup
Create a Kubernetes service account for pipelines:
# zenml-service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: zenml-sa
  namespace: zenml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: zenml-role
  namespace: zenml
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: zenml-role-binding
  namespace: zenml
subjects:
  - kind: ServiceAccount
    name: zenml-sa
    namespace: zenml
roleRef:
  kind: Role
  name: zenml-role
  apiGroup: rbac.authorization.k8s.io
Apply and use:
kubectl apply -f zenml-service-account.yaml
KubernetesPodSettings(service_account_name="zenml-sa")
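To reason about what the Role above permits, its rules can be modelled as plain data. The sketch below copies the two rules into a Python list and checks whether a given API group, resource, and verb is covered. `ROLE_RULES` and `is_allowed` are hypothetical names, and the check is simplified: it ignores `resourceNames` and any ClusterRoles bound to the same account.

```python
# The two rules from zenml-role, written as Python data.
ROLE_RULES = [
    {"apiGroups": [""], "resources": ["pods", "pods/log"],
     "verbs": ["get", "list", "watch", "create", "delete"]},
    {"apiGroups": ["batch"], "resources": ["jobs"],
     "verbs": ["get", "list", "watch", "create", "delete"]},
]


def is_allowed(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule grants the verb ("*" acts as a wildcard)."""
    def covers(allowed: list, wanted: str) -> bool:
        return "*" in allowed or wanted in allowed

    # RBAC rules are purely additive: one matching rule is enough.
    return any(
        covers(rule["apiGroups"], api_group)
        and covers(rule["resources"], resource)
        and covers(rule["verbs"], verb)
        for rule in rules
    )
```

On a live cluster, `kubectl auth can-i create pods --as=system:serviceaccount:zenml:zenml-sa -n zenml` gives the authoritative answer.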
Complete Stack Example
# Register container registry
zenml container-registry register docker-registry \
--flavor=default \
--uri=docker.io/myusername
# Register orchestrator
zenml orchestrator register k8s-orch \
--flavor=kubernetes \
--kubernetes_context=prod-cluster \
--kubernetes_namespace=zenml-prod
# Register artifact store (accessible from cluster)
zenml artifact-store register s3-store \
--flavor=s3 \
--path=s3://my-ml-artifacts
# Create stack
zenml stack register k8s-prod \
-o k8s-orch \
-a s3-store \
-c docker-registry
# Activate
zenml stack set k8s-prod
Best Practices
Prevent resource exhaustion with quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: zenml-quota
  namespace: zenml
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    requests.nvidia.com/gpu: "10"
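A quota is enforced against the sum of all pod requests in the namespace, and extended resources such as GPUs are written with the `requests.` prefix in a quota. The sketch below estimates whether a planned set of pods fits under the CPU and GPU figures of such a quota. `cpu_millicores` and `quota_shortfalls` are hypothetical helpers, and memory is omitted to keep the example short.

```python
def cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as "2" or "500m" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def quota_shortfalls(pod_requests: list, hard: dict) -> dict:
    """Return by how much the summed requests exceed each hard limit."""
    totals = {
        "requests.cpu": sum(
            cpu_millicores(p.get("cpu", "0")) for p in pod_requests
        ),
        "requests.nvidia.com/gpu": sum(
            int(p.get("nvidia.com/gpu", "0")) for p in pod_requests
        ),
    }
    limits = {
        "requests.cpu": cpu_millicores(hard["requests.cpu"]),
        "requests.nvidia.com/gpu": int(hard["requests.nvidia.com/gpu"]),
    }
    # CPU overshoot is reported in millicores, GPU overshoot in devices.
    return {k: totals[k] - limits[k] for k in totals if totals[k] > limits[k]}
```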
Use Pod Security Standards
Enforce the restricted Pod Security Standard through namespace labels:
apiVersion: v1
kind: Namespace
metadata:
  name: zenml
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Use metrics-server to monitor resource consumption:
kubectl top pods -n zenml
kubectl top nodes
Use Init Containers for Setup
Use init containers for preprocessing:
KubernetesPodSettings(
    init_containers=[
        {
            "name": "data-downloader",
            "image": "busybox",
            "command": ["sh", "-c", "wget -O /data/dataset.csv https://example.com/data.csv"],
            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
        }
    ]
)
Common Issues
If pods can’t pull images:
Verify container registry credentials
Create image pull secret:
kubectl create secret docker-registry regcred \
--docker-server=docker.io \
--docker-username=myuser \
--docker-password=mypass \
-n zenml
Add to pod settings:
KubernetesPodSettings(image_pull_secrets=["regcred"])
If pods remain pending:
Check node resources: kubectl describe nodes
View pod events: kubectl describe pod POD_NAME -n zenml
Lower resource requests or add more nodes
If you see RBAC errors:
Verify service account exists
Check role bindings are correct
Ensure kubectl context has permissions
Next Steps
Kubeflow Integration - Use Kubeflow Pipelines on Kubernetes
Container Registries - Configure image registries
Remote Execution - Production deployment patterns
Kubernetes Docs - Official Kubernetes documentation