Overview
This module includes two homework assignments focused on deploying ML models through various serving approaches:
H9: API and UI serving with FastAPI, Streamlit, and Gradio
H10: Inference servers with Seldon, KServe, Triton, Ray, and vLLM
H9: API Serving
Learning Objectives
REST APIs: Build production-ready APIs with FastAPI
Web UIs: Create interactive interfaces with Streamlit/Gradio
Testing: Write comprehensive integration tests
Kubernetes: Deploy services to K8s with proper manifests
Tasks
PR1: Streamlit UI
Objective: Create an interactive web UI for your model.
Requirements:
Single prediction interface with text input
Batch prediction with CSV upload
Unit tests for both interfaces
CI integration (pytest in GitHub Actions)
Reference implementation:
import streamlit as st
from serving.predictor import Predictor

# cache_resource keeps one shared model object across reruns and sessions;
# cache_data is meant for serializable data, not model objects.
@st.cache_resource
def get_model():
    return Predictor.default_from_model_registry()

def single_pred():
    predictor = get_model()
    input_sent = st.text_input("Type english sentence")
    if st.button("Run inference"):
        pred = predictor.predict([input_sent])
        st.write("Pred:", pred)
Testing:
from streamlit.testing.v1 import AppTest

def test_single_prediction():
    at = AppTest.from_file("serving/ui_app.py")
    at.run()
    at.text_input[0].set_value("test").run()
    at.button[0].click().run()
    # st.write renders string arguments as markdown elements
    assert "Pred:" in at.markdown[0].value
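The batch requirement (CSV upload) has no reference above. The following is a minimal sketch of the core logic, assuming the uploaded file has a `text` column and that the predictor accepts a list of strings as in the single-prediction example. `StubPredictor`, `predict_csv`, and the column name are hypothetical and only stand in for the real model.

```python
import csv
import io
from typing import List, Sequence


class StubPredictor:
    """Hypothetical stand-in for serving.predictor.Predictor."""

    def predict(self, text: Sequence[str]) -> List[List[float]]:
        # Fake two-class probabilities derived from the text length.
        out = []
        for t in text:
            p = min(len(t), 10) / 10
            out.append([1 - p, p])
        return out


def predict_csv(csv_text: str, predictor, column: str = "text") -> List[dict]:
    """Run one batched prediction over a CSV and return rows with a `probs` field."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if rows and column not in rows[0]:
        raise ValueError(f"CSV has no '{column}' column")
    # A single predict call for the whole file instead of one call per row.
    probs = predictor.predict([row[column] for row in rows])
    return [{**row, "probs": list(p)} for row, p in zip(rows, probs)]


if __name__ == "__main__":
    sample = "text\nhello\ngood morning\n"
    for row in predict_csv(sample, StubPredictor()):
        print(row)
```

In a Streamlit app the CSV text would typically come from `st.file_uploader`, for example `uploaded.getvalue().decode("utf-8")`. Keeping the logic in a plain function makes it testable without a running UI.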
PR2: Gradio UI
Objective: Build an alternative UI with Gradio.
Requirements:
Similar functionality to Streamlit
Component-based interface
Tests with gr.Interface.test_launch()
CI integration
Example:
import gradio as gr
from serving.predictor import Predictor

predictor = Predictor.default_from_model_registry()

def predict(text):
    return predictor.predict([text])[0].tolist()

interface = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Input text"),
    outputs=gr.Label(label="Predictions"),
)

if __name__ == "__main__":
    interface.launch()
PR3: FastAPI Server
Objective: Implement a production-ready REST API.
Requirements:
Pydantic models for validation
/health_check endpoint
/predict endpoint with batch support
Comprehensive tests with TestClient
CI integration
Reference:
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
from serving.predictor import Predictor

class Payload(BaseModel):
    text: List[str]

app = FastAPI()
# Load the model once at import time so every request reuses it.
predictor = Predictor.default_from_model_registry()

@app.get("/health_check")
def health_check() -> str:
    return "ok"

@app.post("/predict")
def predict(payload: Payload):
    prediction = predictor.predict(text=payload.text)
    return {"probs": prediction.tolist()}
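Testing:
from fastapi.testclient import TestClient

# `app` is the FastAPI instance from the reference above; import it from the module that defines it.
client = TestClient(app)

def test_predict():
    response = client.post("/predict", json={"text": ["test"]})
    assert response.status_code == 200
    # The test expects two class probabilities per input.
    assert len(response.json()["probs"][0]) == 2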
PR4: API Kubernetes Deployment
Objective: Deploy FastAPI to Kubernetes.
Requirements:
Deployment manifest with 2+ replicas
Service manifest (ClusterIP)
ConfigMaps for configuration
Secrets for API keys (W&B)
Resource limits/requests
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fastapi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-fastapi
  template:
    metadata:
      labels:
        app: app-fastapi  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: app-fastapi
          image: your-registry/app-fastapi:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
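ConfigMaps and Secrets usually reach the container as environment variables. The following is a minimal sketch of how the service could read them at startup, assuming that injection style. `WANDB_API_KEY` is the variable the W&B client reads; `MODEL_ID`, `MAX_BATCH_SIZE`, and the `Settings` class are hypothetical names used for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    wandb_api_key: str   # injected from a Secret
    model_id: str        # injected from a ConfigMap (hypothetical key)
    max_batch_size: int  # injected from a ConfigMap (hypothetical key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast at startup if required configuration is missing."""
    missing = [k for k in ("WANDB_API_KEY", "MODEL_ID") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return Settings(
        wandb_api_key=env["WANDB_API_KEY"],
        model_id=env["MODEL_ID"],
        max_batch_size=int(env.get("MAX_BATCH_SIZE", "32")),
    )


if __name__ == "__main__":
    demo_env = {"WANDB_API_KEY": "dummy", "MODEL_ID": "my-model:latest"}
    settings = load_settings(demo_env)
    # Never print the API key itself.
    print(settings.model_id, settings.max_batch_size)
```

Failing at startup rather than on the first request makes a misconfigured pod show up immediately in `kubectl logs` and in the rollout status.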
PR5: UI Kubernetes Deployment
Objective: Deploy Streamlit/Gradio to Kubernetes.
Requirements:
Deployment manifest (single replica for session state)
Service manifest
Ingress configuration (optional)
Health checks
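Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-streamlit
spec:
  replicas: 1  # Single replica for session state
  selector:
    matchLabels:
      app: app-streamlit
  template:
    metadata:
      labels:
        app: app-streamlit  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: app-streamlit
          image: your-registry/app-streamlit:latest
          livenessProbe:
            httpGet:
              path: /_stcore/health
              port: 8080  # must match the port Streamlit listens on (its default is 8501)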
Google Doc Update
Objective: Document the model serving plan.
Include:
API design decisions (endpoints, formats)
UI/UX considerations
Deployment architecture
Scaling strategy
Monitoring plan
Tradeoffs between serving options
Success Criteria
5 PRs merged with passing CI
All tests pass (pytest, API tests, UI tests)
Deployments run successfully on K8s
Google doc includes serving architecture
H10: Inference Servers
Learning Objectives
Production Serving: Deploy with Seldon, KServe, and Triton
Performance: Optimize throughput with batching and GPUs
LLM Serving: Serve LLMs with vLLM and LoRA adapters
Comparison: Evaluate tradeoffs between solutions
Tasks
PR1: Seldon API Deployment
Objective: Deploy the model with Seldon Core.
Requirements:
Implement Seldon protocol wrapper
Create SeldonDeployment manifest
Write integration tests
Document comparison with vanilla K8s deployment
Example:
from serving.predictor import Predictor

class SeldonModel:
    def __init__(self):
        self.predictor = Predictor.default_from_model_registry()

    def predict(self, X, features_names=None):
        # X is a numpy array or list
        predictions = self.predictor.predict(X)
        return predictions
PR2: KServe API Integration
Objective: Deploy with a KServe InferenceService.
Requirements:
Implement KServe Model class
Create InferenceService manifest
Test V1/V2 inference protocol
Configure autoscaling
Reference: See KServe documentation
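The V1 and V2 inference protocols differ mainly in URL and payload shape. The following sketch builds request paths and bodies for a text classifier, assuming string inputs. The model name `my-model` and the tensor name `text` are hypothetical; the actual tensor names depend on how the model is registered.

```python
import json
from typing import List, Tuple


def v1_request(model_name: str, texts: List[str]) -> Tuple[str, dict]:
    """V1 protocol: POST /v1/models/<name>:predict with an `instances` list."""
    return f"/v1/models/{model_name}:predict", {"instances": texts}


def v2_request(model_name: str, texts: List[str],
               tensor_name: str = "text") -> Tuple[str, dict]:
    """V2 (Open Inference Protocol): POST /v2/models/<name>/infer with typed tensors."""
    body = {
        "inputs": [
            {
                "name": tensor_name,   # hypothetical tensor name
                "shape": [len(texts)],
                "datatype": "BYTES",   # strings travel as BYTES tensors
                "data": texts,
            }
        ]
    }
    return f"/v2/models/{model_name}/infer", body


if __name__ == "__main__":
    for path, body in (v1_request("my-model", ["good", "bad"]),
                       v2_request("my-model", ["good", "bad"])):
        print(path, json.dumps(body))
```

A protocol test can then post each body to the InferenceService URL and check that V1 responses carry a `predictions` key and V2 responses carry an `outputs` list.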
PR3: Triton Inference Server
Objective: Deploy with NVIDIA Triton.
Requirements:
Implement PyTriton wrapper
Configure dynamic batching
Create model configuration
Write client tests
Measure throughput improvements
Reference: See Triton documentation
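Dynamic batching groups requests that arrive close together into a single model call. The sketch below illustrates the idea in plain asyncio. It is not Triton's or PyTriton's implementation, and `MicroBatcher`, `max_batch_size`, and `max_delay_s` are hypothetical names chosen for this example.

```python
import asyncio
from typing import Callable, List


class MicroBatcher:
    """Collect requests until `max_batch_size` or `max_delay_s` is reached, then run one batched call."""

    def __init__(self, batch_fn: Callable[[List[str]], List],
                 max_batch_size: int = 8, max_delay_s: float = 0.01):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded so batching can be inspected

    async def predict(self, item: str):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            items = [item for item, _ in batch]
            try:
                results = self.batch_fn(items)  # one model call for the whole batch
            except Exception as exc:
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            self.batch_sizes.append(len(items))
            for (_, fut), res in zip(batch, results):
                fut.set_result(res)


async def demo(n_requests: int = 20):
    # The lambda stands in for a batched model call.
    batcher = MicroBatcher(lambda xs: [len(x) for x in xs], max_batch_size=8)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(
        *(batcher.predict("x" * i) for i in range(n_requests))
    )
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    results, sizes = asyncio.run(demo())
    print(results, sizes)
```

The two knobs trade latency for throughput: a larger batch size or delay improves hardware utilization but makes the first request in each batch wait longer. Triton exposes comparable settings in its model configuration.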
PR4: Ray Deployment
Objective: Deploy with Ray Serve.
Requirements:
Create Ray Serve deployment
Configure replicas and resources
Implement model batching
Test auto-scaling behavior
Example:
from ray import serve
from starlette.requests import Request

from serving.predictor import Predictor

@serve.deployment(num_replicas=2)
class ModelDeployment:
    def __init__(self):
        self.predictor = Predictor.default_from_model_registry()

    async def __call__(self, request: Request):
        # Expects a JSON body of the form {"instances": ["text", ...]}
        body = await request.json()
        predictions = self.predictor.predict(body["instances"])
        return {"predictions": predictions.tolist()}
PR5: LLM Deployment with vLLM (Optional)
Objective: Serve LLMs with vLLM and LoRA adapters.
Requirements:
Deploy vLLM server with base model
Implement adapter loading client
Create K8s manifest with GPU support
Document adapter management workflow
Reference: See vLLM documentation
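An adapter loading client can be sketched as a few JSON POST requests against the vLLM OpenAI-compatible server. The example below builds the requests without sending them. It assumes the server was started with LoRA enabled and with runtime adapter updates allowed (the `VLLM_ALLOW_RUNTIME_LORA_UPDATING` environment variable); the base URL, adapter name, and adapter path are hypothetical, and the endpoint names should be checked against the vLLM version in use.

```python
import json
import urllib.request


def _post(base_url: str, path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_adapter_request(base_url: str, name: str, path: str) -> urllib.request.Request:
    # Registers an adapter stored at `path` under the name `name`.
    return _post(base_url, "/v1/load_lora_adapter",
                 {"lora_name": name, "lora_path": path})


def unload_adapter_request(base_url: str, name: str) -> urllib.request.Request:
    return _post(base_url, "/v1/unload_lora_adapter", {"lora_name": name})


def completion_request(base_url: str, adapter_name: str, prompt: str,
                       max_tokens: int = 64) -> urllib.request.Request:
    # A loaded adapter is selected through the `model` field of the request.
    return _post(base_url, "/v1/completions",
                 {"model": adapter_name, "prompt": prompt, "max_tokens": max_tokens})


if __name__ == "__main__":
    req = load_adapter_request("http://localhost:8000", "sql-adapter", "/adapters/sql")
    print(req.get_method(), req.full_url, req.data.decode())
    # To send against a running server: urllib.request.urlopen(req)
```

Separating request construction from sending keeps the client unit-testable in CI, where no GPU server is available.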
PR6: Modal Deployment (Optional)
Objective: Deploy an LLM on the Modal serverless platform.
Requirements:
Create Modal app definition
Configure GPU resources
Implement API endpoint
Compare cost vs K8s deployment
Example:
import modal

app = modal.App("llm-inference")  # modal.Stub was renamed to modal.App

@app.function(
    gpu="A10G",
    image=modal.Image.debian_slim().pip_install("vllm"),
)
def generate(prompt: str) -> str:
    from vllm import LLM

    # The model is loaded inside the function, so every call pays the load time.
    llm = LLM("microsoft/Phi-3-mini-4k-instruct")
    outputs = llm.generate([prompt])
    return outputs[0].outputs[0].text
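For the cost comparison with a K8s deployment, a simple model is usually enough to find the break-even traffic level. The sketch below assumes per-second GPU billing for serverless and always-on GPU nodes for K8s. All function names are hypothetical and the prices in the usage example are placeholders, not real quotes; cold starts, idle container time, and engineering effort are ignored.

```python
def serverless_monthly_cost(requests_per_month: float,
                            gpu_seconds_per_request: float,
                            price_per_gpu_second: float) -> float:
    """Pay only for the GPU seconds actually used."""
    return requests_per_month * gpu_seconds_per_request * price_per_gpu_second


def always_on_monthly_cost(n_gpus: int, price_per_gpu_hour: float,
                           hours_per_month: float = 730.0) -> float:
    """Pay for every hour the GPU nodes exist, busy or idle."""
    return n_gpus * price_per_gpu_hour * hours_per_month


def break_even_requests(n_gpus: int, price_per_gpu_hour: float,
                        gpu_seconds_per_request: float,
                        price_per_gpu_second: float) -> float:
    """Monthly request count above which the always-on cluster becomes cheaper."""
    fixed = always_on_monthly_cost(n_gpus, price_per_gpu_hour)
    return fixed / (gpu_seconds_per_request * price_per_gpu_second)


if __name__ == "__main__":
    # Placeholder prices: substitute the current rates of your providers.
    hourly, per_second, seconds = 1.0, 0.0004, 2.0
    print("serverless:", serverless_monthly_cost(100_000, seconds, per_second))
    print("always-on: ", always_on_monthly_cost(1, hourly))
    print("break-even:", break_even_requests(1, hourly, seconds, per_second))
```

Below the break-even volume the serverless option is cheaper; above it the fixed cluster cost is amortized over enough requests to win, provided one GPU can actually sustain that load.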
Google Doc: Comparison Analysis
Objective: Compare serving solutions and justify your choice.
Include:
Feature comparison table
Performance benchmarks (latency, throughput)
Cost analysis (infrastructure, maintenance)
Operational complexity
Scaling characteristics
Final recommendation with justification
Comparison dimensions:
Setup complexity
Performance (GPU utilization, latency)
Scalability (autoscaling, multi-model)
Monitoring and observability
Ecosystem and community support
Success Criteria
6 PRs merged (4 required + 2 optional)
All inference servers deploy successfully
Tests pass for each implementation
Google doc includes comprehensive comparison
Final serving solution chosen with justification
Testing Checklist
API Testing
import pytest
from fastapi.testclient import TestClient

# `app` is the FastAPI instance under test; import it from the module that defines it.
client = TestClient(app)

def test_health_check():
    """Verify service is running"""
    response = client.get("/health_check")
    assert response.status_code == 200

def test_predict_single():
    """Test single prediction"""
    response = client.post("/predict", json={"text": ["test"]})
    assert response.status_code == 200
    assert "probs" in response.json()

def test_predict_batch():
    """Test batch prediction"""
    response = client.post("/predict", json={"text": ["test1", "test2"]})
    assert len(response.json()["probs"]) == 2

def test_invalid_input():
    """Test error handling: a payload that fails validation returns 422"""
    response = client.post("/predict", json={"invalid": "data"})
    assert response.status_code == 422
Kubernetes Testing
# Deployment health
kubectl get deployments
kubectl describe deployment app-fastapi
# Pod status
kubectl get pods -l app=app-fastapi
kubectl logs -l app=app-fastapi
# Service connectivity
kubectl get services
kubectl port-forward svc/app-fastapi 8080:8080
curl http://localhost:8080/health_check
# Resource usage
kubectl top pods -l app=app-fastapi
# Latency benchmark
import statistics
import time

import requests

def benchmark_latency(endpoint: str, n_requests: int = 100):
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        requests.post(endpoint, json={"text": ["test"]})
        latencies.append(time.perf_counter() - start)
    print(f"Mean latency: {statistics.mean(latencies):.3f} s")
    print(f"P95 latency: {statistics.quantiles(latencies, n=20)[18]:.3f} s")
    print(f"P99 latency: {statistics.quantiles(latencies, n=100)[98]:.3f} s")
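Sequential requests measure latency but say little about throughput. The following is a sketch of a concurrent benchmark, assuming the request is wrapped in a zero-argument callable. `benchmark_throughput` and `fake_request` are hypothetical; in practice `fake_request` would be replaced by a function that posts to the endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def benchmark_throughput(send: Callable[[], object], n_requests: int = 100,
                         concurrency: int = 8) -> float:
    """Issue `n_requests` calls from `concurrency` threads; return requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send) for _ in range(n_requests)]
        for f in futures:
            f.result()  # re-raise any request error
    elapsed = time.perf_counter() - start
    return n_requests / elapsed


def fake_request() -> None:
    # Stand-in for requests.post(endpoint, json=...); simulates a 10 ms call.
    time.sleep(0.01)


if __name__ == "__main__":
    for c in (1, 4, 16):
        rps = benchmark_throughput(fake_request, n_requests=64, concurrency=c)
        print(f"concurrency={c:>2}: {rps:.0f} req/s")
```

Running the same benchmark before and after enabling batching, at several concurrency levels, gives the throughput comparison the inference-server tasks ask for.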
Common Issues
Symptoms: Container crashes on startup
Solutions:
Check W&B credentials: kubectl get secret wandb -o yaml
Verify model path: kubectl exec <pod> -- ls /tmp/model
Increase memory limits in deployment
Check logs: kubectl logs <pod>
Symptoms: High latency (>1s for small inputs)
Solutions:
Enable batching in inference server
Add GPU resources to deployment
Use model quantization (INT8)
Implement model caching
Check CPU/memory throttling
Symptoms: Cannot connect to service
Solutions:
Verify service exists: kubectl get svc
Check pod is running: kubectl get pods
Use correct service port: Check manifest
Try different local port: kubectl port-forward svc/app 8081:8080
Submission Guidelines
Code Quality
All tests pass locally and in CI
Code follows project style (ruff format)
No secrets committed to repository
Dockerfiles build successfully
Documentation
README explains how to run each service
Kubernetes manifests have descriptive comments
Google doc includes architecture diagrams
API endpoints documented with examples
Pull Requests
Title format: [module-5] <description>
PR description explains changes
Screenshots of running services
Links to deployed endpoints (if applicable)
Next Steps
Module 6: Monitoring. Learn to monitor models in production with metrics and alerts.