Module 5 Practice

Overview

This module includes two homework assignments focused on deploying ML models through various serving approaches:

H9: API and UI serving with FastAPI, Streamlit, and Gradio
H10: Inference servers with Seldon, KServe, Triton, Ray, and vLLM

H9: API Serving

Learning Objectives

REST APIs

Build production-ready APIs with FastAPI

Web UIs

Create interactive interfaces with Streamlit/Gradio

Testing

Write comprehensive integration tests

Kubernetes

Deploy services to K8s with proper manifests

Reading List

Core Concepts

API Design

UI Frameworks

Deployment

Kubernetes Deployment Strategies

Tasks

PR1: Streamlit UI

Objective: Create an interactive web UI for your model

Requirements:

Single prediction interface with text input
Batch prediction with CSV upload
Unit tests for both interfaces
CI integration (pytest in GitHub Actions)

Reference implementation:

serving/ui_app.py

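import streamlit as st
from serving.predictor import Predictor

# cache_resource keeps one shared Predictor across reruns and sessions;
# cache_data is meant for serializable data, not model objects.
@st.cache_resource
def get_model():
    return Predictor.default_from_model_registry()

def single_pred():
    predictor = get_model()
    input_sent = st.text_input("Type an English sentence")
    if st.button("Run inference"):
        pred = predictor.predict([input_sent])
        st.write("Pred:", pred)

# Render the page when Streamlit runs this script
single_pred()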

Testing:

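from streamlit.testing.v1 import AppTest

def test_single_prediction():
    # Model loading can exceed the default script timeout, so raise it
    at = AppTest.from_file("serving/ui_app.py", default_timeout=30)
    at.run()
    at.text_input[0].set_value("test").run()
    at.button[0].click().run()
    # st.write renders string arguments as markdown elements
    assert "Pred:" in at.markdown[0].value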
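The batch-prediction requirement is easier to unit test when CSV parsing is kept apart from the Streamlit widgets. The following is a minimal sketch under stated assumptions: the uploaded CSV has a `text` column, the predictor exposes `predict(list_of_strings)`, and `predict_csv` and `EchoPredictor` are hypothetical names that do not appear in the reference implementation.

```python
import csv
import io


def predict_csv(csv_text, predictor, column="text"):
    """Run batch inference over one column of a CSV document.

    The column name and this helper are illustrative assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"CSV must contain a '{column}' column")
    rows = [row[column] for row in reader]
    if not rows:
        return []
    preds = predictor.predict(rows)
    # Pair each input with its prediction so the UI can show them side by side
    return list(zip(rows, preds))


class EchoPredictor:
    """Hypothetical stand-in for the real Predictor: returns text lengths."""

    def predict(self, texts):
        return [len(t) for t in texts]


if __name__ == "__main__":
    sample = "text\nhello\nhi there\n"
    for text, pred in predict_csv(sample, EchoPredictor()):
        print(text, pred)
```

In the Streamlit app, the bytes returned by `st.file_uploader` could be decoded with `.getvalue().decode("utf-8")` and passed to a helper like this one, which keeps the widget code thin.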

PR2: Gradio UI

Objective: Build an alternative UI with Gradio

Requirements:

Similar functionality to Streamlit
Component-based interface
Tests with gr.Interface.test_launch()
CI integration

Example:

import gradio as gr
from serving.predictor import Predictor

predictor = Predictor.default_from_model_registry()

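def predict(text):
    # gr.Label expects a {label: confidence} dict; class indices stand in for label names
    probs = predictor.predict([text])[0].tolist()
    return {str(i): float(p) for i, p in enumerate(probs)}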

interface = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Input text"),
    outputs=gr.Label(label="Predictions")
)

if __name__ == "__main__":
    interface.launch()

PR3: FastAPI Server

Objective: Implement a production-ready REST API

Requirements:

Pydantic models for validation
/health_check endpoint
/predict endpoint with batch support
Comprehensive tests with TestClient
CI integration

Reference:

serving/fast_api.py

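from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from serving.predictor import Predictor

class Payload(BaseModel):
    text: List[str]

app = FastAPI()
# Load the model once at import time so every request reuses it
predictor = Predictor.default_from_model_registry()

@app.get("/health_check")
def health_check() -> str:
    return "ok"

@app.post("/predict")
def predict(payload: Payload):
    prediction = predictor.predict(text=payload.text)
    return {"probs": prediction.tolist()}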

Testing:

tests/test_fast_api.py

from fastapi.testclient import TestClient

from serving.fast_api import app

client = TestClient(app)

def test_predict():
    response = client.post("/predict", json={"text": ["test"]})
    assert response.status_code == 200
    assert len(response.json()["probs"][0]) == 2

PR4: API Kubernetes Deployment

Objective: Deploy FastAPI to Kubernetes

Requirements:

Deployment manifest with 2+ replicas
Service manifest (ClusterIP)
ConfigMaps for configuration
Secrets for API keys (W&B)
Resource limits/requests

Example:

k8s/app-fastapi.yaml

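apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fastapi
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: app-fastapi
  template:
    metadata:
      labels:
        app: app-fastapi
    spec:
      containers:
      - name: app-fastapi
        image: your-registry/app-fastapi:latest
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"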
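Resource requests must not exceed their limits, and the quantity suffixes (`m`, `Mi`, `Gi`) are easy to misread. The sketch below converts the quantities from the manifest above into comparable numbers and checks that each request fits within its limit. It is a simplified illustration covering only a subset of the Kubernetes quantity format, and all function names are hypothetical.

```python
from decimal import Decimal

# Subset of the binary and decimal suffixes used by Kubernetes quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as '500m' or '1' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity):
    """Convert a memory quantity such as '512Mi' or '1Gi' to bytes."""
    # Check two-letter suffixes first so 'Mi' is not read as 'M'.
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(Decimal(quantity))


def requests_within_limits(resources):
    """Return True when every request is no larger than its limit."""
    req, lim = resources["requests"], resources["limits"]
    return (
        cpu_to_millicores(req["cpu"]) <= cpu_to_millicores(lim["cpu"])
        and memory_to_bytes(req["memory"]) <= memory_to_bytes(lim["memory"])
    )


if __name__ == "__main__":
    resources = {
        "requests": {"memory": "512Mi", "cpu": "500m"},
        "limits": {"memory": "1Gi", "cpu": "1000m"},
    }
    print(requests_within_limits(resources))
```

A check like this could run in CI against the parsed manifest, so that a typo such as `1G` versus `1Gi` is caught before deployment.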

PR5: UI Kubernetes Deployment

Objective: Deploy Streamlit/Gradio to Kubernetes

Requirements:

Deployment manifest (single replica for session state)
Service manifest
Ingress configuration (optional)
Health checks

Example:

k8s/app-streamlit.yaml

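apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-streamlit
spec:
  replicas: 1  # Single replica for session state
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: app-streamlit
  template:
    metadata:
      labels:
        app: app-streamlit
    spec:
      containers:
      - name: app-streamlit
        image: your-registry/app-streamlit:latest
        livenessProbe:
          httpGet:
            path: /_stcore/health
            # Must match the port Streamlit listens on (its default is 8501)
            port: 8080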
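The liveness probe above has a client-side counterpart: a deployment script or smoke test often needs to wait until the health endpoint answers before sending traffic. The sketch below polls a URL with the standard library only. The function names, retry counts, and delays are illustrative assumptions, and the probe is injectable so the retry logic can be tested without a running service.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy
        return False


def wait_until_healthy(url, probe=http_probe, attempts=10, delay_s=1.0):
    """Poll `probe(url)` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

With a port-forward in place, a call such as `wait_until_healthy("http://localhost:8080/_stcore/health")` would block until the Streamlit pod reports healthy or the attempts are exhausted.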

Google Doc Update

Objective: Document the model serving plan

Include:

API design decisions (endpoints, formats)
UI/UX considerations
Deployment architecture
Scaling strategy
Monitoring plan
Tradeoffs between serving options

Success Criteria

5 PRs merged with passing CI
All tests pass (pytest, API tests, UI tests)
Deployments run successfully on K8s
Google doc includes serving architecture

H10: Inference Servers

Learning Objectives

Production Serving

Deploy with Seldon, KServe, and Triton

Performance

Optimize throughput with batching and GPUs

LLM Serving

Serve LLMs with vLLM and LoRA adapters

Comparison

Evaluate tradeoffs between solutions

Reading List

Inference Servers

Cloud Platforms

LLM Serving

Edge Deployment

Machine Learning Systems with TinyML

Tasks

PR1: Seldon API Deployment

Objective: Deploy the model with Seldon Core

Requirements:

Implement Seldon protocol wrapper
Create SeldonDeployment manifest
Write integration tests
Document comparison with vanilla K8s deployment

Example:

serving/seldon_api.py

from serving.predictor import Predictor

class SeldonModel:
    def __init__(self):
        self.predictor = Predictor.default_from_model_registry()

    def predict(self, X, features_names=None):
        # X is a numpy array or list
        predictions = self.predictor.predict(X)
        return predictions

PR2: KServe API Integration

Objective: Deploy with KServe InferenceService

Requirements:

Implement KServe Model class
Create InferenceService manifest
Test V1/V2 inference protocol
Configure autoscaling

Reference: See KServe documentation
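When testing both protocols, the main difference a client sees is the URL path and the request body shape. The sketch below builds both payloads for a list of input strings. The helper names, the model name, and the V2 input tensor name `text` are assumptions for illustration; the tensor name in particular has to match whatever your model server expects.

```python
import json


def v1_request(model_name, texts):
    """Build the path and JSON body for a V1 protocol predict call."""
    path = f"/v1/models/{model_name}:predict"
    body = {"instances": list(texts)}
    return path, body


def v2_request(model_name, texts, input_name="text"):
    """Build the path and JSON body for a V2 (Open Inference Protocol) call.

    `input_name` is an assumption; it must match the name the model expects.
    """
    path = f"/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(texts)],
                "datatype": "BYTES",  # strings are sent as BYTES tensors
                "data": list(texts),
            }
        ]
    }
    return path, body


if __name__ == "__main__":
    for path, body in (v1_request("my-model", ["test"]), v2_request("my-model", ["test"])):
        print(path, json.dumps(body))
```

Keeping the payload builders separate from the HTTP call lets the same integration test run against either protocol by swapping one function.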

PR3: Triton Inference Server

Objective: Deploy with NVIDIA Triton

Requirements:

Implement PyTriton wrapper
Configure dynamic batching
Create model configuration
Write client tests
Measure throughput improvements

Reference: See Triton documentation
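To build intuition for what dynamic batching does before configuring it, the idea can be sketched in plain `asyncio`: concurrent requests wait briefly in a queue, and the worker runs the model once per group. This is a conceptual illustration only, not Triton's or PyTriton's implementation. `MicroBatcher`, `fake_model`, and the parameter values are hypothetical.

```python
import asyncio


class MicroBatcher:
    """Group concurrent requests into batches before calling the model."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one input and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: collect a batch, run it once, deliver the results."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Keep collecting until the batch is full or the wait budget is spent
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            items = [item for item, _ in batch]
            try:
                results = self.batch_fn(items)
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)


async def demo():
    batch_sizes = []

    def fake_model(texts):
        # Stand-in for a real model: records the batch size, returns lengths
        batch_sizes.append(len(texts))
        return [len(t) for t in texts]

    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit("x" * n) for n in range(1, 7)))
    worker.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs here, maximum batch size and maximum wait, mirror the trade-off that a dynamic batcher exposes: a longer wait yields larger batches and better throughput at the cost of added latency per request. A real model call would also need to run outside the event loop (for example in an executor) so it does not block request intake.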

PR4: Ray Deployment

Objective: Deploy with Ray Serve

Requirements:

Create Ray Serve deployment
Configure replicas and resources
Implement model batching
Test auto-scaling behavior

Example:

from ray import serve

from serving.predictor import Predictor

@serve.deployment(num_replicas=2)
class ModelDeployment:
    def __init__(self):
        self.predictor = Predictor.default_from_model_registry()

    async def __call__(self, request):
        # Expects a JSON body of the form {"instances": [...]}
        payload = await request.json()
        predictions = self.predictor.predict(payload["instances"])
        return {"predictions": predictions.tolist()}

PR5: LLM Deployment with vLLM (Optional)

Objective: Serve LLMs with vLLM and LoRA adapters

Requirements:

Deploy vLLM server with base model
Implement adapter loading client
Create K8s manifest with GPU support
Document adapter management workflow

Reference: See vLLM documentation
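An adapter loading client mostly comes down to a few JSON requests against the vLLM OpenAI-compatible server. The sketch below only builds those requests. It assumes the runtime adapter endpoints (`/v1/load_lora_adapter` and `/v1/unload_lora_adapter`) are enabled on the server, which in recent vLLM versions requires starting it with LoRA support and the `VLLM_ALLOW_RUNTIME_LORA_UPDATING` environment variable set. Verify paths and field names against the documentation for your vLLM version; the helper names and example values are hypothetical.

```python
import json


def load_adapter_request(name, path):
    """Path and body for loading a LoRA adapter at runtime."""
    return "/v1/load_lora_adapter", {"lora_name": name, "lora_path": path}


def unload_adapter_request(name):
    """Path and body for unloading a previously loaded adapter."""
    return "/v1/unload_lora_adapter", {"lora_name": name}


def completion_request(adapter_name, prompt, max_tokens=64):
    """Path and body for a completion routed to a loaded adapter.

    The adapter is selected by passing its name in the `model` field.
    """
    body = {"model": adapter_name, "prompt": prompt, "max_tokens": max_tokens}
    return "/v1/completions", body


if __name__ == "__main__":
    # Hypothetical adapter name and path, for illustration only
    for path, body in (
        load_adapter_request("my-adapter", "/adapters/my-adapter"),
        completion_request("my-adapter", "Hello"),
        unload_adapter_request("my-adapter"),
    ):
        print("POST", path, json.dumps(body))
```

Documenting the workflow as load, query, unload in this order also gives a natural outline for the adapter management section of the write-up.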

PR6: Modal Deployment (Optional)

Objective: Deploy an LLM on the Modal serverless platform

Requirements:

Create Modal app definition
Configure GPU resources
Implement API endpoint
Compare cost vs K8s deployment

Example:

import modal

# Newer Modal releases use modal.App; modal.Stub is the older, deprecated name
app = modal.App("llm-inference")

@app.function(
    gpu="A10G",
    image=modal.Image.debian_slim().pip_install("vllm")
)
def generate(prompt: str) -> str:
    from vllm import LLM
    llm = LLM("microsoft/Phi-3-mini-4k-instruct")
    outputs = llm.generate([prompt])
    return outputs[0].outputs[0].text

Google Doc: Comparison Analysis

Objective: Compare serving solutions and justify your choice

Include:

Feature comparison table
Performance benchmarks (latency, throughput)
Cost analysis (infrastructure, maintenance)
Operational complexity
Scaling characteristics
Final recommendation with justification

Comparison dimensions:

Setup complexity
Performance (GPU utilization, latency)
Scalability (autoscaling, multi-model)
Monitoring and observability
Ecosystem and community support

Success Criteria

6 PRs merged (4 required + 2 optional)
All inference servers deploy successfully
Tests pass for each implementation
Google doc includes comprehensive comparison
Final serving solution chosen with justification

Testing Checklist

API Testing

tests/test_endpoints.py

from fastapi.testclient import TestClient

from serving.fast_api import app

client = TestClient(app)

def test_health_check():
    """Verify service is running"""
    response = client.get("/health_check")
    assert response.status_code == 200

def test_predict_single():
    """Test single prediction"""
    response = client.post("/predict", json={"text": ["test"]})
    assert response.status_code == 200
    assert "probs" in response.json()

def test_predict_batch():
    """Test batch prediction"""
    response = client.post("/predict", json={"text": ["test1", "test2"]})
    assert len(response.json()["probs"]) == 2

def test_invalid_input():
    """Test error handling"""
    response = client.post("/predict", json={"invalid": "data"})
    assert response.status_code == 422

Kubernetes Testing

# Deployment health
kubectl get deployments
kubectl describe deployment app-fastapi

# Pod status
kubectl get pods -l app=app-fastapi
kubectl logs -l app=app-fastapi

# Service connectivity
kubectl get services
kubectl port-forward svc/app-fastapi 8080:8080
curl http://localhost:8080/health_check

# Resource usage
kubectl top pods -l app=app-fastapi

Performance Testing

import statistics
import time

import requests

def benchmark_latency(endpoint: str, n_requests: int = 100):
    latencies = []
    for _ in range(n_requests):
        # perf_counter is monotonic and better suited to timing than time.time
        start = time.perf_counter()
        response = requests.post(endpoint, json={"text": ["test"]})
        latencies.append(time.perf_counter() - start)
        # Fail fast so error responses do not skew the latency numbers
        response.raise_for_status()

    print(f"Mean latency: {statistics.mean(latencies):.3f}s")
    print(f"P95 latency: {statistics.quantiles(latencies, n=20)[18]:.3f}s")
    print(f"P99 latency: {statistics.quantiles(latencies, n=100)[98]:.3f}s")
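Latency measured one request at a time says little about throughput, which depends on how the server handles concurrent load. A minimal sketch of a throughput measurement follows, using a thread pool to issue requests in parallel. The request function is passed in as a callable so the same harness works for any endpoint; `benchmark_throughput` and the sleep-based stand-in are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark_throughput(send_request, n_requests=100, concurrency=8):
    """Issue `n_requests` calls from `concurrency` threads; return requests/second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() waits for every call and re-raises any worker exception
        list(pool.map(lambda _: send_request(), range(n_requests)))
    elapsed = time.perf_counter() - start
    return n_requests / elapsed


if __name__ == "__main__":
    # Stand-in for an HTTP call: sleep 10 ms to mimic server latency
    def fake_request():
        time.sleep(0.01)

    print(f"sequential: {benchmark_throughput(fake_request, 50, 1):.0f} req/s")
    print(f"concurrent: {benchmark_throughput(fake_request, 50, 8):.0f} req/s")
```

To benchmark a real service, `send_request` would wrap the `requests.post` call used in the latency benchmark. Running the measurement at several concurrency levels, with and without server-side batching, gives the numbers needed for the throughput comparison.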

Common Issues

Model loading fails

Symptoms: Container crashes on startup

Solutions:

Check W&B credentials: kubectl get secret wandb -o yaml
Verify model path: kubectl exec <pod> -- ls /tmp/model
Increase memory limits in deployment
Check logs: kubectl logs <pod>

Predictions are slow

Symptoms: High latency (>1s for small inputs)

Solutions:

Enable batching in inference server
Add GPU resources to deployment
Use model quantization (INT8)
Implement model caching
Check CPU/memory throttling
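One reading of "model caching" is making sure the model is loaded once per process instead of once per request, which is a common cause of slow predictions. A minimal sketch with `functools.lru_cache` follows. The loader is a hypothetical stand-in for the real registry download, and `LOAD_CALLS` exists only to make the caching observable.

```python
import functools

# Records each real load so the caching effect can be observed
LOAD_CALLS = []


def _load_model_from_registry():
    """Hypothetical slow loader standing in for the real registry download."""
    LOAD_CALLS.append("load")
    return {"name": "demo-model"}


@functools.lru_cache(maxsize=1)
def get_predictor():
    """Load the model on first use and reuse the same instance afterwards."""
    return _load_model_from_registry()


if __name__ == "__main__":
    first = get_predictor()
    second = get_predictor()
    print(first is second, len(LOAD_CALLS))
```

Request handlers would call `get_predictor()` instead of constructing the model themselves, so only the first request in each process pays the loading cost.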

Port forwarding fails

Symptoms: Cannot connect to service

Solutions:

Verify service exists: kubectl get svc
Check pod is running: kubectl get pods
Use correct service port: Check manifest
Try different local port: kubectl port-forward svc/app 8081:8080

Submission Guidelines

Code Quality

All tests pass locally and in CI
Code follows project style (ruff format)
No secrets committed to repository
Dockerfiles build successfully

Documentation

README explains how to run each service
Kubernetes manifests have descriptive comments
Google doc includes architecture diagrams
API endpoints documented with examples

Pull Requests

Title format: [module-5] <description>
PR description explains changes
Screenshots of running services
Links to deployed endpoints (if applicable)

Resources

Documentation

Examples

Next Steps

Module 6: Monitoring

Learn to monitor models in production with metrics and alerts
