FastAPI Model Serving

Overview

FastAPI is a Python web framework well suited to serving ML models: request and response bodies are validated against type-annotated Pydantic models, and OpenAPI documentation is generated automatically from the same definitions.

Implementation

API Structure

The FastAPI server (serving/fast_api.py) implements two endpoints, GET /health_check and POST /predict:

serving/fast_api.py

from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from serving.predictor import Predictor


class Payload(BaseModel):
    # Batch of input texts to classify
    text: List[str]


class Prediction(BaseModel):
    # One probability distribution per input text
    probs: List[List[float]]


app = FastAPI()
# Load the model once at import time so every request reuses it
predictor = Predictor.default_from_model_registry()


@app.get("/health_check")
def health_check() -> str:
    return "ok"


@app.post("/predict", response_model=Prediction)
def predict(payload: Payload) -> Prediction:
    # The predictor output exposes .tolist(); convert it to nested lists for JSON
    prediction = predictor.predict(text=payload.text)
    return Prediction(probs=prediction.tolist())

Request/Response Models

Example request body:

{
  "text": ["good", "bad"]
}

Payload schema:

text: List of strings to classify
Validated by Pydantic at runtime
Automatic error messages for invalid input

Prediction schema:

probs: List of probability distributions
Each inner list sums to 1.0
Length matches number of input texts
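
The two Prediction invariants above are easy to assert in client code or in tests. The following is a minimal sketch using only the standard library; `check_prediction` is a hypothetical helper introduced here for illustration and is not part of the serving code.

```python
import math
from typing import List


def check_prediction(texts: List[str], probs: List[List[float]], tol: float = 1e-6) -> None:
    """Raise ValueError if `probs` violates the documented Prediction schema."""
    # One probability distribution per input text.
    if len(probs) != len(texts):
        raise ValueError(f"expected {len(texts)} rows, got {len(probs)}")
    for i, row in enumerate(probs):
        # Every entry must be a valid probability.
        if any(p < 0.0 or p > 1.0 for p in row):
            raise ValueError(f"row {i} has values outside [0, 1]")
        # Each distribution should sum to 1.0 up to floating-point error.
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            raise ValueError(f"row {i} sums to {sum(row)}, expected 1.0")


# Passes silently for a well-formed response.
check_prediction(["good", "bad"], [[0.05, 0.95], [0.92, 0.08]])
```

The tolerance matters because float sums are rarely exactly 1.0, which is also why the test suite compares against pytest.approx(1.0).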

API Endpoints

Health Check

GET /health_check

Purpose: Kubernetes liveness/readiness probes

Response:

"ok"

Usage:

curl http://localhost:8080/health_check

Predict

POST /predict

Purpose: Classify text sequences

Request body:

{
  "text": ["This is great!", "This is terrible."]
}

Response:

{
  "probs": [
    [0.05, 0.95],
    [0.92, 0.08]
  ]
}

Error handling:

422 Unprocessable Entity: Invalid input format
500 Internal Server Error: Model prediction failure
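
A client can map these status codes to distinct exceptions. The sketch below uses only the standard library; the function names `build_predict_request` and `predict`, the default timeout, and the choice of exception types are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request
from typing import List


def build_predict_request(base_url: str, texts: List[str]) -> urllib.request.Request:
    """Build the POST /predict request with a JSON body matching Payload."""
    body = json.dumps({"text": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, texts: List[str], timeout: float = 10.0) -> List[List[float]]:
    """Call /predict and return `probs`, turning HTTP errors into exceptions."""
    request = build_predict_request(base_url, texts)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())["probs"]
    except urllib.error.HTTPError as err:
        detail = err.read().decode("utf-8", errors="replace")
        if err.code == 422:
            # The request body did not match the Payload schema.
            raise ValueError(f"invalid input: {detail}") from err
        # 500 and any other error status: treat as a server-side failure.
        raise RuntimeError(f"server error {err.code}: {detail}") from err
```

With the server running locally, `predict("http://localhost:8080", ["This is great!"])` would return the nested list from the `probs` field.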

Testing

Tests use FastAPI’s TestClient for integration testing:

tests/test_fast_api.py

import pytest
from fastapi.testclient import TestClient
from serving.fast_api import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health_check")
    assert response.status_code == 200
    assert response.json() == "ok"

def test_predict():
    response = client.post("/predict", json={"text": ["this is test"]})
    assert response.status_code == 200
    probs = response.json()["probs"][0]
    assert len(probs) == 2
    assert sum(probs) == pytest.approx(1.0)

Test coverage:

Health check endpoint returns 200 and "ok"
Prediction endpoint returns 200 for a valid payload
Returned probabilities have two classes and sum to 1.0

Run tests:

pytest -s ./tests

Local Development

Using Make

# Build and run
make run_fast_api

This:

Builds the Docker image with the app-fastapi target
Runs the container on port 8081
Passes the W&B API key from the environment

Using Docker Directly

# Build
docker build -f Dockerfile -t app-fastapi:latest --target app-fastapi .

# Run
docker run -it -p 8081:8080 \
  -e WANDB_API_KEY=${WANDB_API_KEY} \
  app-fastapi:latest

Manual Testing

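# Test with sample data (the docker run command above maps host port 8081 to container port 8080)
curl -X POST -H "Content-Type: application/json" \
  -d @data-samples/samples.json \
  http://localhost:8081/predict

# Example output (values are illustrative)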
{
  "probs": [
    [0.23, 0.77],
    [0.89, 0.11]
  ]
}

Kubernetes Deployment

Manifest Structure

k8s/app-fastapi.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fastapi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-fastapi
  template:
    metadata:
      labels:
        app: app-fastapi
    spec:
      containers:
        - name: app-fastapi
          image: ghcr.io/kyryl-opens-ml/app-fastapi:latest
          env:
          - name: WANDB_API_KEY
            valueFrom:
              secretKeyRef:
                name: wandb
                key: WANDB_API_KEY
---
apiVersion: v1
kind: Service
metadata:
  name: app-fastapi
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: app-fastapi

Key configuration:

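Replicas: 2 pods for high availability
Image: Pulled from GitHub Container Registry (ghcr.io)
Secrets: W&B API key injected as an environment variable from the wandb Kubernetes secret
Service: ClusterIP (the default type when none is specified) exposes port 8080 inside the cluster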

Deployment Steps

Create cluster

kind create cluster --name ml-in-production

Create secrets

export WANDB_API_KEY='your-key-here'
kubectl create secret generic wandb \
  --from-literal=WANDB_API_KEY=$WANDB_API_KEY

Deploy application

kubectl create -f k8s/app-fastapi.yaml

Verify deployment

kubectl get pods -l app=app-fastapi
kubectl logs -l app=app-fastapi

Port forward

kubectl port-forward --address 0.0.0.0 svc/app-fastapi 8080:8080

Testing in Kubernetes

# Health check
curl http://localhost:8080/health_check

# Prediction
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": ["test input"]}' \
  http://localhost:8080/predict
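
Pods can take a while to become ready because the model loads on startup, so a scripted smoke test usually needs to wait before sending traffic. The following is a minimal polling sketch with the standard library; `is_healthy`, `wait_until_healthy`, and the attempt and delay defaults are assumptions for illustration.

```python
import time
import urllib.request
from typing import Callable


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET /health_check answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health_check", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: refused, timed out, or non-2xx.
        return False


def wait_until_healthy(check: Callable[[], bool], attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll `check` until it returns True or `attempts` is exhausted."""
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example, with the port-forward from the previous step active:
# wait_until_healthy(lambda: is_healthy("http://localhost:8080"))
```

Passing the check as a callable keeps the retry loop independent of HTTP, which makes it straightforward to unit test with a stub.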

Production Considerations

Performance

The model loads on startup. For faster cold starts, consider:

Model caching in persistent volumes
Init containers for model download
Warm-up requests after deployment
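
As an illustration of the warm-up idea, the sketch below runs a few throwaway predictions and records how long each takes. `warm_up` is a hypothetical helper, and the number of rounds is an arbitrary choice.

```python
import time
from typing import Callable, List, Sequence


def warm_up(
    predict_fn: Callable[[List[str]], object],
    samples: Sequence[str],
    rounds: int = 3,
) -> List[float]:
    """Run a few throwaway predictions and return each round's latency in seconds."""
    latencies = []
    for _ in range(rounds):
        start = time.perf_counter()
        # The result is discarded; the point is to trigger lazy initialization.
        predict_fn(list(samples))
        latencies.append(time.perf_counter() - start)
    return latencies
```

In the serving module this could be invoked once after the predictor is created, for example `warm_up(lambda texts: predictor.predict(text=texts), ["warm-up sentence"])`. The first latency is typically the largest, and comparing it with later rounds gives a rough measure of the cold-start cost.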

Optimization strategies:

Use uvicorn workers for concurrency
Enable model batching for throughput
Add Redis for response caching
Implement request queuing
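
One simple piece of the batching strategy is capping how many texts reach the model in a single call. The sketch below splits a large request into fixed-size chunks; it is not dynamic batching across concurrent requests, and `chunked`, `predict_in_batches`, and the default batch size are assumptions for illustration.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(
    predict_fn: Callable[[List[str]], List[List[float]]],
    texts: List[str],
    batch_size: int = 32,
) -> List[List[float]]:
    """Run `predict_fn` per chunk and concatenate results in input order."""
    probs: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        probs.extend(predict_fn(batch))
    return probs
```

Bounding the batch size keeps peak memory predictable when a client sends hundreds of texts in one payload, at the cost of several model calls per request.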

Monitoring

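Add basic observability with an HTTP middleware that reports per-request processing time in a response header:

import time

from fastapi import Request


@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    # perf_counter is monotonic, so the measurement is unaffected by clock adjustments
    start_time = time.perf_counter()
    response = await call_next(request)
    process_time = time.perf_counter() - start_time
    # Expose the elapsed time in seconds to clients and proxies
    response.headers["X-Process-Time"] = str(process_time)
    return response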

Metrics to track:

Request latency (p50, p95, p99)
Throughput (requests/second)
Error rate (4xx, 5xx)
Model inference time
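
To make these metrics concrete, the sketch below computes them from raw samples collected over a time window. It uses the nearest-rank percentile definition; in production a metrics library would normally do this, and `percentile` and `summarize` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_s: List[float], status_codes: List[int], window_s: float) -> Dict[str, float]:
    """Summarize one observation window of request latencies and status codes."""
    ordered = sorted(latencies_s)
    # Count both client (4xx) and server (5xx) errors.
    errors = sum(1 for code in status_codes if code >= 400)
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "throughput_rps": len(status_codes) / window_s,
        "error_rate": errors / len(status_codes),
    }
```

Tail percentiles (p95, p99) are tracked alongside the median because a small share of slow requests can dominate user-perceived latency while leaving the average almost unchanged.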

Error Handling

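Wrap the prediction call so that model failures return an explicit 500 response. This version replaces the /predict handler shown in API Structure:

from fastapi import HTTPException


@app.post("/predict", response_model=Prediction)
def predict(payload: Payload) -> Prediction:
    try:
        prediction = predictor.predict(text=payload.text)
        return Prediction(probs=prediction.tolist())
    except Exception as e:
        # Note: echoing the exception text can expose internal details to clients
        raise HTTPException(
            status_code=500,
            detail=f"Prediction failed: {e}",
        ) from e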
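
Because the handler above includes the exception text in the response, internal details such as file paths can reach clients. One possible alternative, sketched here under the assumption that server logs are available for debugging, is to log the full exception and return only a generic message with a correlation id. `error_detail` and the logger name are hypothetical.

```python
import logging
import uuid
from typing import Dict

logger = logging.getLogger("serving")


def error_detail(exc: Exception) -> Dict[str, str]:
    """Log the full exception server-side and return a client-safe payload."""
    error_id = uuid.uuid4().hex
    # The message and stack trace stay in the logs, keyed by error_id.
    logger.error("prediction failed [%s]", error_id, exc_info=exc)
    return {"message": "Prediction failed", "error_id": error_id}
```

The returned dictionary could be passed as the `detail` argument of HTTPException, so a client can quote the `error_id` when reporting a problem.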

API Documentation

FastAPI automatically generates docs:

Swagger UI: http://localhost:8080/docs
ReDoc: http://localhost:8080/redoc
OpenAPI spec: http://localhost:8080/openapi.json

Best Practices

Validation

Use Pydantic models for all inputs/outputs

Versioning

Version APIs with path prefixes (/v1/predict)

Rate Limiting

Add slowapi for request throttling
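
To show what request throttling does conceptually, here is a small token-bucket sketch in plain Python. It is not slowapi's API; the class name, parameters, and injectable clock are assumptions for illustration, and the class is not thread-safe as written.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server would typically keep one bucket per client key (IP address or API key) and answer with HTTP 429 when `allow()` returns False.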

Authentication

Implement API keys or OAuth for security
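
A minimal sketch of the API-key check itself, independent of any framework. The header name `X-API-Key` and the function `is_authorized` are assumptions; in a real service the expected key would come from a secret store or environment variable rather than source code.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_key: str, header_name: str = "X-API-Key") -> bool:
    """Return True only if the request carries the expected API key."""
    # A plain dict lookup is case-sensitive; real HTTP header lookups are not.
    supplied: Optional[str] = headers.get(header_name)
    if not supplied or not expected_key:
        return False
    # compare_digest avoids leaking, through timing, where the strings differ.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```

Rejecting an empty expected key guards against a misconfigured deployment in which the secret is unset and every request would otherwise match.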

Comparison with Alternatives

Feature	FastAPI	Flask	Django
Performance	Excellent	Good	Moderate
Type Safety	Yes	No	Partial
Auto Docs	Yes	No	Partial
Async Support	Yes	Limited	Yes
Learning Curve	Low	Very Low	High

Next Steps

Streamlit UI

Build interactive web interfaces with Streamlit
