Overview

This module includes two practice sections with multiple deliverables. Complete all tasks to demonstrate mastery of data management for ML production systems.

H3: Data Storage & Processing

Learning Goals

  • Deploy MinIO with multiple configuration options
  • Implement and test Python storage clients
  • Benchmark data format performance
  • Optimize inference with parallel processing
  • Create streaming datasets
  • Build vector databases for RAG

Reading List

Tasks

1

PR1: MinIO Deployment

Write comprehensive README instructions for deploying MinIO.
Requirements:
  • Local installation steps
  • Docker deployment with docker run
  • Kubernetes deployment with manifests
  • Port forwarding instructions
  • Access credentials and UI setup
Deliverable: minio_storage/README.md
Reference: See Storage documentation
2

PR2: MinIO Python Client

Develop a CRUD client with comprehensive tests.
Requirements:
  • Implement both native MinIO and S3FS clients
  • Create, read, update, delete operations
  • Environment-based configuration
  • Pytest fixtures for testing
  • Test upload/download functionality
Files:
  • minio_storage/minio_client.py
  • minio_storage/test_minio_client.py
Run tests:
pytest -ss ./minio_storage/test_minio_client.py
Reference: See Python Client implementation
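The "environment-based configuration" requirement can be sketched as a small settings loader. This is a minimal illustration using only the standard library, not the reference implementation: the variable names (MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, MINIO_BUCKET, MINIO_SECURE) and the default bucket are assumptions, so rename them to match your deployment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageConfig:
    """Connection settings for an S3-compatible store such as MinIO."""

    endpoint: str
    access_key: str
    secret_key: str
    bucket: str
    secure: bool = False


def load_config(env=None) -> StorageConfig:
    """Build a StorageConfig from environment variables.

    All variable names here are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    required = ("MINIO_ACCESS_KEY", "MINIO_SECRET_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail early with a clear message instead of at the first request.
        raise KeyError(f"missing required environment variables: {missing}")
    return StorageConfig(
        endpoint=env.get("MINIO_ENDPOINT", "localhost:9000"),
        access_key=env["MINIO_ACCESS_KEY"],
        secret_key=env["MINIO_SECRET_KEY"],
        bucket=env.get("MINIO_BUCKET", "ml-data"),
        secure=env.get("MINIO_SECURE", "false").lower() == "true",
    )


# Passing a plain dict keeps the example independent of the real environment.
config = load_config({"MINIO_ACCESS_KEY": "demo", "MINIO_SECRET_KEY": "demo-secret"})
print(config.endpoint, config.bucket, config.secure)
```

Accepting an optional mapping instead of reading os.environ directly makes the loader easy to exercise from a pytest fixture without mutating global state.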
3

PR3: Data Format Benchmarks

Benchmark Pandas storage formats.
Requirements:
  • Test CSV, Parquet, Feather, HDF5
  • Measure save time, load time, file size
  • Create visualization of results
  • Document findings in README
  • Recommend format for different use cases
Metrics to measure:
  • Write time (seconds)
  • Read time (seconds)
  • File size (MB)
  • Compression ratio
Deliverable: processing/format_benchmark.py
Reference: See Format Comparison
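The four metrics above can be collected with a small format-agnostic harness. The sketch below is an assumption about how such a script might be organized, not the contents of format_benchmark.py. To stay runnable with the standard library alone, it compares plain CSV against gzip-compressed CSV as stand-ins; for the assignment you would register writer/reader pairs for Parquet, Feather, and HDF5 instead.

```python
import csv
import gzip
import os
import tempfile
import time


def write_csv(rows, path):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def read_csv(path):
    with open(path, newline="") as f:
        return list(csv.reader(f))


def write_csv_gz(rows, path):
    with gzip.open(path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)


def read_csv_gz(path):
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.reader(f))


def benchmark(name, writer, reader, rows, workdir):
    """Time one write and one read, then record the file size on disk."""
    path = os.path.join(workdir, f"data.{name}")
    start = time.perf_counter()
    writer(rows, path)
    write_s = time.perf_counter() - start
    start = time.perf_counter()
    loaded = reader(path)
    read_s = time.perf_counter() - start
    assert len(loaded) == len(rows)  # sanity check: the round trip keeps every row
    size_mb = os.path.getsize(path) / 1e6
    return {"format": name, "write_s": write_s, "read_s": read_s, "size_mb": size_mb}


def run_benchmarks(rows):
    formats = {"csv": (write_csv, read_csv), "csv.gz": (write_csv_gz, read_csv_gz)}
    with tempfile.TemporaryDirectory() as workdir:
        results = [benchmark(n, w, r, rows, workdir) for n, (w, r) in formats.items()]
    baseline = results[0]["size_mb"]  # plain CSV is the reference size
    for result in results:
        result["compression_ratio"] = baseline / result["size_mb"]
    return results


rows = [[i, i * 0.5, f"label_{i % 10}"] for i in range(20_000)]
for result in run_benchmarks(rows):
    print(result)
```

A single run is noisy, so it is worth repeating each measurement several times and reporting the median before recommending a format.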
4

PR4: Inference Benchmarks

Benchmark parallel inference performance.
Requirements:
  • Single worker baseline
  • ThreadPoolExecutor implementation
  • ProcessPoolExecutor implementation
  • Ray distributed processing
  • Performance comparison table
Expected results table:
| Method  | Time (s) | Speedup |
|---------|----------|---------|
| Single  | X.XX     | 1.0x    |
| Thread  | X.XX     | Y.Yx    |
| Process | X.XX     | Y.Yx    |
| Ray     | X.XX     | Y.Yx    |
Run benchmarks:
python processing/inference_example.py run-single-worker --inference-size 10000000
python processing/inference_example.py run-pool --inference-size 10000000
python processing/inference_example.py run-ray --inference-size 10000000
Reference: See Inference Performance
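The baseline-versus-pool comparison can be sketched with concurrent.futures. This is an illustrative harness, not the contents of inference_example.py: predict is a hypothetical stand-in that sleeps to mimic a model call, and only the thread pool is exercised here. ProcessPoolExecutor can be passed as executor_cls in the same way, though on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def predict(batch):
    """Hypothetical stand-in for model inference; sleeps to mimic real work."""
    time.sleep(0.01)
    return [x * 2 for x in batch]


def make_batches(data, batch_size):
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


def run_single(batches):
    return [predict(batch) for batch in batches]


def run_pool(batches, executor_cls=ThreadPoolExecutor, max_workers=4):
    # Executor.map preserves input order, so results line up with run_single.
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(predict, batches))


def timed(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup_table(timings):
    """timings maps method name to seconds; the first entry is the baseline."""
    baseline = next(iter(timings.values()))
    lines = [f"{'Method':<10}{'Time (s)':>10}{'Speedup':>10}"]
    for name, seconds in timings.items():
        lines.append(f"{name:<10}{seconds:>10.2f}{baseline / seconds:>9.1f}x")
    return "\n".join(lines)


batches = make_batches(list(range(400)), batch_size=10)
single_out, single_s = timed(run_single, batches)
thread_out, thread_s = timed(run_pool, batches)
assert single_out == thread_out  # parallelism must not change the predictions
print(speedup_table({"Single": single_s, "Thread": thread_s}))
```

Threads help here only because the stand-in workload sleeps and so releases the GIL; for CPU-bound pure-Python inference, a process pool or Ray is usually the more meaningful comparison.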
5

PR5: Streaming Dataset (Optional)

Convert your dataset to streaming format.
Requirements:
  • Choose format (MDS, WebDataset, TFRecord)
  • Implement data writer
  • Upload to S3/MinIO
  • Create DataLoader for reading
  • Benchmark loading speed
Example:
python streaming-dataset/convert_data.py create \
  --input-path ./raw_data \
  --output-path ./streaming_data

aws s3 cp --recursive ./streaming_data s3://datasets/my-data

python streaming-dataset/convert_data.py test \
  --remote s3://datasets/my-data
Reference: See Streaming Datasets
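The writer and reader steps share one idea across MDS, WebDataset, and TFRecord: split the data into fixed-size shards and iterate over them lazily. The sketch below illustrates that idea with JSON Lines shards and a hypothetical index.json layout. It is a simplified stand-in for those formats, not an implementation of any of them, and it is not the contents of convert_data.py.

```python
import json
import os
import tempfile


def write_shards(samples, out_dir, samples_per_shard=1000):
    """Write samples as JSON Lines shards plus an index.json describing them."""
    os.makedirs(out_dir, exist_ok=True)
    shards = []
    for start in range(0, len(samples), samples_per_shard):
        chunk = samples[start:start + samples_per_shard]
        name = f"shard-{len(shards):05d}.jsonl"
        with open(os.path.join(out_dir, name), "w") as f:
            for sample in chunk:
                f.write(json.dumps(sample) + "\n")
        shards.append({"file": name, "num_samples": len(chunk)})
    with open(os.path.join(out_dir, "index.json"), "w") as f:
        json.dump({"shards": shards}, f)
    return shards


def stream_samples(data_dir):
    """Yield samples lazily, reading one shard at a time, line by line."""
    with open(os.path.join(data_dir, "index.json")) as f:
        index = json.load(f)
    for shard in index["shards"]:
        with open(os.path.join(data_dir, shard["file"])) as f:
            for line in f:
                yield json.loads(line)


samples = [{"id": i, "text": f"example {i}"} for i in range(2500)]
with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(samples, tmp)
    restored = list(stream_samples(tmp))
print(len(shards), len(restored))
```

Because the reader is a generator, it can be wrapped in an iterable-style dataset and timed to produce the loading-speed benchmark.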
6

PR6: Vector Database

Transform dataset to vector format and implement RAG.
Requirements:
  • Convert text data to embeddings
  • Create LanceDB/Chroma database
  • Implement ingestion pipeline
  • Build query interface
  • Benchmark query latency
CLI commands:
# Create database
python vector-db/my_rag.py create \
  --data-path ./data \
  --table-name my_vectors \
  --num-documents 1000

# Query database
python vector-db/my_rag.py query \
  --table-name my_vectors \
  --query "your search query" \
  --top-k 5
Reference: See Vector Databases
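Before adopting LanceDB or Chroma, it can help to see the ingest-and-query loop in its simplest form. The sketch below is a brute-force, in-memory illustration with hypothetical names (embed, TinyVectorStore). It uses token counts as a toy stand-in for real embeddings and does not use either library's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy sparse embedding: token counts. A real pipeline would call a model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Keeps (vector, document) pairs in memory and scans them all per query."""

    def __init__(self):
        self.rows = []

    def add(self, documents):
        self.rows.extend((embed(doc), doc) for doc in documents)

    def query(self, text, top_k=5):
        query_vec = embed(text)
        scored = [(cosine(query_vec, vec), doc) for vec, doc in self.rows]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


store = TinyVectorStore()
store.add([
    "MinIO is an S3 compatible object store",
    "Parquet is a columnar file format",
    "Ray distributes Python workloads across workers",
])
for score, doc in store.query("columnar file format", top_k=2):
    print(f"{score:.2f}  {doc}")
```

A linear scan like this also serves as a correctness baseline: a real vector database should return the same top results on a small sample while answering faster at scale.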
7

Google Doc: Data Section

Update your design document.
Required sections:
  1. Data Description
    • Dataset source and size
    • Features and labels
    • Data splits (train/val/test)
  2. Storage Strategy
    • Storage backend (S3/MinIO)
    • Data format choice (with justification)
    • Versioning approach (DVC)
  3. Processing Pipeline
    • Data loading strategy
    • Preprocessing steps
    • Performance optimizations
    • Streaming vs batch
  4. Infrastructure
    • Storage capacity needed
    • Compute requirements
    • Cost estimates
Template: Design Doc Template

Success Criteria

H4: Data Labeling & Validation

Learning Goals

  • Write effective labeling guidelines
  • Deploy annotation tools
  • Generate synthetic training data
  • Validate data quality
  • Version control datasets

Reading List

Tasks

1

Google Doc: Labeling Section

Add comprehensive labeling documentation:
1. Labeling Guidelines
  • Task definition and objectives
  • Label definitions with examples
  • Edge case handling
  • Quality check procedures
  • Decision flowchart
2. Cost & Time Estimation
  • Label 50 samples manually
  • Calculate time per sample
  • Estimate total time needed
  • Compute labeling budget
  • Include calculation methodology
3. Production Workflow
  • Data sampling strategy
  • Annotation tool setup
  • Quality assurance process
  • Active learning integration
  • Feedback collection
Example calculation:
Pilot: 50 samples in 30 minutes = 36 seconds/sample
Dataset: 10,000 samples
Time: 10,000 × 36s = 100 hours
Cost: 100 hours × $15/hour = $1,500
With QC (20%): $1,800
Reference: See Cost Estimation
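The calculation above can be captured in a small helper so the methodology is reproducible when the pilot numbers change. The function name and parameters are hypothetical; the inputs are the figures from the example.

```python
def estimate_labeling_cost(pilot_samples, pilot_minutes, dataset_size,
                           hourly_rate, qc_overhead=0.0):
    """Extrapolate labeling time and cost from a timed pilot run."""
    seconds_per_sample = pilot_minutes * 60 / pilot_samples
    total_hours = dataset_size * seconds_per_sample / 3600
    base_cost = total_hours * hourly_rate
    return {
        "seconds_per_sample": seconds_per_sample,
        "total_hours": total_hours,
        "base_cost": base_cost,
        # qc_overhead is a fraction of the base cost, e.g. 0.20 for 20%
        "cost_with_qc": base_cost * (1 + qc_overhead),
    }


estimate = estimate_labeling_cost(
    pilot_samples=50, pilot_minutes=30, dataset_size=10_000,
    hourly_rate=15, qc_overhead=0.20,
)
for key, value in estimate.items():
    print(f"{key}: {value:.2f}")
```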
2

PR1: DVC Dataset Versioning

Commit data with DVC.
Requirements:
  • Initialize DVC in repository
  • Add dataset files to DVC tracking
  • Configure MinIO/S3 as remote
  • Push data to remote storage
  • Document workflow in README
Commands:
# Initialize
dvc init --subdir

# Track data
dvc add ./data/dataset.csv
git add data/.gitignore data/dataset.csv.dvc

# Configure remote
dvc remote add -d storage s3://ml-data
dvc remote modify storage endpointurl $AWS_ENDPOINT_URL

# Push
dvc push
Reference: See Dataset Versioning
3

PR2: Labeling Tool Deployment

Deploy Argilla or Label Studio.
Requirements:
  • Choose tool (Argilla recommended)
  • Create deployment configuration
  • Docker/K8s deployment instructions
  • Access and authentication setup
  • Dataset creation example
Argilla Docker:
docker run -it --rm --name argilla -p 6900:6900 \
  argilla/argilla-quickstart:v2.0.0rc1
Files:
  • labeling/docker-compose.yml or labeling/k8s-manifest.yaml
  • labeling/README.md
  • labeling/create_dataset.py
Reference: See Argilla Deployment
4

PR3: Synthetic Dataset (Optional)

Generate synthetic data with GPT.
Requirements:
  • Design generation prompt
  • Implement retry logic
  • Validate generated samples
  • Upload to labeling tool
  • Compare with real data
Example:
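import json
from typing import Dict

from openai import OpenAI  # openai>=1.0 client interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_sample(prompt_template: str) -> Dict:
    # JSON mode requires the prompt itself to mention JSON output
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Generate 1000 samples; `template` is your generation prompt
samples = [generate_sample(template) for _ in range(1000)]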
Reference: See Synthetic Data Generation
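The retry and validation requirements can be sketched independently of any API client. The helper below is a minimal illustration under stated assumptions: with_retries and validate_sample are hypothetical names, the "text" and "label" fields and the allowed label values are placeholders for your own schema, and the flaky function simulates a generation call that fails twice before succeeding.

```python
import json
import random
import time


def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds, backing off exponentially with jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:  # in real code, catch only transient error types
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))


def validate_sample(raw, required_fields=("text", "label"),
                    allowed_labels=("positive", "negative")):
    """Parse a JSON string and return the sample, or None if it is unusable."""
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(sample, dict):
        return None
    if any(not sample.get(field) for field in required_fields):
        return None
    if sample["label"] not in allowed_labels:
        return None
    return sample


attempts = {"count": 0}


def flaky_generation():
    """Simulated generation call: fails twice, then returns a JSON string."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return '{"text": "great product", "label": "positive"}'


raw = with_retries(flaky_generation, sleep=lambda seconds: None)  # skip real waits
print(attempts["count"], validate_sample(raw))
```

Injecting the sleep function keeps tests fast, and returning None for invalid samples makes it easy to count the rejection rate when comparing synthetic data with real data.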
5

PR4: Data Validation (Optional)

Test data quality with Cleanlab or Deepchecks.
Requirements:
  • Load labeled dataset
  • Run integrity checks
  • Identify label issues
  • Generate validation report
  • Document findings
Cleanlab example:
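from cleanlab.classification import CleanLearning

# `model` is any scikit-learn-compatible classifier
cl = CleanLearning(clf=model)
cl.fit(X_train, labels)

# get_label_issues() returns one row per example; keep only the flagged rows
issues = cl.get_label_issues()
flagged = issues[issues["is_label_issue"]]
print(f"Found {len(flagged)} potential errors")

# Save report
flagged.to_csv("label_issues.csv")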
Deepchecks example:
from deepchecks.tabular.suites import data_integrity

suite = data_integrity()
result = suite.run(dataset)
result.save_as_html("validation_report.html")
Reference: See Data Validation

Success Criteria

Submission

Code Requirements

  • Formatting: Use ruff format for Python code
  • Linting: Pass ruff check with no errors
  • Testing: Run pytest from repository root
  • Documentation: Include README in each directory

Pull Request Format

Title: [module-2] <concise description>
Example: [module-2] Add MinIO client with S3FS support
Body should include:
  • Summary of changes
  • How to test
  • Performance results (for benchmarks)
  • Screenshots (for UI/deployment)

Google Doc Requirements

Your design document should include:
  1. Data Section (H3 deliverable)
    • Dataset description
    • Storage architecture
    • Processing pipeline
    • Performance benchmarks
  2. Labeling Section (H4 deliverable)
    • Labeling guidelines
    • Cost/time estimates
    • Production workflow
    • Quality assurance plan

Resources

Reference Implementations

All source code available at:
~/workspace/source/module-2/
├── minio_storage/
├── processing/
├── streaming-dataset/
├── vector-db/
└── labeling/

Getting Help

  • Documentation: Review module pages
  • Code Examples: Check source directory
  • Community: Ask in course discussion forum
  • Office Hours: Attend weekly sessions

Next Steps

After completing Module 2:
  1. Ensure all PRs are merged
  2. Verify Google Doc is complete
  3. Proceed to Module 3: Model Training
Module 2 builds the data foundation for your ML system. Take time to understand storage, formats, and labeling deeply, because these decisions affect every downstream component.
