Overview
This module includes two practice sections with multiple deliverables. Complete all tasks to demonstrate mastery of data management for ML production systems.
H3: Data Storage & Processing
Learning Goals
- Deploy MinIO with multiple configuration options
- Implement and test Python storage clients
- Benchmark data format performance
- Optimize inference with parallel processing
- Create streaming datasets
- Build vector databases for RAG
Reading List
Essential Reading
Advanced Reading
Deep Dives
Tasks
PR1: MinIO Deployment
Write comprehensive README instructions for deploying MinIO:
Requirements:
- Local installation steps
- Docker deployment with docker run
- Kubernetes deployment with manifests
- Port forwarding instructions
- Access credentials and UI setup
minio_storage/README.md
Reference: See Storage documentation
PR2: MinIO Python Client
Develop a CRUD client with comprehensive tests:
Reference: See Python Client implementation
Requirements:
- Implement both native MinIO and S3FS clients
- Create, read, update, delete operations
- Environment-based configuration
- Pytest fixtures for testing
- Test upload/download functionality
minio_storage/minio_client.py
minio_storage/test_minio_client.py
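One possible shape for the native client is sketched below. It is a minimal illustration, not the reference implementation: the environment variable names (`MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_SECURE`) and the class names are assumptions, and the wrapper only relies on the `bucket_exists`, `make_bucket`, `fput_object`, `fget_object`, and `remove_object` methods of `minio.Minio`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MinioSettings:
    """Connection settings read from the environment (variable names are assumptions)."""

    endpoint: str
    access_key: str
    secret_key: str
    secure: bool = False

    @classmethod
    def from_env(cls) -> "MinioSettings":
        return cls(
            endpoint=os.environ.get("MINIO_ENDPOINT", "localhost:9000"),
            access_key=os.environ.get("MINIO_ACCESS_KEY", "minioadmin"),
            secret_key=os.environ.get("MINIO_SECRET_KEY", "minioadmin"),
            secure=os.environ.get("MINIO_SECURE", "false").lower() == "true",
        )


class MinioCrudClient:
    """Thin CRUD wrapper; `client` is expected to behave like `minio.Minio`."""

    def __init__(self, client, bucket: str) -> None:
        self.client = client
        self.bucket = bucket
        # Create the bucket on first use so uploads do not fail.
        if not self.client.bucket_exists(bucket_name=bucket):
            self.client.make_bucket(bucket_name=bucket)

    @classmethod
    def from_env(cls, bucket: str) -> "MinioCrudClient":
        # Imported lazily so this module loads even without the `minio` package.
        from minio import Minio

        s = MinioSettings.from_env()
        client = Minio(
            endpoint=s.endpoint,
            access_key=s.access_key,
            secret_key=s.secret_key,
            secure=s.secure,
        )
        return cls(client, bucket)

    def create(self, object_name: str, file_path: str) -> None:
        self.client.fput_object(
            bucket_name=self.bucket, object_name=object_name, file_path=file_path
        )

    def read(self, object_name: str, file_path: str) -> None:
        self.client.fget_object(
            bucket_name=self.bucket, object_name=object_name, file_path=file_path
        )

    def update(self, object_name: str, file_path: str) -> None:
        # Object stores overwrite on upload, so an update is a second put.
        self.create(object_name, file_path)

    def delete(self, object_name: str) -> None:
        self.client.remove_object(bucket_name=self.bucket, object_name=object_name)
```

Passing the client into the constructor keeps the wrapper easy to test: a pytest fixture can hand in either a real `Minio` instance pointed at a local server or a small in-memory fake.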
PR3: Data Format Benchmarks
Benchmark Pandas storage formats:
Requirements:
- Test CSV, Parquet, Feather, HDF5
- Measure save time, load time, file size
- Create visualization of results
- Document findings in README
- Recommend format for different use cases
Metrics to report:
- Write time (seconds)
- Read time (seconds)
- File size (MB)
- Compression ratio
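The metrics above can be collected with a small timing harness. The sketch below is one way to do it under stated assumptions: the function names are hypothetical, "compression ratio" is taken to mean baseline (CSV) size divided by format size, and `pandas_cases` assumes pandas is installed together with pyarrow (Parquet, Feather) and PyTables (HDF5).

```python
import os
import time
from typing import Callable


def benchmark_format(
    name: str, write: Callable[[str], None], read: Callable[[str], object], path: str
) -> dict:
    """Time one write and one read of `path` and record the resulting file size."""
    start = time.perf_counter()
    write(path)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    read(path)
    read_s = time.perf_counter() - start

    size_mb = os.path.getsize(path) / 1e6
    return {"format": name, "write_s": write_s, "read_s": read_s, "size_mb": size_mb}


def add_compression_ratio(rows: list[dict], baseline: str = "csv") -> list[dict]:
    """Compression ratio = baseline size / format size (an assumed definition)."""
    base = next(r["size_mb"] for r in rows if r["format"] == baseline)
    return [{**r, "compression_ratio": base / r["size_mb"]} for r in rows]


def pandas_cases(df, out_dir: str) -> list[tuple]:
    """(name, write, read, path) tuples for CSV, Parquet, Feather, and HDF5."""
    import pandas as pd  # imported lazily; only needed for the real benchmark

    def p(ext: str) -> str:
        return os.path.join(out_dir, f"data.{ext}")

    return [
        ("csv", lambda f: df.to_csv(f, index=False), pd.read_csv, p("csv")),
        ("parquet", df.to_parquet, pd.read_parquet, p("parquet")),
        ("feather", df.to_feather, pd.read_feather, p("feather")),
        (
            "hdf5",
            lambda f: df.to_hdf(f, key="data", mode="w"),
            lambda f: pd.read_hdf(f, key="data"),
            p("h5"),
        ),
    ]
```

With a DataFrame `df` in hand, `add_compression_ratio([benchmark_format(*case) for case in pandas_cases(df, "out")])` yields one row per format. A single timed run is noisy, so repeating each measurement and reporting the median is a reasonable refinement.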
processing/format_benchmark.py
Reference: See Format Comparison
PR4: Inference Benchmarks
Benchmark parallel inference performance:
Reference: See Inference Performance
Requirements:
- Single worker baseline
- ThreadPoolExecutor implementation
- ProcessPoolExecutor implementation
- Ray distributed processing
- Performance comparison table
| Method | Time (s) | Speedup |
|---|---|---|
| Single | X.XX | 1.0x |
| Thread | X.XX | Y.Yx |
| Process | X.XX | Y.Yx |
| Ray | X.XX | Y.Yx |
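A comparison table like the one above can be produced with a short harness. This is a sketch under assumptions: `predict` is a hypothetical stand-in for a real model call, the helper names are invented for illustration, and Ray is left out because it needs its own setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def predict(batch: list[float]) -> float:
    """Hypothetical stand-in for a model call; replace with real inference."""
    return sum(x * x for x in batch)


def run_single(batches: list[list[float]]) -> list[float]:
    """Single worker baseline."""
    return [predict(b) for b in batches]


def run_pool(executor_cls, batches: list[list[float]], max_workers: int = 4) -> list[float]:
    """Run with any concurrent.futures executor class; map preserves input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(predict, batches))


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def speedup_table(timings: dict[str, float], baseline: str = "Single") -> str:
    """Format timings as the Method / Time / Speedup table."""
    rows = ["| Method | Time (s) | Speedup |", "|---|---|---|"]
    for name, seconds in timings.items():
        rows.append(f"| {name} | {seconds:.2f} | {timings[baseline] / seconds:.1f}x |")
    return "\n".join(rows)
```

`run_pool(ThreadPoolExecutor, batches)` covers the thread case. The same function accepts `ProcessPoolExecutor`, but that call should sit under an `if __name__ == "__main__":` guard in a script so worker processes can import `predict`. Threads mostly help when inference releases the GIL or waits on I/O; pure-Python CPU work usually needs processes.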
PR5: Streaming Dataset (Optional)
Convert your dataset to streaming format:
Reference: See Streaming Datasets
Requirements:
- Choose format (MDS, WebDataset, TFRecord)
- Implement data writer
- Upload to S3/MinIO
- Create DataLoader for reading
- Benchmark loading speed
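To make the writer and reader steps concrete, here is a minimal sketch of a WebDataset-style layout using only the standard library. It does not use the `webdataset`, `streaming` (MDS), or TFRecord libraries; it only mimics the convention of tar shards whose members are named `<key>.<extension>`. The function names and the shard naming pattern are assumptions.

```python
import io
import json
import tarfile
from pathlib import Path
from typing import Iterable, Iterator


def write_shards(
    samples: Iterable[dict], out_dir: str, samples_per_shard: int = 1000
) -> list[Path]:
    """Write samples as tar shards; each sample becomes `<key>.json` in a shard."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []
    tar = None
    for i, sample in enumerate(samples):
        if i % samples_per_shard == 0:
            # Start a new shard once the current one is full.
            if tar is not None:
                tar.close()
            paths.append(out / f"shard-{len(paths):06d}.tar")
            tar = tarfile.open(paths[-1], "w")
        payload = json.dumps(sample).encode("utf-8")
        info = tarfile.TarInfo(name=f"{i:08d}.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return paths


def iter_shards(paths: Iterable[Path]) -> Iterator[dict]:
    """Stream samples back one at a time instead of loading whole shards."""
    for path in paths:
        with tarfile.open(path, "r") as tar:
            for member in tar:
                yield json.loads(tar.extractfile(member).read())
```

The resulting `.tar` files can be uploaded to S3/MinIO as ordinary objects. For the loading-speed benchmark, timing a full pass over `iter_shards` against a pass over the original dataset gives a first comparison before switching to a dedicated streaming library.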
PR6: Vector Database
Transform dataset to vector format and implement RAG:
Reference: See Vector Databases
Requirements:
- Convert text data to embeddings
- Create LanceDB/Chroma database
- Implement ingestion pipeline
- Build query interface
- Benchmark query latency
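The ingestion, query, and latency steps can be prototyped before choosing a database. The sketch below is a stand-in, not LanceDB or Chroma: it uses a toy hashed bag-of-words embedding and brute-force cosine search so the pipeline shape and the latency measurement are visible. All names here (`embed`, `InMemoryIndex`, `latency_ms`) are hypothetical.

```python
import math
import time
import zlib


def embed(text: str, dim: int = 256) -> list[float]:
    """Toy hashed bag-of-words embedding; a placeholder for a real embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryIndex:
    """Brute-force cosine search; stands in for a LanceDB or Chroma table."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def ingest(self, texts: list[str]) -> None:
        self.rows.extend((t, embed(t)) for t in texts)

    def query(self, text: str, k: int = 3) -> list[tuple[float, str]]:
        q = embed(text)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), t) for t, v in self.rows]
        return sorted(scored, reverse=True)[:k]


def latency_ms(index: InMemoryIndex, queries: list[str], repeats: int = 20) -> dict:
    """Median and 95th-percentile query latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            index.query(q)
            samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {"p50": samples[len(samples) // 2], "p95": samples[int(len(samples) * 0.95)]}
```

Swapping `InMemoryIndex` for a real vector store while keeping the `ingest` and `query` interface lets the same `latency_ms` function benchmark both.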
Google Doc: Data Section
Update your design document:
Required sections:
- Data Description
  - Dataset source and size
  - Features and labels
  - Data splits (train/val/test)
- Storage Strategy
  - Storage backend (S3/MinIO)
  - Data format choice (with justification)
  - Versioning approach (DVC)
- Processing Pipeline
  - Data loading strategy
  - Preprocessing steps
  - Performance optimizations
  - Streaming vs batch
- Infrastructure
  - Storage capacity needed
  - Compute requirements
  - Cost estimates
Success Criteria
H4: Data Labeling & Validation
Learning Goals
- Write effective labeling guidelines
- Deploy annotation tools
- Generate synthetic training data
- Validate data quality
- Version control datasets
Reading List
Labeling Best Practices
Validation
Synthetic Data
Tasks
Google Doc: Labeling Section
Add comprehensive labeling documentation:
1. Labeling Guidelines
- Task definition and objectives
- Label definitions with examples
- Edge case handling
- Quality check procedures
- Decision flowchart
2. Cost Estimation
- Label 50 samples manually
- Calculate time per sample
- Estimate total time needed
- Compute labeling budget
- Include calculation methodology
Reference: See Cost Estimation
3. Production Workflow
- Data sampling strategy
- Annotation tool setup
- Quality assurance process
- Active learning integration
- Feedback collection
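The cost estimation steps (time a manual sample, extrapolate to the full dataset, convert to a budget) reduce to a few lines of arithmetic. The sketch below shows one possible methodology; the function name, the hourly rate, and the `annotators_per_sample` overlap factor are illustrative assumptions, not values prescribed by the course.

```python
from statistics import mean


def estimate_labeling_budget(
    timed_seconds: list[float],
    total_samples: int,
    hourly_rate: float,
    annotators_per_sample: int = 1,
) -> dict:
    """Extrapolate from a manually timed sample to total hours and budget."""
    seconds_per_sample = mean(timed_seconds)
    # Each sample may be labeled by several annotators for quality control.
    total_hours = seconds_per_sample * total_samples * annotators_per_sample / 3600
    return {
        "seconds_per_sample": seconds_per_sample,
        "total_hours": total_hours,
        "budget": total_hours * hourly_rate,
    }
```

For example, if the 50 timed samples average 30 seconds each, then 10,000 samples at a rate of 15 per hour with a single annotator come to roughly 83.3 hours and a budget of about 1,250. Stating these inputs explicitly in the design document is what makes the calculation methodology reviewable.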
PR1: DVC Dataset Versioning
Commit data with DVC:
Reference: See Dataset Versioning
Requirements:
- Initialize DVC in repository
- Add dataset files to DVC tracking
- Configure MinIO/S3 as remote
- Push data to remote storage
- Document workflow in README
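The workflow above maps onto a short sequence of DVC CLI calls. The sketch below wraps them in Python so they can be printed for the README or executed; the data path, remote name, bucket URL, and endpoint are placeholders you would replace, and the helper names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess


def dvc_commands(
    data_path: str, remote_name: str, bucket_url: str, endpoint_url: str
) -> list[list[str]]:
    """Build the DVC commands for init, tracking, remote setup, and push."""
    return [
        ["dvc", "init"],
        ["dvc", "add", data_path],
        # -d marks this remote as the default one for push and pull.
        ["dvc", "remote", "add", "-d", remote_name, bucket_url],
        # MinIO is S3-compatible, so the S3 remote is pointed at its endpoint.
        ["dvc", "remote", "modify", remote_name, "endpointurl", endpoint_url],
        ["dvc", "push"],
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

A typical call would be `run_all(dvc_commands("data/train.csv", "minio", "s3://ml-data/dvc", "http://localhost:9000"))`, where every argument is an example value. After `dvc add`, the generated `.dvc` file and `.gitignore` entry still need to be committed with Git, and credentials can be supplied through the usual `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.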
PR2: Labeling Tool Deployment
Deploy Argilla or Label Studio. The requirements are listed first, followed by the files to submit.
- Choose tool (Argilla recommended)
- Create deployment configuration
- Docker/K8s deployment instructions
- Access and authentication setup
- Dataset creation example
labeling/docker-compose.yml or labeling/k8s-manifest.yaml
labeling/README.md
labeling/create_dataset.py
PR3: Synthetic Dataset (Optional)
Generate synthetic data with GPT:
Reference: See Synthetic Data Generation
Requirements:
- Design generation prompt
- Implement retry logic
- Validate generated samples
- Upload to labeling tool
- Compare with real data
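The retry and validation requirements can be combined in one loop. The sketch below assumes the model is asked to return a JSON object with `text` and `label` fields; that schema, the label set, and the `generate` callable (which would wrap your actual GPT API call) are all hypothetical. Failed parses and failed validations are retried with exponential backoff.

```python
import json
import time
from typing import Callable


def validate_sample(sample: object) -> None:
    """Raise ValueError unless the sample matches the assumed schema."""
    if not isinstance(sample, dict):
        raise ValueError("sample is not a JSON object")
    if not isinstance(sample.get("text"), str) or not sample["text"].strip():
        raise ValueError("missing text")
    if sample.get("label") not in {"positive", "negative"}:
        raise ValueError("unknown label")


def generate_with_retry(
    generate: Callable[[], str],
    validate: Callable[[object], None] = validate_sample,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `generate` until it yields a valid sample or attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            sample = json.loads(generate())  # JSONDecodeError is a ValueError
            validate(sample)
            return sample
        except (ValueError, ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2**attempt)
    raise RuntimeError(f"generation failed after {max_attempts} attempts") from last_error
```

Injecting `sleep` keeps tests fast, since a fixture can pass a no-op instead of waiting. API client libraries raise their own exception types for rate limits and timeouts, so the `except` clause would need to include those for a real integration.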
PR4: Data Validation (Optional)
Test data quality with Cleanlab or Deepchecks:
Reference: See Data Validation
Requirements:
- Load labeled dataset
- Run integrity checks
- Identify label issues
- Generate validation report
- Document findings
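Before wiring in Cleanlab or Deepchecks, it can help to see what a basic integrity report contains. The sketch below is a plain-Python stand-in, not either library's API: it flags empty texts, labels outside an allowed set, duplicate texts, and identical texts that carry conflicting labels. The row schema (`text`, `label`) and the function name are assumptions.

```python
from collections import Counter, defaultdict


def integrity_report(rows: list[dict], allowed_labels: set[str]) -> dict:
    """Summarize simple integrity and label issues in a labeled text dataset."""
    labels_by_text = defaultdict(set)
    counts = Counter()
    empty_text, invalid_label = [], []
    for i, row in enumerate(rows):
        text = (row.get("text") or "").strip().lower()
        label = row.get("label")
        if not text:
            empty_text.append(i)
        else:
            counts[text] += 1
            labels_by_text[text].add(label)
        if label not in allowed_labels:
            invalid_label.append(i)
    return {
        "n_rows": len(rows),
        "empty_text": empty_text,  # row indices
        "invalid_label": invalid_label,  # row indices
        "duplicate_texts": sorted(t for t, c in counts.items() if c > 1),
        # Same text, different labels: a likely annotation error.
        "conflicting_labels": sorted(t for t, ls in labels_by_text.items() if len(ls) > 1),
    }
```

The returned dictionary can be dumped to JSON as a lightweight validation report. Library-based checks go further, for example by using model predictions to find probable label errors, which this sketch does not attempt.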
Success Criteria
Submission
Code Requirements
- Formatting: Use ruff format for Python code
- Linting: Pass ruff check with no errors
- Testing: Run pytest from repository root
- Documentation: Include README in each directory
Pull Request Format
Title: [module-2] <concise description>
Example: [module-2] Add MinIO client with S3FS support
Body should include:
- Summary of changes
- How to test
- Performance results (for benchmarks)
- Screenshots (for UI/deployment)
Google Doc Requirements
Your design document should include:
- Data Section (H3 deliverable)
  - Dataset description
  - Storage architecture
  - Processing pipeline
  - Performance benchmarks
- Labeling Section (H4 deliverable)
  - Labeling guidelines
  - Cost/time estimates
  - Production workflow
  - Quality assurance plan
Resources
Reference Implementations
All source code available at:
Getting Help
- Documentation: Review module pages
- Code Examples: Check source directory
- Community: Ask in course discussion forum
- Office Hours: Attend weekly sessions
Next Steps
After completing Module 2:
- Ensure all PRs are merged
- Verify Google Doc is complete
- Proceed to Module 3: Model Training
Module 2 builds the data foundation for your ML system. Take time to understand storage, formats, and labeling deeply, because these decisions affect every downstream component.