Overview
The constraint experiment functionality systematically tests pipeline performance across combinations of chunk sizes, memory limits, and compute constraints. This enables:
- Finding optimal configurations for constrained environments
- Understanding performance trade-offs
- Validating edge and low-resource scenarios
- Identifying Pareto-optimal points
Running Constraint Experiments
Programmatic Usage
Command-Line Usage
The constraint experiment runs automatically as part of run_all():
Experiment Methodology
Implementation
From engine.py:389-413:
Parameter Grid
The experiment tests all combinations of:
- Chunk sizes: [64, config.chunk_size]
  - Minimum: 64 rows
  - Maximum: Configured chunk size
  - Duplicates removed
- Memory limits: [256, config.max_memory_mb]
  - Low-memory scenario: 256 MB
  - Configured limit: User-specified
- Compute limits: [0.5, config.max_compute_units]
  - CPU-constrained: 50% utilization
  - Full utilization: User-specified
Single Run Implementation
From engine.py:376-387:
Generated Artifacts
constraint_experiment.csv
Complete results matrix in output_dir/benchmarks/:
- chunk_size: Streaming chunk size (rows)
- memory_limit_mb: Maximum memory constraint
- compute_limit: CPU constraint factor (0.0-1.0)
- preprocessing_latency_s: Total preprocessing time
- peak_memory_mb: Maximum memory usage observed
- training_time_s: Model training time (includes preprocessing)
- model_accuracy_r2: Regression R² score
- model_rmse: Root mean squared error
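As a rough sketch of how these columns might be read back for analysis, the snippet below parses the CSV with the standard library. `load_results` is a hypothetical helper, the sample rows are made-up values for illustration only, and treating `chunk_size` and `memory_limit_mb` as integers is an assumption.

```python
import csv
import io

# Assumption: these two columns hold whole numbers; the rest are floats.
INT_COLUMNS = {"chunk_size", "memory_limit_mb"}


def load_results(fileobj):
    """Parse constraint_experiment.csv rows into dicts with numeric values."""
    rows = []
    for raw in csv.DictReader(fileobj):
        row = {}
        for key, value in raw.items():
            row[key] = int(float(value)) if key in INT_COLUMNS else float(value)
        rows.append(row)
    return rows


# Made-up sample; a real file would be opened from output_dir/benchmarks/.
sample = io.StringIO(
    "chunk_size,memory_limit_mb,compute_limit,preprocessing_latency_s,"
    "peak_memory_mb,training_time_s,model_accuracy_r2,model_rmse\n"
    "64,256,0.5,2.0,120.0,3.0,0.80,4.0\n"
    "1024,2048,1.0,1.0,300.0,1.5,0.82,3.8\n"
)
results = load_results(sample)
print(results[0]["chunk_size"], results[1]["model_accuracy_r2"])
```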
constraint_experiment_log.jsonl
JSON Lines format for programmatic analysis in output_dir/reports/:
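JSON Lines stores one JSON object per line, so the log can be read line by line. The sketch below shows that pattern; `parse_jsonl` is a hypothetical helper, and the field names in the sample are assumed to mirror the CSV columns, which this page does not confirm.

```python
import json


def parse_jsonl(text):
    """Return one dict per non-empty line of JSON Lines text."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Made-up records for illustration; the real log's fields may differ.
sample_log = (
    '{"chunk_size": 64, "peak_memory_mb": 120.0}\n'
    '{"chunk_size": 1024, "peak_memory_mb": 300.0}\n'
)
records = parse_jsonl(sample_log)
print(len(records))
```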
Visualization Plots
Generated in output_dir/benchmarks/ (see engine.py:509-555):
latency_vs_accuracy.png
Scatter plot showing trade-off between preprocessing speed and model quality:
- X-axis: Preprocessing latency (seconds)
- Y-axis: Model accuracy (R²)
- Color: Compute constraint level
memory_vs_accuracy.png
Memory consumption vs. model quality:
- X-axis: Peak memory (MB)
- Y-axis: Model accuracy (R²)
- Color: Memory limit setting
latency_memory_accuracy.png
Three-way relationship visualization:
- X-axis: Peak memory (MB)
- Y-axis: Preprocessing latency (seconds)
- Color: Model accuracy (R²)
Analyzing Results
Finding Optimal Configuration
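One way to pick a configuration, sketched under assumptions: treat latency and peak memory as budgets, then take the highest R² among the runs that fit. `best_config` is a hypothetical helper and the rows are made-up values using the column names from constraint_experiment.csv.

```python
def best_config(rows, max_latency_s, max_peak_memory_mb):
    """Return the most accurate row within both budgets, or None if none fit."""
    feasible = [
        r for r in rows
        if r["preprocessing_latency_s"] <= max_latency_s
        and r["peak_memory_mb"] <= max_peak_memory_mb
    ]
    if not feasible:
        return None
    # Highest R² wins; ties are broken by lower latency.
    return max(
        feasible,
        key=lambda r: (r["model_accuracy_r2"], -r["preprocessing_latency_s"]),
    )


# Made-up rows for illustration only.
rows = [
    {"chunk_size": 64, "preprocessing_latency_s": 2.0, "peak_memory_mb": 120.0, "model_accuracy_r2": 0.80},
    {"chunk_size": 1024, "preprocessing_latency_s": 1.0, "peak_memory_mb": 300.0, "model_accuracy_r2": 0.82},
    {"chunk_size": 1024, "preprocessing_latency_s": 0.9, "peak_memory_mb": 500.0, "model_accuracy_r2": 0.82},
]
choice = best_config(rows, max_latency_s=1.5, max_peak_memory_mb=400.0)
print(choice["peak_memory_mb"])
```

Expressing the limits as hard budgets keeps the selection easy to explain; a weighted score is an alternative when no single metric has a firm ceiling.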
Pareto Frontier Analysis
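A run is dominated when some other run is at least as good on every objective and strictly better on at least one. The sketch below assumes three objectives (lower latency, lower peak memory, higher R²); `dominates` and `pareto_front` are hypothetical helpers and the rows are made-up values.

```python
def dominates(a, b):
    """True if run a is no worse than run b everywhere and better somewhere."""
    no_worse = (
        a["preprocessing_latency_s"] <= b["preprocessing_latency_s"]
        and a["peak_memory_mb"] <= b["peak_memory_mb"]
        and a["model_accuracy_r2"] >= b["model_accuracy_r2"]
    )
    better = (
        a["preprocessing_latency_s"] < b["preprocessing_latency_s"]
        or a["peak_memory_mb"] < b["peak_memory_mb"]
        or a["model_accuracy_r2"] > b["model_accuracy_r2"]
    )
    return no_worse and better


def pareto_front(rows):
    """Keep every run that no other run dominates."""
    return [r for r in rows if not any(dominates(other, r) for other in rows)]


# Made-up rows for illustration only.
runs = [
    {"preprocessing_latency_s": 2.0, "peak_memory_mb": 120.0, "model_accuracy_r2": 0.80},
    {"preprocessing_latency_s": 1.0, "peak_memory_mb": 300.0, "model_accuracy_r2": 0.82},
    {"preprocessing_latency_s": 2.5, "peak_memory_mb": 310.0, "model_accuracy_r2": 0.81},
]
front = pareto_front(runs)
print(len(front))  # the third run is dominated by the second
```

The pairwise check is quadratic in the number of runs, which is negligible for a grid of at most eight configurations.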
Identify configurations that aren’t strictly dominated:
Memory-Constrained Scenarios
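For a target machine with a known memory budget, one hedged approach is to keep only runs whose observed peak memory fits and rank them by accuracy. `rank_within_memory` is a hypothetical helper and the rows are made-up values.

```python
def rank_within_memory(rows, available_mb):
    """Return runs that fit in available_mb, most accurate first."""
    fitting = [r for r in rows if r["peak_memory_mb"] <= available_mb]
    return sorted(fitting, key=lambda r: r["model_accuracy_r2"], reverse=True)


# Made-up rows for illustration only.
memory_runs = [
    {"chunk_size": 64, "peak_memory_mb": 120.0, "model_accuracy_r2": 0.80},
    {"chunk_size": 1024, "peak_memory_mb": 300.0, "model_accuracy_r2": 0.82},
    {"chunk_size": 1024, "peak_memory_mb": 900.0, "model_accuracy_r2": 0.83},
]
ranked = rank_within_memory(memory_runs, available_mb=512)
print([r["chunk_size"] for r in ranked])
```

Filtering on the observed `peak_memory_mb` rather than the configured `memory_limit_mb` reflects what the run actually used, which is the safer figure when sizing for real hardware.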
Filter results by memory availability:
CPU-Constrained Scenarios
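To see what a lower compute limit costs, the runs can be grouped by `compute_limit` and averaged. This is only a sketch: `summarize_by_compute` is a hypothetical helper and the rows are made-up values.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_compute(rows):
    """Map each compute_limit to its mean latency and mean R²."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["compute_limit"]].append(r)
    return {
        limit: {
            "mean_latency_s": mean(r["preprocessing_latency_s"] for r in group),
            "mean_r2": mean(r["model_accuracy_r2"] for r in group),
        }
        for limit, group in sorted(groups.items())
    }


# Made-up rows for illustration only.
cpu_runs = [
    {"compute_limit": 0.5, "preprocessing_latency_s": 2.0, "model_accuracy_r2": 0.80},
    {"compute_limit": 0.5, "preprocessing_latency_s": 3.0, "model_accuracy_r2": 0.80},
    {"compute_limit": 1.0, "preprocessing_latency_s": 1.0, "model_accuracy_r2": 0.80},
    {"compute_limit": 1.0, "preprocessing_latency_s": 2.0, "model_accuracy_r2": 0.80},
]
summary = summarize_by_compute(cpu_runs)
print(summary[0.5]["mean_latency_s"], summary[1.0]["mean_latency_s"])
```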
Edge Scenarios
Low-Memory Systems
From the hardware profiling documentation:
Recommendations:
- Enable --spill-to-disk
- Reduce --chunk-size
- Keep --max-memory-mb realistic for resident process limits
CPU-Constrained Systems
Recommendations:
- Lower --max-compute-units
- Use smaller --batch-size
- Keep --n-jobs 1 to avoid contention
Minimal Resource Scenario
Combined memory and CPU constraints:
Experiment Summary
The experiment results include a summary dictionary (engine.py:407-412):
Parallel Execution
Constraint experiments can run in parallel (engine.py:397-402):
n_jobs=1.
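The general pattern of dispatching independent grid runs to workers can be sketched as follows. This is not the code at engine.py:397-402: `run_grid` and `fake_run` are hypothetical, the use of threads is an assumption, and falling back to a plain loop when `n_jobs` is 1 is also assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_grid(grid, run_single, n_jobs=1):
    """Run run_single(chunk, memory, compute) for each combo, in grid order."""
    if n_jobs == 1:
        # Sequential path: no worker pool, no contention between runs.
        return [run_single(*combo) for combo in grid]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(lambda combo: run_single(*combo), grid))


def fake_run(chunk_size, memory_limit_mb, compute_limit):
    """Hypothetical stand-in for a single constrained pipeline run."""
    return {
        "chunk_size": chunk_size,
        "memory_limit_mb": memory_limit_mb,
        "compute_limit": compute_limit,
    }


demo_grid = [(64, 256, 0.5), (64, 256, 1.0), (1024, 256, 0.5)]
sequential = run_grid(demo_grid, fake_run, n_jobs=1)
parallel = run_grid(demo_grid, fake_run, n_jobs=2)
print(sequential == parallel)
```

Running several configurations at once can skew per-run latency and peak-memory measurements, since the runs compete for the same resources.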
Best Practices
- Run experiments on representative data: Use production-scale samples
- Test edge cases separately: Minimal resource scenarios may need custom grids
- Validate constraints: Verify peak usage doesn’t exceed limits
- Document findings: Save experiment reports for comparison
- Use Pareto analysis: Identify optimal trade-offs, not just best single metric
- Consider deployment environment: Match constraints to target hardware
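The "validate constraints" practice can be checked mechanically by comparing observed peak memory against the configured limit in each row. A minimal sketch, assuming the CSV column names listed earlier; `find_violations` and its `tolerance_mb` parameter are hypothetical, and the rows are made-up values.

```python
def find_violations(rows, tolerance_mb=0.0):
    """Return runs whose peak memory exceeded their limit plus a tolerance."""
    return [
        r for r in rows
        if r["peak_memory_mb"] > r["memory_limit_mb"] + tolerance_mb
    ]


# Made-up rows for illustration only.
checked_runs = [
    {"chunk_size": 64, "memory_limit_mb": 256, "peak_memory_mb": 120.0},
    {"chunk_size": 1024, "memory_limit_mb": 256, "peak_memory_mb": 310.0},
]
violations = find_violations(checked_runs)
print([r["chunk_size"] for r in violations])
```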
Limitations
- Fixed parameter grid: Only tests predefined combinations
- No hyperparameter tuning: Model parameters are fixed
- No sequential dependencies: Each run is independent, so warm-up effects are not captured
- Coarse granularity: Limited to 2 values per parameter
Next Steps
- Benchmarking - Full statistical analysis
- Hardware Profiling - Operator-level details
- Optimization Strategies - Apply findings