Overview

The PipelineConfig dataclass provides centralized configuration for all pipeline operations. It controls resource limits, reproducibility settings, and output behavior.

Class Definition

@dataclass(frozen=True)
class PipelineConfig
Source: ~/workspace/source/NBA Data Preprocessing/task/pipeline/config.py:8

Fields

random_seed (int, default: 42): Random seed for reproducibility across all pipeline operations.
chunk_size (int, default: 128): Number of rows per chunk in streaming operations.
batch_size (int, default: 256): Batch size for model training and evaluation.
n_jobs (int, default: 1): Number of parallel jobs for constraint experiments. Set to -1 to use all CPUs.
max_memory_mb (int, default: 1024): Maximum memory limit, in megabytes, for streaming operations.
max_compute_units (float, default: 1.0): Maximum compute units, on a 0.0 to 1.0 scale, for resource-constrained execution.
benchmark_runs (int, default: 5): Number of runs for benchmark statistical analysis.
adaptive_chunk_resize (bool, default: True): Enable automatic chunk size reduction when memory limits are exceeded.
max_chunk_retries (int, default: 3): Maximum retry attempts for processing a chunk before failure.
spill_to_disk (bool, default: False): Enable spilling intermediate results to disk during streaming.
output_dir (Path, default: Path('artifacts')): Root directory for all pipeline outputs.
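
For orientation, the documented fields can be gathered into a stand-in dataclass. This is a sketch rebuilt from the field descriptions, not the contents of config.py; the name PipelineConfigSketch is hypothetical, and the real class may add validation or methods that are not shown here.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class PipelineConfigSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    random_seed: int = 42
    chunk_size: int = 128
    batch_size: int = 256
    n_jobs: int = 1
    max_memory_mb: int = 1024
    max_compute_units: float = 1.0
    benchmark_runs: int = 5
    adaptive_chunk_resize: bool = True
    max_chunk_retries: int = 3
    spill_to_disk: bool = False
    output_dir: Path = Path("artifacts")


# Instantiate with defaults and list every field with its value.
config = PipelineConfigSketch()
for f in fields(config):
    print(f"{f.name} = {getattr(config, f.name)!r}")
```

Iterating over dataclasses.fields is a convenient way to log the full configuration at the start of a run.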

Methods

ensure_output_dirs

def ensure_output_dirs(self) -> None
Creates the output directory structure for pipeline artifacts: output_dir itself plus the intermediate, reports, benchmarks, metadata, and profiles subdirectories. Returns None.
Example:
from pathlib import Path
from pipeline.config import PipelineConfig

config = PipelineConfig(
    random_seed=42,
    chunk_size=256,
    max_memory_mb=2048,
    output_dir=Path('my_pipeline_outputs')
)

config.ensure_output_dirs()
# Creates:
# - my_pipeline_outputs/
# - my_pipeline_outputs/intermediate/
# - my_pipeline_outputs/reports/
# - my_pipeline_outputs/benchmarks/
# - my_pipeline_outputs/metadata/
# - my_pipeline_outputs/profiles/
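
The directory layout above could be produced with pathlib alone. The following is a minimal sketch under that assumption; the function ensure_output_dirs_sketch and the SUBDIRS tuple are hypothetical names, and the actual method in config.py may differ.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the documented layout.
SUBDIRS = ("intermediate", "reports", "benchmarks", "metadata", "profiles")


def ensure_output_dirs_sketch(output_dir: Path) -> None:
    """Create output_dir and its documented subdirectories if missing."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for name in SUBDIRS:
        (output_dir / name).mkdir(exist_ok=True)


# Run inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_pipeline_outputs"
    ensure_output_dirs_sketch(root)
    ensure_output_dirs_sketch(root)  # a second call is a no-op
    created = sorted(p.name for p in root.iterdir() if p.is_dir())
    print(created)
```

Passing exist_ok=True makes the call safe to repeat, which is useful when several entry points each set up the output tree.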

Usage Examples

Basic Configuration

from pipeline.config import PipelineConfig

# Use default settings
config = PipelineConfig()

print(config.random_seed)  # 42
print(config.chunk_size)   # 128

Resource-Constrained Configuration

from pathlib import Path
from pipeline.config import PipelineConfig

# Configure for low-memory environment
config = PipelineConfig(
    random_seed=123,
    chunk_size=64,
    batch_size=128,
    max_memory_mb=512,
    max_compute_units=0.5,
    adaptive_chunk_resize=True,
    output_dir=Path('limited_resources_run')
)

config.ensure_output_dirs()

High-Performance Configuration

from pathlib import Path
from pipeline.config import PipelineConfig

# Configure for maximum performance
config = PipelineConfig(
    random_seed=42,
    chunk_size=512,
    batch_size=1024,
    n_jobs=-1,  # Use all CPU cores
    max_memory_mb=8192,
    max_compute_units=1.0,
    benchmark_runs=10,
    spill_to_disk=False,
    output_dir=Path('high_performance_run')
)
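
The page does not say how n_jobs=-1 is translated into a worker count. One common convention, shown here purely as an assumption, is to map -1 to the number of CPUs reported by the operating system; resolve_n_jobs is a hypothetical helper, not part of the pipeline.

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Map a documented n_jobs value to a concrete worker count (assumed policy)."""
    if n_jobs == -1:
        # os.cpu_count() can return None, so fall back to a single worker.
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError(f"n_jobs must be -1 or a positive integer, got {n_jobs}")
    return n_jobs


print(resolve_n_jobs(1))
print(resolve_n_jobs(-1))
```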

Debugging Configuration

from pathlib import Path
from pipeline.config import PipelineConfig

# Enable disk spilling and adaptive resizing for troubleshooting
config = PipelineConfig(
    random_seed=42,
    chunk_size=128,
    adaptive_chunk_resize=True,
    max_chunk_retries=5,
    spill_to_disk=True,
    output_dir=Path('debug_run')
)
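
To illustrate how adaptive_chunk_resize and max_chunk_retries might interact, the sketch below halves the chunk size each time processing raises MemoryError and gives up once the retry budget is spent. The halving policy, the use of MemoryError as the signal, and the names process_with_retries and fake_process are all assumptions made for illustration; the page does not describe the pipeline's actual strategy.

```python
from typing import Callable, List, Sequence, Tuple


def process_with_retries(
    rows: Sequence[int],
    chunk_size: int,
    max_chunk_retries: int,
    process: Callable[[Sequence[int]], List[int]],
) -> Tuple[List[int], int]:
    """Process rows in chunks, halving the chunk size after each MemoryError."""
    size = chunk_size
    for attempt in range(max_chunk_retries + 1):
        try:
            results: List[int] = []
            for start in range(0, len(rows), size):
                results.extend(process(rows[start:start + size]))
            return results, size
        except MemoryError:
            # Out of retries, or the chunk cannot shrink further: fail.
            if attempt == max_chunk_retries or size == 1:
                raise
            size = max(1, size // 2)
    raise RuntimeError("unreachable")


def fake_process(batch: Sequence[int]) -> List[int]:
    """Pretend that batches larger than 40 rows exceed the memory limit."""
    if len(batch) > 40:
        raise MemoryError("simulated memory limit exceeded")
    return [value * 2 for value in batch]


doubled, final_size = process_with_retries(
    rows=list(range(100)), chunk_size=128, max_chunk_retries=3, process=fake_process
)
print(final_size)
```

For simplicity the sketch restarts from the first row after each resize; a real streaming pipeline would more likely resume from the chunk that failed.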

Notes

  • The dataclass is frozen (immutable) to ensure configuration consistency throughout pipeline execution
  • All pipeline components accept a PipelineConfig instance and respect its settings
  • Resource limits (max_memory_mb, max_compute_units) are soft limits that trigger adaptive behavior rather than hard failures
  • When adaptive_chunk_resize=True, the pipeline automatically reduces chunk sizes if memory limits are exceeded
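
Because the dataclass is frozen, assigning to a field raises dataclasses.FrozenInstanceError, and a changed configuration has to be built as a new instance. The sketch below shows this with a reduced, hypothetical stand-in class (FrozenConfigSketch) so that it runs without the pipeline package; dataclasses.replace is standard-library behavior and applies to any dataclass.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenConfigSketch:
    """Hypothetical two-field stand-in for the frozen PipelineConfig."""

    chunk_size: int = 128
    max_memory_mb: int = 1024


base = FrozenConfigSketch()

# Direct assignment is rejected on a frozen dataclass.
try:
    base.chunk_size = 64
    mutated = True
except FrozenInstanceError:
    mutated = False

# Derive a modified copy instead; the original stays unchanged.
low_memory = replace(base, chunk_size=64, max_memory_mb=512)
print(mutated, base.chunk_size, low_memory.chunk_size)
```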
