Overview

The pipeline includes two optimized configuration templates that represent common deployment scenarios. Templates are JSON files that can be loaded via the --config-template CLI argument.

Edge Template

Optimized for edge devices with limited memory and compute.
Location: configs/pipeline.edge.template.json
{
  "random_seed": 42,
  "chunk_size": 64,
  "batch_size": 96,
  "n_jobs": 1,
  "max_memory_mb": 256,
  "max_compute_units": 0.4,
  "benchmark_runs": 3,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": true,
  "output_dir": "artifacts_edge"
}

Edge Template Characteristics

  • chunk_size (int, default: 64): Small chunks minimize the memory footprint on edge devices.
  • batch_size (int, default: 96): Conservative batch size to avoid memory exhaustion.
  • n_jobs (int, default: 1): Single-threaded execution to reduce overhead on limited cores.
  • max_memory_mb (int, default: 256): Strict 256 MB memory limit for edge deployment.
  • max_compute_units (float, default: 0.4): Throttled to 40% to leave resources for other processes.
  • benchmark_runs (int, default: 3): Fewer benchmark iterations to reduce processing time.
  • spill_to_disk (bool, default: true): Enabled; critical for handling datasets larger than available RAM.

Use Cases

  • Raspberry Pi or similar single-board computers
  • IoT devices with limited resources
  • Mobile or embedded systems
  • Environments where memory is <512MB

Server Template

Optimized for high-performance server environments with ample resources.
Location: configs/pipeline.server.template.json
{
  "random_seed": 42,
  "chunk_size": 256,
  "batch_size": 512,
  "n_jobs": 4,
  "max_memory_mb": 4096,
  "max_compute_units": 1.0,
  "benchmark_runs": 5,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": false,
  "output_dir": "artifacts_server"
}

Server Template Characteristics

  • chunk_size (int, default: 256): Large chunks maximize throughput on powerful hardware.
  • batch_size (int, default: 512): Large batches leverage vectorization for faster processing.
  • n_jobs (int, default: 4): Multi-threaded execution for parallel processing.
  • max_memory_mb (int, default: 4096): Generous 4 GB memory allocation for complex operations.
  • max_compute_units (float, default: 1.0): Full compute resources available (100%).
  • benchmark_runs (int, default: 5): More iterations for statistically robust benchmarks.
  • spill_to_disk (bool, default: false): Disabled; all data stays in memory for maximum performance.

Use Cases

  • Cloud compute instances (AWS, GCP, Azure)
  • On-premise data processing servers
  • Development workstations
  • Environments with >8GB RAM
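
To see at a glance which settings separate the two templates, the values can be compared key by key. The sketch below copies the values from the two listings above; the helper name diff_templates is hypothetical and not part of the pipeline.

```python
# Values copied from the edge and server template listings above.
EDGE = {
    "random_seed": 42, "chunk_size": 64, "batch_size": 96, "n_jobs": 1,
    "max_memory_mb": 256, "max_compute_units": 0.4, "benchmark_runs": 3,
    "adaptive_chunk_resize": True, "max_chunk_retries": 3,
    "spill_to_disk": True, "output_dir": "artifacts_edge",
}
SERVER = {
    "random_seed": 42, "chunk_size": 256, "batch_size": 512, "n_jobs": 4,
    "max_memory_mb": 4096, "max_compute_units": 1.0, "benchmark_runs": 5,
    "adaptive_chunk_resize": True, "max_chunk_retries": 3,
    "spill_to_disk": False, "output_dir": "artifacts_server",
}


def diff_templates(a, b):
    """Hypothetical helper: map each differing key to (value_in_a, value_in_b)."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


changed = diff_templates(EDGE, SERVER)
for key, (edge_value, server_value) in changed.items():
    print(f"{key}: edge={edge_value!r} server={server_value!r}")
```

Only random_seed, adaptive_chunk_resize, and max_chunk_retries are shared; every other key differs between the two templates.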

Using Templates

Load Template via CLI

python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json

Override Template Values

CLI arguments take precedence over template values:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json \
  --chunk-size 128 \
  --n-jobs 2
This loads the edge template but overrides chunk_size to 128 and n_jobs to 2.
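
The precedence rule can be pictured as a dictionary merge in which any explicitly passed CLI value replaces the template value. This is a minimal sketch of that idea, not the code in run_pipeline.py: the function name apply_overrides is hypothetical, and it assumes unset CLI flags arrive as None.

```python
def apply_overrides(template_values, cli_overrides):
    """Return a new dict in which non-None CLI values replace template values.

    Hypothetical helper for illustration; assumes unset flags are None.
    """
    merged = dict(template_values)  # copy so the template dict is untouched
    for key, value in cli_overrides.items():
        if value is not None:
            merged[key] = value
    return merged


# A subset of the edge template, plus the overrides from the command above.
edge = {"chunk_size": 64, "batch_size": 96, "n_jobs": 1}
cli = {"chunk_size": 128, "n_jobs": 2, "batch_size": None}

merged = apply_overrides(edge, cli)
print(merged)  # {'chunk_size': 128, 'batch_size': 96, 'n_jobs': 2}
```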

Load Template in Python

import json
from pathlib import Path
from pipeline.config import PipelineConfig

# Load template
template_path = Path('configs/pipeline.server.template.json')
template_values = json.loads(template_path.read_text(encoding='utf-8'))

# Convert output_dir string to Path
template_values['output_dir'] = Path(template_values['output_dir'])

# Create config from template
config = PipelineConfig(**template_values)

Creating Custom Templates

You can create your own templates for specific environments:
{
  "random_seed": 42,
  "chunk_size": 192,
  "batch_size": 384,
  "n_jobs": 2,
  "max_memory_mb": 2048,
  "max_compute_units": 0.75,
  "benchmark_runs": 5,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": false,
  "output_dir": "artifacts_custom"
}
Save the file as JSON and load it with --config-template path/to/your/template.json.
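
Before pointing the pipeline at a hand-written template, it can help to check it for missing keys and wrong value types. The sketch below is one way to do that, assuming the eleven keys and types seen in the templates on this page; the real PipelineConfig may accept other fields, and check_template is a hypothetical helper.

```python
import json

# Keys and types as they appear in the templates on this page (assumption:
# the real PipelineConfig may define additional or different fields).
EXPECTED_TYPES = {
    "random_seed": int, "chunk_size": int, "batch_size": int, "n_jobs": int,
    "max_memory_mb": int, "max_compute_units": float, "benchmark_runs": int,
    "adaptive_chunk_resize": bool, "max_chunk_retries": int,
    "spill_to_disk": bool, "output_dir": str,
}


def check_template(values):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in values:
            problems.append(f"missing key: {key}")
            continue
        value = values[key]
        if expected is float:
            # JSON has a single number type, so accept 1 as well as 1.0.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif expected is int:
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = isinstance(value, int) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    problems.extend(f"unexpected key: {k}" for k in values if k not in EXPECTED_TYPES)
    return problems


custom = json.loads("""{
  "random_seed": 42, "chunk_size": 192, "batch_size": 384, "n_jobs": 2,
  "max_memory_mb": 2048, "max_compute_units": 0.75, "benchmark_runs": 5,
  "adaptive_chunk_resize": true, "max_chunk_retries": 3,
  "spill_to_disk": false, "output_dir": "artifacts_custom"
}""")
print(check_template(custom))  # []
```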

Template Selection Guide

Criteria        | Edge Template        | Server Template
Available RAM   | <512MB               | >4GB
CPU Cores       | 1-2                  | 4+
Dataset Size    | <100MB               | Any size
Priority        | Resource efficiency  | Maximum performance
Disk Spilling   | Enabled              | Disabled
Processing Time | Slower, conservative | Faster, aggressive
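
The RAM and CPU rows of the guide can be turned into a small helper for scripted deployments. This is only a sketch: the function suggest_template is hypothetical, requiring both criteria to match is an assumption, and machines that fall between the two profiles are pointed at a custom template.

```python
def suggest_template(ram_mb, cpu_cores):
    """Hypothetical helper: map hardware to 'edge', 'server', or 'custom'.

    Thresholds come from the selection guide: edge is <512MB RAM and
    1-2 cores; server is >4GB RAM and 4+ cores. Requiring both criteria
    to match is an assumption of this sketch.
    """
    if ram_mb < 512 and cpu_cores <= 2:
        return "edge"
    if ram_mb > 4096 and cpu_cores >= 4:
        return "server"
    return "custom"  # in-between hardware: start from a custom template


print(suggest_template(256, 1))    # edge
print(suggest_template(8192, 8))   # server
print(suggest_template(2048, 2))   # custom
```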

Next Steps

  • Configuration Overview: Learn about all configuration options
  • CLI Reference: See all command-line arguments
