Overview
The data ingestion stage is the entry point of the NBA preprocessing pipeline. The DataIngestor class loads data from multiple sources (CSV files, paths, or in-memory DataFrames) and provides streaming capabilities for large datasets.
DataIngestor Class
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/ingestion/loader.py:19
Initialization
Random seed for reproducibility across data loading operations
Core Methods
load()
Loads data from various source types into a pandas DataFrame.
Data source: can be a file path (string), a Path object, or an existing DataFrame
When loading from a DataFrame, the method returns a copy to prevent unintended mutations of the original data.
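The copy-on-load behavior described above can be illustrated with a minimal sketch. The function name `load_source_sketch` is hypothetical and stands in for `DataIngestor.load()`; the real method's signature and internals are not shown in this page, so treat this as an assumption-laden illustration rather than the project's code.

```python
from pathlib import Path
from typing import Union

import pandas as pd

Source = Union[str, Path, pd.DataFrame]


def load_source_sketch(source: Source) -> pd.DataFrame:
    """Hypothetical stand-in for DataIngestor.load(); not the project's code."""
    if isinstance(source, pd.DataFrame):
        # Return a copy so later edits never reach the caller's frame.
        return source.copy()
    # Assumption: strings and Path objects are treated as CSV file paths.
    return pd.read_csv(Path(source))


original = pd.DataFrame({"player": ["A", "B"], "points": [10, 20]})
loaded = load_source_sketch(original)
loaded.loc[0, "points"] = 99  # mutates the copy only
```

After this runs, `original` still holds 10 in its first row while `loaded` holds 99, which is the isolation the copy is meant to provide.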
stream_chunks()
Streams data in configurable chunk sizes for memory-efficient processing of large datasets.
Data source to stream from
Number of rows per chunk
- DataFrame source: Splits the DataFrame using iloc with the specified chunk size
- File source: Uses pandas read_csv with the chunksize parameter for efficient streaming
- Each chunk is returned as a copy to ensure isolation
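The two chunking paths listed above can be sketched as a single generator. The name `stream_chunks_sketch` and its default chunk size are hypothetical; only the use of `iloc` slicing, `read_csv` with `chunksize`, and per-chunk copies comes from the description.

```python
from pathlib import Path
from typing import Iterator, Union

import pandas as pd


def stream_chunks_sketch(
    source: Union[str, Path, pd.DataFrame], chunk_size: int = 2
) -> Iterator[pd.DataFrame]:
    """Hypothetical stand-in for DataIngestor.stream_chunks()."""
    if isinstance(source, pd.DataFrame):
        # In-memory source: slice consecutive row ranges with iloc.
        for start in range(0, len(source), chunk_size):
            yield source.iloc[start:start + chunk_size].copy()
    else:
        # File source: let pandas read the CSV lazily, chunk_size rows at a time.
        for chunk in pd.read_csv(Path(source), chunksize=chunk_size):
            yield chunk.copy()


frame = pd.DataFrame({"points": range(5)})
chunk_sizes = [len(chunk) for chunk in stream_chunks_sketch(frame, chunk_size=2)]
```

With five rows and a chunk size of 2, `chunk_sizes` is `[2, 2, 1]`: the last chunk is simply shorter when the row count is not a multiple of the chunk size.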
fingerprint()
Generates a cryptographic fingerprint of the dataset for versioning and reproducibility tracking.
Data source to fingerprint
DatasetFingerprint object containing:
- path (str): Source path or ‘<in-memory>’ for DataFrames
- sha256 (str): SHA-256 hash of the CSV representation
- rows (int): Number of rows in the dataset
- columns (int): Number of columns in the dataset
- Version Control: Track dataset changes across pipeline runs
- Reproducibility: Verify that the same input data is used
- Data Integrity: Detect accidental modifications or corruption
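To show how a hash of the CSV representation supports these use cases, here is a minimal sketch. The helper `sha256_of_frame` is hypothetical, and serializing with `index=False` and UTF-8 encoding is an assumption; the page only states that the hash is taken over the CSV representation.

```python
import hashlib

import pandas as pd


def sha256_of_frame(frame: pd.DataFrame) -> str:
    """Hash the CSV text of a frame (assumed: no index column, UTF-8 bytes)."""
    csv_bytes = frame.to_csv(index=False).encode("utf-8")
    return hashlib.sha256(csv_bytes).hexdigest()


baseline = pd.DataFrame({"player": ["A", "B"], "points": [10, 20]})
unchanged = baseline.copy()
modified = baseline.copy()
modified.loc[1, "points"] = 21  # a single-cell edit

same_data = sha256_of_frame(baseline) == sha256_of_frame(unchanged)
data_changed = sha256_of_frame(baseline) != sha256_of_frame(modified)
```

Identical content yields an identical digest, while even a one-cell change produces a different one, so comparing digests between runs is enough to detect that the input changed.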
DatasetFingerprint
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/ingestion/loader.py:11
File path or ‘<in-memory>’ for DataFrames
SHA-256 hash of the CSV-encoded dataset
Total number of rows
Total number of columns
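The four fields above map naturally onto a small record type. The sketch below uses a frozen dataclass named `DatasetFingerprintSketch`; both the name and the choice of `frozen=True` are assumptions, since the page lists the fields but not how the class is declared. The digest value is a placeholder, not a real hash.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetFingerprintSketch:
    """Hypothetical mirror of the documented DatasetFingerprint fields."""

    path: str      # file path, or "<in-memory>" for DataFrames
    sha256: str    # SHA-256 hex digest of the CSV-encoded dataset
    rows: int      # total number of rows
    columns: int   # total number of columns


record = DatasetFingerprintSketch(
    path="<in-memory>",
    sha256="0" * 64,  # placeholder digest for illustration only
    rows=3,
    columns=2,
)
record_as_dict = asdict(record)  # plain dict, convenient for logging
```

An immutable record is a reasonable fit here because a fingerprint describes a dataset at one point in time and should not be edited afterwards.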
Data Flow
The ingestion stage follows this flow:
Integration with Pipeline
The ingestion stage integrates with the streaming engine.
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:36
The ingestion stage is source-agnostic: the same API works for files, paths, and DataFrames.
Performance Considerations
Memory Efficiency
- Batch mode: Loads entire dataset into memory
- Streaming mode: Processes data in chunks, keeping only one chunk in memory at a time
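The difference between the two modes can be seen by computing the same total both ways. This sketch calls pandas directly rather than the pipeline's own classes, and the inline CSV text is made-up sample data used only for illustration.

```python
import io

import pandas as pd

# Made-up sample data standing in for a CSV file on disk.
CSV_TEXT = "player,points\nA,10\nB,20\nC,30\nD,40\nE,50\n"

# Batch mode: the whole dataset is materialized before any work is done.
batch_frame = pd.read_csv(io.StringIO(CSV_TEXT))
batch_total = int(batch_frame["points"].sum())

# Streaming mode: only one chunk is held at a time; results are accumulated.
stream_total = 0
max_rows_held = 0
for chunk in pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2):
    stream_total += int(chunk["points"].sum())
    max_rows_held = max(max_rows_held, len(chunk))
```

Both totals come out to 150, but the streaming loop never holds more than two rows at once, whereas the batch frame holds all five. The trade-off is that streaming only works directly for computations that can be accumulated chunk by chunk.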
When to Use Each Mode
| Mode | Best For | Memory Usage |
|---|---|---|
| Batch (load) | Small to medium datasets (<1GB) | High |
| Streaming (stream_chunks) | Large datasets or memory-constrained environments | Low |
Next Steps
Preprocessing
Clean and transform the ingested data
Streaming Engine
Learn about real-time pipeline execution