Reproducibility is enforced through three mechanisms: global seed control, configuration-driven training, and cryptographic lineage tracking. Every training run produces hashed artifacts that can be verified against a lineage manifest and, under the conditions listed in Limitations, reproduced bit-for-bit.
Seed Control
All random operations are controlled by a global seed defined in config.yaml.
Configuration
seed: 42
Implementation
The seed is set globally before any random operations:
import random
import numpy as np

def set_global_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
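To illustrate why seeding before any random operation matters, here is a minimal sketch using only the standard library. The stand-in set_global_seed omits the NumPy call, and the helper draw is hypothetical; it is not part of the project.

```python
import random


def set_global_seed(seed: int) -> None:
    """Stdlib-only stand-in for the function above (NumPy seeding omitted)."""
    random.seed(seed)


def draw(seed: int, n: int = 5) -> list[float]:
    """Hypothetical helper: reseed, then draw n values from the global generator."""
    set_global_seed(seed)
    return [random.random() for _ in range(n)]


first = draw(42)
second = draw(42)   # same seed: same sequence
other = draw(43)    # different seed: different sequence

print(first == second)
print(first == other)
```

Reseeding with the same value replays the same sequence, which is why the seed must be set before the first random call rather than partway through a run.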
All training operations initialize from this seed:
def main() -> None:
    config = load_config()
    set_global_seed(int(config["seed"]))
    # All subsequent operations use seeded randomness
    df = load_dataset(config)
    X_train, X_test, y_train, y_test = split_data(df, config)
Seeded Components
The seed propagates to every stochastic operation: the train/test split, cross-validation, and model initialization. The train/test split, for example, receives it as random_state:
return train_test_split(
    X,
    y,
    test_size=float(config["data"]["test_size"]),
    random_state=int(config["seed"]),
    stratify=y,
)
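Only the train/test split is shown here. As a rough sketch of how the same seed might be threaded through cross-validation and model initialization, assuming scikit-learn's StratifiedKFold and RandomForestClassifier: the helper names make_cv and make_forest, the trimmed config, the reduced n_estimators, and the synthetic data are all hypothetical and not taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Trimmed, hypothetical config with only the keys this sketch needs.
config = {"seed": 42, "cv": {"n_splits": 5}}


def make_cv(config: dict) -> StratifiedKFold:
    # random_state only takes effect when shuffle=True.
    return StratifiedKFold(
        n_splits=int(config["cv"]["n_splits"]),
        shuffle=True,
        random_state=int(config["seed"]),
    )


def make_forest(config: dict) -> RandomForestClassifier:
    # n_estimators is kept small here so the sketch runs quickly.
    return RandomForestClassifier(
        n_estimators=20,
        min_samples_leaf=2,
        random_state=int(config["seed"]),
    )


# Synthetic data, used only to show that seeded components repeat exactly.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

folds_a = [test.tolist() for _, test in make_cv(config).split(X, y)]
folds_b = [test.tolist() for _, test in make_cv(config).split(X, y)]

proba_a = make_forest(config).fit(X, y).predict_proba(X)
proba_b = make_forest(config).fit(X, y).predict_proba(X)

print(folds_a == folds_b, np.array_equal(proba_a, proba_b))
```

Passing random_state explicitly keeps each component repeatable even if some other code consumes values from the global generator in between.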
Configuration-Driven Training
All training behavior is controlled by declarative configuration, eliminating hard-coded parameters.
Complete Configuration
seed: 42
data:
  path: ml_datasource.csv
  target: purchased
  test_size: 0.2
features:
  epsilon: 1.0e-06
  engagement:
    minutes_watched_weight: 0.6
    days_on_platform_weight: 0.3
    courses_started_weight: 10.0
preprocessing:
  outlier_factor: 1.5
  numeric_imputer: median
  categorical_imputer: most_frequent
models:
  logistic_regression:
    max_iter: 2000
  knn:
    n_neighbors: 7
  svm:
    C: 1.0
    kernel: rbf
    gamma: scale
  decision_tree:
    max_depth: 8
    min_samples_leaf: 10
  random_forest:
    n_estimators: 400
    min_samples_leaf: 2
cv:
  n_splits: 5
business:
  target_precision: 0.9
artifacts:
  model_dir: artifacts
  model_file: best_model.joblib
  threshold_file: threshold.txt
  metrics_file: metrics.json
  drift_baseline_file: drift_baseline.json
  lineage_file: lineage.json
Configuration Loading
Configuration is loaded once and passed through the pipeline:
from pathlib import Path
import yaml

def load_config(path: str | Path = "config.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)
Configuration Validation
Configuration values are validated at runtime: they are cast to their expected types when read (for example, int(config["seed"])), and model initialization checks parameter bounds.
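An explicit check run right after loading could surface bad values earlier than model initialization does. The following is a minimal sketch under that assumption; validate_config and the specific bounds it enforces are hypothetical and are not part of the project.

```python
from typing import Any


def validate_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the config passed."""
    problems: list[str] = []

    seed = config.get("seed")
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(seed, int) or isinstance(seed, bool) or seed < 0:
        problems.append("seed must be a non-negative integer")

    test_size = (config.get("data") or {}).get("test_size")
    if not isinstance(test_size, (int, float)) or not 0.0 < float(test_size) < 1.0:
        problems.append("data.test_size must be strictly between 0 and 1")

    n_splits = (config.get("cv") or {}).get("n_splits")
    if not isinstance(n_splits, int) or n_splits < 2:
        problems.append("cv.n_splits must be an integer >= 2")

    precision = (config.get("business") or {}).get("target_precision")
    if not isinstance(precision, (int, float)) or not 0.0 < float(precision) <= 1.0:
        problems.append("business.target_precision must be in (0, 1]")

    return problems


good = {
    "seed": 42,
    "data": {"test_size": 0.2},
    "cv": {"n_splits": 5},
    "business": {"target_precision": 0.9},
}
bad = {
    "seed": -1,
    "data": {"test_size": 1.5},
    "cv": {"n_splits": 1},
    "business": {"target_precision": 0},
}

print(validate_config(good))
print(validate_config(bad))
```

Returning a list of problems instead of raising on the first one lets a single run report every invalid key at once.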
Lineage Tracking
Every training run generates a lineage manifest containing a run ID and the SHA256 hashes of the dataset, configuration, model, and threshold files.
Lineage Generation
The training script computes hashes of all artifacts:
import hashlib
import json
import uuid
from pathlib import Path

def _sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def _sha256_file(path: Path) -> str:
    return _sha256_bytes(path.read_bytes())

# After training completes
run_id = str(uuid.uuid4())
config_hash = _sha256_file(Path("config.yaml"))
dataset_hash = _sha256_file(Path(config["data"]["path"]))
model_hash = _sha256_file(model_path)
lineage = {
    "run_id": run_id,
    "dataset": {
        "path": config["data"]["path"],
        "sha256": dataset_hash,
    },
    "config": {
        "path": "config.yaml",
        "sha256": config_hash,
    },
    "model": {
        "path": str(model_path),
        "sha256": model_hash,
    },
    "threshold": {
        "path": str(threshold_path),
        "sha256": _sha256_file(threshold_path),
    },
}
lineage_path.write_text(json.dumps(lineage, indent=2), encoding="utf-8")
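Note that _sha256_file reads the whole file into memory, which is fine for small CSVs but may be a concern for large datasets. A chunked variant is sketched below as one possible alternative; sha256_file_chunked and its chunk_size default are hypothetical names and values, not the project's code. It produces the same digest as hashing the full byte string.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file_chunked(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file incrementally, reading at most chunk_size bytes at a time."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: compare against the one-shot hash on a throwaway file.
payload = b"student_id,purchased\n1,0\n2,1\n" * 1000
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(payload)
    chunked = sha256_file_chunked(sample, chunk_size=4096)

one_shot = hashlib.sha256(payload).hexdigest()
print(chunked == one_shot)
```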
Lineage Structure
The lineage file (artifacts/lineage.json) records the path and SHA256 hash of each tracked file. The hash values below are illustrative:
{
  "run_id": "a3f2b891-4c5d-4e2f-9a1b-8c3d5e6f7a8b",
  "dataset": {
    "path": "ml_datasource.csv",
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
  },
  "config": {
    "path": "config.yaml",
    "sha256": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5"
  },
  "model": {
    "path": "artifacts/best_model.joblib",
    "sha256": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2"
  },
  "threshold": {
    "path": "artifacts/threshold.txt",
    "sha256": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3"
  }
}
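Because every entry has the same path/sha256 shape, two manifests can be compared to see which files changed between runs. The sketch below assumes that shape; changed_artifacts and the shortened hash strings are hypothetical and only for illustration.

```python
def changed_artifacts(old: dict, new: dict) -> list[str]:
    """Return the names of entries whose sha256 differs between two manifests."""
    # run_id always differs between runs, so it is excluded from the comparison.
    names = sorted(k for k in set(old) | set(new) if k != "run_id")
    changed = []
    for name in names:
        old_hash = (old.get(name) or {}).get("sha256")
        new_hash = (new.get(name) or {}).get("sha256")
        if old_hash != new_hash:
            changed.append(name)
    return changed


# Shortened placeholder hashes, not real digests.
run_a = {
    "run_id": "run-a",
    "dataset": {"path": "ml_datasource.csv", "sha256": "aaa"},
    "config": {"path": "config.yaml", "sha256": "bbb"},
    "model": {"path": "artifacts/best_model.joblib", "sha256": "ccc"},
}
run_b = {
    "run_id": "run-b",
    "dataset": {"path": "ml_datasource.csv", "sha256": "aaa"},
    "config": {"path": "config.yaml", "sha256": "bbb-edited"},
    "model": {"path": "artifacts/best_model.joblib", "sha256": "ccc-retrained"},
}

print(changed_artifacts(run_a, run_b))
```

A changed model hash with unchanged dataset and config hashes would point to a non-deterministic step or an environment difference rather than an input change.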
Reproducibility Verification
The reproducibility check script validates that artifacts match their lineage hashes.
Running the Check
python scripts/reproducibility_check.py
Implementation
scripts/reproducibility_check.py
import hashlib
import json
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def main() -> None:
    lineage_path = Path("artifacts/lineage.json")
    if not lineage_path.exists():
        raise FileNotFoundError("Missing artifacts/lineage.json. Run training first.")
    lineage = json.loads(lineage_path.read_text(encoding="utf-8"))
    # Each check pairs a tracked file with the hash recorded at training time.
    checks = {
        "dataset": (Path(lineage["dataset"]["path"]), lineage["dataset"]["sha256"]),
        "config": (Path(lineage["config"]["path"]), lineage["config"]["sha256"]),
        "model": (Path(lineage["model"]["path"]), lineage["model"]["sha256"]),
        "threshold": (Path(lineage["threshold"]["path"]), lineage["threshold"]["sha256"]),
    }
    report = {"run_id": lineage.get("run_id"), "checks": {}}
    all_passed = True
    for name, (path, expected) in checks.items():
        actual = sha256(path)
        passed = actual == expected
        all_passed &= passed
        report["checks"][name] = {
            "path": str(path),
            "expected": expected,
            "actual": actual,
            "passed": passed,
        }
    report["passed"] = all_passed
    Path("artifacts/reproducibility_report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
    print(json.dumps(report, indent=2))
    # A non-zero exit code lets CI fail the job on any mismatch.
    if not all_passed:
        raise SystemExit(1)
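As written, sha256(path) raises FileNotFoundError when a tracked file has been deleted, so no report is written in that case. One possible refinement, sketched here under the assumption that a missing file should count as a failed check, is a per-artifact helper; check_artifact is a hypothetical name and not part of the script.

```python
import hashlib
import tempfile
from pathlib import Path


def check_artifact(path: Path, expected: str) -> dict:
    """Compare a file's SHA256 to the expected value; a missing file fails."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None
    return {
        "path": str(path),
        "expected": expected,
        "actual": actual,          # None signals that the file was not found
        "passed": actual == expected,
    }


# Usage on a throwaway directory: one present file, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "threshold.txt"
    present.write_bytes(b"0.5")
    expected = hashlib.sha256(b"0.5").hexdigest()

    ok = check_artifact(present, expected)
    missing = check_artifact(Path(tmp) / "best_model.joblib", expected)

print(ok["passed"], missing["passed"], missing["actual"])
```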
Verification Report
The script generates artifacts/reproducibility_report.json:
artifacts/reproducibility_report.json
{
  "run_id": "a3f2b891-4c5d-4e2f-9a1b-8c3d5e6f7a8b",
  "checks": {
    "dataset": {
      "path": "ml_datasource.csv",
      "expected": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "actual": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "passed": true
    },
    "config": {
      "path": "config.yaml",
      "expected": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5",
      "actual": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5",
      "passed": true
    },
    "model": {
      "path": "artifacts/best_model.joblib",
      "expected": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2",
      "actual": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2",
      "passed": true
    },
    "threshold": {
      "path": "artifacts/threshold.txt",
      "expected": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3",
      "actual": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3",
      "passed": true
    }
  },
  "passed": true
}
Environment Tracking
The script also captures environment metadata:
scripts/reproducibility_check.py
environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "cwd": str(Path.cwd()),
}
Path("artifacts/reproducibility_environment.json").write_text(json.dumps(environment, indent=2), encoding="utf-8")
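Since differing library versions can also break reproducibility, the environment record could be extended with installed package versions. This is a sketch of one way to do that with the standard library's importlib.metadata; package_versions and the list of package names are assumptions, not something the script is known to do.

```python
import platform
import sys
from importlib import metadata
from typing import Optional


def package_versions(names: list[str]) -> dict[str, Optional[str]]:
    """Look up installed distribution versions; None means not installed."""
    versions: dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package list is an assumption about what the training code depends on.
environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "packages": package_versions(["numpy", "scikit-learn", "joblib", "PyYAML"]),
}

print(sorted(environment["packages"]))
```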
Reproducibility Guarantees
Given identical inputs (dataset + config) and an identical environment (see Limitations), training produces:
Identical model parameters (verified by SHA256)
Identical threshold values
Identical cross-validation splits
Identical metric values
Best Practices
Version Configuration: Commit config.yaml to version control alongside the code.
Track Dataset Versions: Store dataset hashes in config/datasets.yaml and validate them before training.
Verify Lineage: Run python scripts/reproducibility_check.py in CI to detect accidental changes.
Archive Artifacts: Store lineage manifests with the model artifacts for audit trails.
Document Seed Changes: Changing the seed produces a different model; record the reason in the commit message.
CI Integration
Add reproducibility checks to CI pipelines:
- name: Train model
  run: python -m src.train
- name: Verify reproducibility
  run: python scripts/reproducibility_check.py
- name: Archive lineage
  uses: actions/upload-artifact@v3
  with:
    name: lineage-${{ github.sha }}
    path: artifacts/lineage.json
Limitations
Reproducibility is not guaranteed when:
The hardware is non-deterministic (for example, a GPU running non-deterministic ops)
Parallel execution order varies (set n_jobs=1 for strict reproducibility)
Python or library versions differ
System-level sources of randomness (for example, Python's per-process string hash randomization) are left uncontrolled
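One concrete source of system-level randomness in Python is string hash randomization, which can change the iteration order of sets between interpreter runs. It is controlled by the PYTHONHASHSEED environment variable, which must be set before the interpreter starts. The sketch below launches child interpreters to show the effect; string_hash_in_child is a hypothetical helper, and whether this matters for a given pipeline is an assumption to check.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed: str) -> str:
    """Run a fresh interpreter with PYTHONHASHSEED set and return hash('purchased')."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('purchased'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# With a fixed PYTHONHASHSEED, separate interpreter runs agree on string hashes.
first = string_hash_in_child("0")
second = string_hash_in_child("0")
print(first == second)
```

Setting the variable inside an already running process has no effect on that process, so in CI it would need to be exported in the step's environment before python is invoked.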