Overview
The monitoring system tracks prediction patterns in real time to detect feature distribution drift and prediction rate shifts. When drift exceeds the configured thresholds, the system recommends retraining the model.
Architecture
- In-Memory Tracking: Cumulative feature sums and prediction counts
- Thread-Safe: Lock-protected updates during concurrent requests
- Baseline Comparison: Z-score computation against training distribution
- Dual Triggers: Feature drift and prediction rate shift detection
Drift Detection Endpoint
GET /monitoring/drift
Returns current drift status and retraining recommendation.
Response Schema:
from typing import List

from pydantic import BaseModel


class DriftStatusResponse(BaseModel):
    samples_observed: int
    drift_score_max_abs_z: float
    drifted_features: List[str]
    predicted_positive_rate: float
    training_positive_rate: float
    should_retrain: bool
    reason: str
    recommended_action: str
Example Request:
curl http://localhost:8000/monitoring/drift
Example Response (No Drift):
{
"samples_observed": 245,
"drift_score_max_abs_z": 2.34,
"drifted_features": [],
"predicted_positive_rate": 0.328,
"training_positive_rate": 0.315,
"should_retrain": false,
"reason": "below_threshold",
"recommended_action": "continue_monitoring"
}
Example Response (Feature Drift Detected):
{
"samples_observed": 512,
"drift_score_max_abs_z": 4.76,
"drifted_features": ["days_on_platform", "minutes_watched"],
"predicted_positive_rate": 0.342,
"training_positive_rate": 0.315,
"should_retrain": true,
"reason": "feature_distribution_drift",
"recommended_action": "trigger_retraining_pipeline"
}
Example Response (Prediction Rate Shift):
{
"samples_observed": 423,
"drift_score_max_abs_z": 2.89,
"drifted_features": [],
"predicted_positive_rate": 0.455,
"training_positive_rate": 0.315,
"should_retrain": true,
"reason": "prediction_rate_shift",
"recommended_action": "trigger_retraining_pipeline"
}
Drift Detection Logic
Implemented in _compute_drift_status() (src/api.py:91-172).
Step 1: Load Baseline
Drift baseline loaded from artifacts/drift_baseline.json (abridged here; the full structure is listed under Drift Baseline Structure):
{
"training_positive_rate": 0.315,
"numeric_feature_stats": {
"days_on_platform": {
"mean": 42.5,
"std": 18.3
},
"minutes_watched": {
"mean": 1245.7,
"std": 523.8
},
"courses_started": {
"mean": 2.8,
"std": 1.4
}
}
}
Generated during training by python -m src.train.
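A minimal sketch of a defensive baseline loader, assuming the file has the shape shown above. The function name load_drift_baseline and the choice to return None on any failure are illustrative assumptions, not the code in src/api.py; a None result would correspond to the baseline_not_loaded state.

```python
import json
from pathlib import Path
from typing import Optional


def load_drift_baseline(path: str) -> Optional[dict]:
    """Return the parsed baseline, or None if it is missing or malformed.

    Hypothetical helper: the real loader in src/api.py may differ.
    """
    p = Path(path)
    if not p.is_file():
        return None
    try:
        baseline = json.loads(p.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # ValueError covers json.JSONDecodeError and decoding errors.
        return None
    if not isinstance(baseline, dict):
        return None
    # Both top-level keys shown in the example above must be present.
    required = ("training_positive_rate", "numeric_feature_stats")
    if any(key not in baseline for key in required):
        return None
    return baseline
```

Usage: baseline = load_drift_baseline("artifacts/drift_baseline.json"), then check for None before computing drift.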
Step 2: Compute Current Statistics
For each prediction, the API updates in-memory tracking (src/api.py:256-260):
with _LOCK:
    for col in numeric_cols:
        _MONITORING["feature_sums"][col] += float(feat_df[col].sum())
    _MONITORING["samples"] += len(feat_df)
    _MONITORING["predicted_positive"] += int(preds.sum())
Current mean computed as:
current_mean = feature_sums[feature] / samples
Step 3: Calculate Z-Scores
For each feature, compute absolute Z-score (src/api.py:130-141):
for feature, s in feature_sums.items():
    current_mean = float(s) / samples
    base_mean = float(baseline_stats[feature]["mean"])
    base_std = max(float(baseline_stats[feature]["std"]), 1e-6)
    abs_z = abs((current_mean - base_mean) / base_std)
    max_abs_z = max(max_abs_z, abs_z)
    if abs_z >= z_threshold:
        drifted_features.append(feature)
Features with abs_z >= z_threshold are flagged as drifted.
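The Z-score step can be tried in isolation with a standalone sketch. The function name feature_z_scores and the sample sums are made up for illustration; the baseline values are the ones from Step 1. Skipping features that have no baseline entry is an assumption, not something the excerpt above does.

```python
from typing import Dict, List, Tuple


def feature_z_scores(
    feature_sums: Dict[str, float],
    samples: int,
    baseline_stats: Dict[str, Dict[str, float]],
    z_threshold: float = 3.0,
) -> Tuple[float, List[str]]:
    """Return (max absolute Z-score, names of features at or above the threshold)."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    max_abs_z = 0.0
    drifted: List[str] = []
    for feature, total in feature_sums.items():
        stats = baseline_stats.get(feature)
        if stats is None:
            continue  # assumption: ignore features without a baseline entry
        current_mean = float(total) / samples
        base_std = max(float(stats["std"]), 1e-6)  # guard against zero std
        abs_z = abs((current_mean - float(stats["mean"])) / base_std)
        max_abs_z = max(max_abs_z, abs_z)
        if abs_z >= z_threshold:
            drifted.append(feature)
    return max_abs_z, drifted


baseline_stats = {
    "days_on_platform": {"mean": 42.5, "std": 18.3},
    "courses_started": {"mean": 2.8, "std": 1.4},
}
# Illustrative running sums over 200 predictions (means of 130.0 and 3.0).
feature_sums = {"days_on_platform": 26000.0, "courses_started": 600.0}
max_abs_z, drifted = feature_z_scores(feature_sums, 200, baseline_stats)
```

Here days_on_platform has a current mean of 130.0, giving (130.0 - 42.5) / 18.3, roughly 4.78, so it is flagged; courses_started is about 0.14 standard deviations from its baseline and is not. The divisor is the training standard deviation of individual samples, not a standard error, so the score does not grow with the number of samples observed.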
Step 4: Compute Prediction Rate Shift
predicted_positive_rate = predicted_positive / samples
class_shift = abs(predicted_positive_rate - training_positive_rate)
Step 5: Determine Retraining Need
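Retraining is recommended only after at least drift_min_samples predictions have been observed; below that, the status is insufficient_samples. Past that point, the conditions are checked in order and the first match sets the reason (src/api.py:146-161):
- Feature drift: len(drifted_features) >= drift_min_features
- Prediction rate shift: class_shift >= class_rate_shift_threshold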
if samples < min_samples:
    reason = "insufficient_samples"
    recommended_action = "collect_more_samples"
else:
    if len(drifted_features) >= min_drifted_features:
        should_retrain = True
        reason = "feature_distribution_drift"
        recommended_action = "trigger_retraining_pipeline"
    elif class_shift >= class_rate_shift:
        should_retrain = True
        reason = "prediction_rate_shift"
        recommended_action = "trigger_retraining_pipeline"
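The decision order can be written as one pure function, which makes it easy to unit test. This is a sketch under assumptions: decide_retraining is a hypothetical name, the defaults mirror the documented configuration defaults, and the final below_threshold fallback is inferred from the states documented for the endpoint rather than copied from src/api.py.

```python
from typing import List, Tuple


def decide_retraining(
    samples: int,
    drifted_features: List[str],
    predicted_positive_rate: float,
    training_positive_rate: float,
    min_samples: int = 100,
    min_drifted_features: int = 2,
    class_rate_shift: float = 0.10,
) -> Tuple[bool, str, str]:
    """Return (should_retrain, reason, recommended_action)."""
    class_shift = abs(predicted_positive_rate - training_positive_rate)
    if samples < min_samples:
        return False, "insufficient_samples", "collect_more_samples"
    # Feature drift is checked first, so it wins when both conditions hold.
    if len(drifted_features) >= min_drifted_features:
        return True, "feature_distribution_drift", "trigger_retraining_pipeline"
    if class_shift >= class_rate_shift:
        return True, "prediction_rate_shift", "trigger_retraining_pipeline"
    return False, "below_threshold", "continue_monitoring"


# A positive rate of 0.455 against a training rate of 0.315 is a shift of about 0.14.
result = decide_retraining(423, [], 0.455, 0.315)
```

Because the checks form an if/elif chain, a response never reports both reasons at once; when both thresholds are crossed, the reason is feature_distribution_drift.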
Configuration
Monitoring parameters in config.yaml:
monitoring:
  drift_min_samples: 100
  drift_zscore_threshold: 3.0
  drift_min_features: 2
  class_rate_shift_threshold: 0.10
  prediction_log_file: artifacts/prediction_log.jsonl
Parameters
drift_min_samples (default: 100): Minimum predictions before drift detection
drift_zscore_threshold (default: 3.0): Z-score threshold for feature drift
drift_min_features (default: 2): Minimum drifted features to trigger retraining
class_rate_shift_threshold (default: 0.10): Absolute prediction rate change threshold
prediction_log_file: Path to JSON Lines prediction log
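One way to apply these defaults in code is a small typed settings object. The sketch below assumes config.yaml has already been parsed into a dictionary (for example with a YAML library); MonitoringConfig and from_dict are hypothetical names, and the default for prediction_log_file is taken from the example above rather than from a documented default.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class MonitoringConfig:
    drift_min_samples: int = 100
    drift_zscore_threshold: float = 3.0
    drift_min_features: int = 2
    class_rate_shift_threshold: float = 0.10
    # Assumption: no default is documented; this is the value from the example.
    prediction_log_file: str = "artifacts/prediction_log.jsonl"

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "MonitoringConfig":
        """Build settings from a parsed config, filling in missing keys."""
        section = raw.get("monitoring") or {}
        known = {f.name for f in fields(cls)}
        unknown = set(section) - known
        if unknown:
            # Fail loudly on typos such as "drift_zscore_treshold".
            raise ValueError(f"unknown monitoring keys: {sorted(unknown)}")
        return cls(**section)


config = MonitoringConfig.from_dict({"monitoring": {"drift_zscore_threshold": 2.5}})
```

Rejecting unknown keys is a design choice for this sketch: a misspelled threshold would otherwise silently fall back to its default.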
Drift Baseline Structure
artifacts/drift_baseline.json contains:
{
"training_positive_rate": 0.315,
"numeric_feature_stats": {
"days_on_platform": {
"mean": 42.5,
"std": 18.3,
"min": 0.0,
"max": 365.0
},
"minutes_watched": {
"mean": 1245.7,
"std": 523.8,
"min": 0.0,
"max": 15000.0
},
"courses_started": {
"mean": 2.8,
"std": 1.4,
"min": 0,
"max": 20
},
"practice_exams_started": {
"mean": 3.2,
"std": 2.1,
"min": 0,
"max": 15
},
"practice_exams_passed": {
"mean": 2.1,
"std": 1.8,
"min": 0,
"max": 15
},
"minutes_spent_on_exams": {
"mean": 145.6,
"std": 89.3,
"min": 0.0,
"max": 800.0
}
}
}
Generated by: Training pipeline (python -m src.train)
Statistics Computed: Mean, std, min, max for all numeric features in training set
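A baseline with this structure could be produced along the following lines. This is a standard-library sketch, not the code in src.train: build_drift_baseline is a hypothetical name, the input is assumed to be plain lists per numeric column, and the use of the population standard deviation is an assumption (the training pipeline may use the sample standard deviation instead).

```python
import json
import statistics
from typing import Dict, List


def build_drift_baseline(
    numeric_columns: Dict[str, List[float]], labels: List[int]
) -> dict:
    """Compute per-feature mean/std/min/max and the training positive rate."""
    if not labels:
        raise ValueError("labels must not be empty")
    stats = {}
    for name, values in numeric_columns.items():
        stats[name] = {
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),  # assumption: population std
            "min": min(values),
            "max": max(values),
        }
    return {
        "training_positive_rate": sum(labels) / len(labels),
        "numeric_feature_stats": stats,
    }


# Tiny illustrative training set, not real data.
baseline = build_drift_baseline({"courses_started": [1, 2, 3, 4]}, [0, 1, 0, 0])
baseline_json = json.dumps(baseline, indent=2)
```

Writing baseline_json to artifacts/drift_baseline.json next to the model artifacts keeps the baseline and the model it describes in step.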
Monitoring States
Baseline Not Loaded
{
"samples_observed": 0,
"drift_score_max_abs_z": 0.0,
"drifted_features": [],
"predicted_positive_rate": 0.0,
"training_positive_rate": 0.0,
"should_retrain": false,
"reason": "baseline_not_loaded",
"recommended_action": "run_training_to_generate_baseline"
}
Cause: drift_baseline.json missing or failed to load
Action: Run training to generate baseline
No Predictions Observed
{
"samples_observed": 0,
"drift_score_max_abs_z": 0.0,
"drifted_features": [],
"predicted_positive_rate": 0.0,
"training_positive_rate": 0.315,
"should_retrain": false,
"reason": "no_predictions_observed",
"recommended_action": "collect_inference_samples"
}
Cause: No predictions made since API startup
Action: Send prediction requests to /predict or /batch_predict
Insufficient Samples
{
"samples_observed": 45,
"drift_score_max_abs_z": 1.23,
"drifted_features": [],
"predicted_positive_rate": 0.356,
"training_positive_rate": 0.315,
"should_retrain": false,
"reason": "insufficient_samples",
"recommended_action": "collect_more_samples"
}
Cause: samples_observed < drift_min_samples
Action: Continue sending predictions until minimum threshold reached
Below Threshold
{
"samples_observed": 245,
"drift_score_max_abs_z": 2.34,
"drifted_features": [],
"predicted_positive_rate": 0.328,
"training_positive_rate": 0.315,
"should_retrain": false,
"reason": "below_threshold",
"recommended_action": "continue_monitoring"
}
Cause: No significant drift detected
Action: Continue monitoring
Feature Distribution Drift
{
"samples_observed": 512,
"drift_score_max_abs_z": 4.76,
"drifted_features": ["days_on_platform", "minutes_watched"],
"predicted_positive_rate": 0.342,
"training_positive_rate": 0.315,
"should_retrain": true,
"reason": "feature_distribution_drift",
"recommended_action": "trigger_retraining_pipeline"
}
Cause: len(drifted_features) >= drift_min_features
Action: Trigger retraining pipeline
Prediction Rate Shift
{
"samples_observed": 423,
"drift_score_max_abs_z": 2.89,
"drifted_features": [],
"predicted_positive_rate": 0.455,
"training_positive_rate": 0.315,
"should_retrain": true,
"reason": "prediction_rate_shift",
"recommended_action": "trigger_retraining_pipeline"
}
Cause: abs(predicted_positive_rate - training_positive_rate) >= class_rate_shift_threshold
Action: Trigger retraining pipeline
Prediction Logging
All predictions logged to artifacts/prediction_log.jsonl (src/api.py:175-192):
{"timestamp_utc": "2026-03-04T10:15:30.123456+00:00", "threshold": 0.482, "predicted_purchase_probability": 0.7234, "predicted_purchase": 1, "features": {"student_country": "US", "days_on_platform": 45, "minutes_watched": 1200.5, "courses_started": 3, "practice_exams_started": 5, "practice_exams_passed": 4, "minutes_spent_on_exams": 180.0}}
{"timestamp_utc": "2026-03-04T10:16:12.789012+00:00", "threshold": 0.482, "predicted_purchase_probability": 0.4512, "predicted_purchase": 0, "features": {"student_country": "CA", "days_on_platform": 30, "minutes_watched": 800.0, "courses_started": 2, "practice_exams_started": 3, "practice_exams_passed": 2, "minutes_spent_on_exams": 120.0}}
Log entries include:
timestamp_utc: ISO 8601 timestamp with timezone
threshold: Model decision threshold
predicted_purchase_probability: Raw probability score
predicted_purchase: Binary prediction (0 or 1)
features: Raw input features
Useful for:
- Offline drift analysis
- Model debugging
- Audit trails
- Retraining data collection
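For offline analysis, the log can be summarized with a few lines of Python. The sketch below assumes each line has the fields shown above; summarize_prediction_log is a hypothetical helper, and it averages only numeric feature values (string features such as student_country are skipped).

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_prediction_log(lines: Iterable[str]) -> Tuple[int, float, Dict[str, float]]:
    """Return (entry count, predicted positive rate, mean of each numeric feature)."""
    count = 0
    positives = 0
    sums: Dict[str, float] = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        count += 1
        positives += int(entry["predicted_purchase"])
        for name, value in entry["features"].items():
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                sums[name] += value
    if count == 0:
        return 0, 0.0, {}
    return count, positives / count, {k: v / count for k, v in sums.items()}


# Two shortened entries in the format shown above.
sample_lines = [
    '{"predicted_purchase": 1, "features": {"student_country": "US", "days_on_platform": 45}}',
    '{"predicted_purchase": 0, "features": {"student_country": "CA", "days_on_platform": 30}}',
]
count, positive_rate, feature_means = summarize_prediction_log(sample_lines)
```

Since an open file is an iterable of lines, the same function accepts a file handle opened on artifacts/prediction_log.jsonl. The resulting means can be compared against drift_baseline.json in the same way the API does.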
Retraining Workflow
- Monitor drift: Poll /monitoring/drift periodically
- Detect trigger: should_retrain: true
- Collect samples: Read prediction_log.jsonl for recent predictions
- Retrain model: Run training pipeline with updated data
- Generate baseline: New drift_baseline.json created
- Reload API: Restart service or implement hot-reload
- Reset monitoring: In-memory stats cleared on restart
Alerting Integration
Integrate drift endpoint with monitoring systems:
#!/bin/bash
# Poll the drift endpoint and alert if retraining is recommended.
# -f makes curl fail on HTTP errors so a broken API does not look like "no drift".
RESPONSE=$(curl -sf http://localhost:8000/monitoring/drift) || exit 1
SHOULD_RETRAIN=$(echo "$RESPONSE" | jq -r '.should_retrain')
if [ "$SHOULD_RETRAIN" = "true" ]; then
    REASON=$(echo "$RESPONSE" | jq -r '.reason')
    FEATURES=$(echo "$RESPONSE" | jq -r '.drifted_features | join(", ")')
    echo "ALERT: Drift detected - $REASON"
    echo "Drifted features: $FEATURES"
    # Trigger retraining pipeline; restart the API only if training succeeds
    python -m src.train && systemctl restart prediction-api
fi
Run via cron:
*/30 * * * * /opt/scripts/check_drift.sh
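The same check can be done from Python using only the standard library, which may be easier to plug into an existing alerting job. The function names fetch_drift_status and build_alert are hypothetical, and the URL is the local one used in the examples above.

```python
import json
import urllib.request
from typing import Optional

DRIFT_URL = "http://localhost:8000/monitoring/drift"


def fetch_drift_status(url: str = DRIFT_URL, timeout: float = 5.0) -> dict:
    """GET the drift endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def build_alert(status: dict) -> Optional[str]:
    """Return an alert message when retraining is recommended, else None."""
    if not status.get("should_retrain", False):
        return None
    features = ", ".join(status.get("drifted_features", [])) or "none"
    reason = status.get("reason", "unknown")
    return f"ALERT: Drift detected - {reason} (drifted features: {features})"
```

A scheduled job would call build_alert(fetch_drift_status()) and forward any non-None message to its notification channel. Keeping build_alert free of network calls lets it be tested against the example responses in this page.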
Thread Safety
Monitoring state protected by thread lock (src/api.py:79):
from threading import Lock

_LOCK = Lock()

# Update during prediction
with _LOCK:
    _MONITORING["samples"] += len(feat_df)
    _MONITORING["predicted_positive"] += int(preds.sum())

# Read during drift check: copy the values while holding the lock
with _LOCK:
    samples = int(_MONITORING["samples"])
    feature_sums = dict(_MONITORING["feature_sums"])
    predicted_positive = int(_MONITORING["predicted_positive"])
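Safe for concurrent requests within a single process. The lock and counters live in process memory, so each worker process keeps its own monitoring state.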
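The locking pattern can be exercised outside the API with a self-contained sketch. DriftTracker is a hypothetical class that wraps the same three counters; it takes plain dictionaries instead of a DataFrame so the example needs no third-party packages.

```python
from collections import defaultdict
from threading import Lock, Thread
from typing import Dict, List, Tuple


class DriftTracker:
    """Cumulative counters guarded by one lock (illustrative, not src/api.py)."""

    def __init__(self) -> None:
        self._lock = Lock()
        self._feature_sums: Dict[str, float] = defaultdict(float)
        self._samples = 0
        self._predicted_positive = 0

    def update(self, rows: List[Dict[str, float]], preds: List[int]) -> None:
        # All counters change under the same lock, so readers never see
        # a sample count that is out of step with the feature sums.
        with self._lock:
            for row in rows:
                for name, value in row.items():
                    self._feature_sums[name] += float(value)
            self._samples += len(rows)
            self._predicted_positive += int(sum(preds))

    def snapshot(self) -> Tuple[int, Dict[str, float], int]:
        # Copy under the lock; callers compute drift on the copy.
        with self._lock:
            return self._samples, dict(self._feature_sums), self._predicted_positive


tracker = DriftTracker()


def worker() -> None:
    for _ in range(1000):
        tracker.update([{"courses_started": 2.0}], [1])


threads = [Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
samples, feature_sums, predicted_positive = tracker.snapshot()
```

After the eight threads finish, the snapshot accounts for all 8000 updates. Taking a copy in snapshot keeps the drift computation itself outside the lock, so slow reads do not block prediction requests.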
Limitations
- In-memory state: Monitoring resets on API restart (not persistent)
- Global baseline: Single baseline for all models (no A/B test support)
- Cumulative tracking: No sliding window or time-based decay
- Simple Z-score: Assumes Gaussian distributions
- No multivariate drift detection: Only per-feature means are tracked, so changes in variance or in the joint distribution of features go undetected
Best Practices
- Set appropriate thresholds: Tune drift_zscore_threshold based on false positive rate
- Monitor continuously: Poll /monitoring/drift every 15-30 minutes
- Review drifted features: Investigate why specific features drift
- Collect ground truth: Track actual purchase outcomes for retraining labels
- Version baselines: Store drift_baseline.json with model artifacts
- Test retraining: Validate new models before production deployment