Three separate Random Forest classifiers are trained in this project, one for each target variable: overall performance (`Rendimiento_labels`), elite status (`elite_hitter`), and plate discipline (`clase_disciplina_home`). Random forests are an ensemble of decision trees that reduce overfitting through bagging (training each tree on a random bootstrap sample of the data) and random feature selection (evaluating only a random subset of features at each split). The result is a model that generalizes substantially better than any single decision tree.
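The two sources of randomness described above can be sketched with the standard library alone. This is an illustrative sketch, not the project's code: the helper names are hypothetical, and the choice of two candidate features per split assumes scikit-learn's default `max_features="sqrt"` for classifiers, which the source does not state explicitly.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(features, k, rng):
    """Pick k distinct candidate features to evaluate at a single split."""
    return rng.sample(features, k)


rng = random.Random(42)
features = ["xwoba", "hardhit_percent", "barrels_total", "hits"]

# Sampling with replacement: typically some rows repeat and others are left out.
rows = bootstrap_sample(10, rng)

# Assumed k = 2, i.e. the square root of the four available features.
candidates = random_feature_subset(features, 2, rng)

print(rows)
print(candidates)
```

Each tree in the forest would repeat both steps independently, which is what decorrelates the trees before their predictions are combined.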
Why Random Forest?
Using a Random Forest over a standalone decision tree offers several concrete advantages for this dataset:

- Variance reduction via ensemble averaging: predictions are averaged across 300 trees, smoothing out the idiosyncratic errors of any individual tree.
- Handles mixed numeric features: all Statcast inputs (percentages, counts, rate stats) are numeric and require no special encoding for tree-based models.
- Built-in feature importance: `rf.feature_importances_` provides a normalized importance score per feature after training, directly interpretable for baseball analysis.
- Robust to outliers: tree splits are based on rank ordering rather than raw values, so extreme outliers in metrics like `barrels_total` or `whiffs` do not distort the model.
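The outlier-robustness point can be illustrated with a small sketch. The `barrels_total` values below are made up for illustration, and `split_left` is a hypothetical helper that mimics a single threshold split; it is not part of the project or of scikit-learn.

```python
def split_left(values, threshold):
    """Return the indices that a tree split `value <= threshold` sends left."""
    return [i for i, v in enumerate(values) if v <= threshold]


# Hypothetical barrels_total values for six batters (made up for illustration).
barrels = [5, 12, 18, 25, 31, 40]
# Same batters, but the largest value is replaced by an extreme outlier.
barrels_outlier = [5, 12, 18, 25, 31, 400]

# A candidate threshold placed midway between the neighbouring values 18 and 25.
threshold = 21.5

left_original = split_left(barrels, threshold)
left_outlier = split_left(barrels_outlier, threshold)

# The partition depends only on which side of the threshold each value falls,
# so making the largest value more extreme leaves it unchanged.
print(left_original == left_outlier)
```

Because the partition only depends on ordering relative to the threshold, the magnitude of the outlier has no effect on which batters land in each branch.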
A `DecisionTreeClassifier` was trained first as a baseline for the performance model. This provides a concrete reference point: any Random Forest result that exceeds the Decision Tree's metrics represents a measurable gain from the ensemble approach.

Model Configurations
- Performance Model
- Elite Model
- Plate Discipline Model
Model 1 — Overall Performance Classifier
Target: `Rendimiento_labels` (Bajo / Medio / Alto, i.e. Low / Medium / High)

This model classifies batters into three tiers of general offensive production using four core Statcast metrics that capture expected quality of contact, hard-hit rate, barrel power, and hit count.

Feature Set
| Feature | Description |
|---|---|
| `xwoba` | Expected wOBA: quality-of-contact adjusted offensive value |
| `hardhit_percent` | Percentage of batted balls hit at 95+ mph exit velocity |
| `barrels_total` | Count of batted-ball events in the optimal launch angle/EV zone |
| `hits` | Total hit count for the season |
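
A minimal sketch of how the performance model could be fitted on these four features with scikit-learn follows. The data is synthetic (random numbers standing in for the real Statcast table), the variable names are hypothetical, and the labels are generated artificially, so the output says nothing about real batters. Only the hyperparameter values come from this page.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["xwoba", "hardhit_percent", "barrels_total", "hits"]

# Synthetic stand-in for the real Statcast table: 200 batters x 4 features.
rng = np.random.default_rng(0)
X = rng.random((200, len(FEATURES)))

# Synthetic tiers cut from the first column so there is some signal to learn.
tier_names = np.array(["Bajo", "Medio", "Alto"])
y = tier_names[np.digitize(X[:, 0], [0.33, 0.66])]

# Hyperparameter values taken from the configuration described on this page.
rf_performance = RandomForestClassifier(
    n_estimators=300, max_depth=8, random_state=42
)
rf_performance.fit(X, y)

print(rf_performance.classes_)        # class labels, stored in sorted order
print(rf_performance.predict(X[:5]))  # predicted tier for the first five rows
```

In the real project, `X` and `y` would come from the prepared dataset rather than a random generator.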
Feature Importance Finding
After training, feature importance analysis shows that xwOBA and hits are the strongest predictors of overall performance tier: expected contact quality and actual hit production together capture most of the signal.

Hyperparameter Reference
All three models share a common base configuration. The table below describes each hyperparameter and its purpose:

| Parameter | Value | Purpose |
|---|---|---|
| `n_estimators` | `300` | Number of decision trees in the ensemble. More trees reduce variance and stabilize predictions; 300 provides a good balance between performance and training time. |
| `max_depth` | `8` | Maximum allowed depth for each individual tree. Limits tree complexity to prevent overfitting on the training set. |
| `random_state` | `42` | Seed for all random operations (bootstrap sampling, feature selection at each split). Guarantees identical results across runs. |
| `class_weight` | `'balanced'` | (Elite and Discipline models only) Automatically adjusts per-class sample weights inversely proportional to class frequency. Compensates for imbalanced label distributions. |
| `n_jobs` | `-1` | (Discipline model only) Uses all available CPU cores to parallelize tree training. |
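
To make the `class_weight='balanced'` row concrete, the sketch below reproduces the documented scikit-learn heuristic, `n_samples / (n_classes * class_count)`, in plain Python. The helper name is hypothetical, and the 90/10 label split and the 0/1 encoding of `elite_hitter` are assumptions chosen for illustration, not figures from the project.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution: 90 non-elite batters (0), 10 elite batters (1).
elite_hitter = [0] * 90 + [1] * 10

weights = balanced_class_weights(elite_hitter)
print(weights)
```

Under this assumed distribution the rare class receives a weight of 5.0 and the common class roughly 0.56, so misclassifying an elite batter costs about nine times as much during training.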
The `max_depth=8` setting is the primary regularization control. Without a depth limit, individual trees would keep splitting until their leaves are pure, effectively memorizing the training data and producing high-variance predictions on new batters.

Decision Tree Baseline
Before training the Random Forest for the performance classification task, a `DecisionTreeClassifier` was used as a single-model baseline.
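A baseline comparison along these lines could look like the sketch below. The source does not give the baseline's hyperparameters, the train/test split, or the evaluation metric, so `max_depth=8` on the tree, the 75/25 stratified split, and accuracy are all assumptions. The data is synthetic, so the printed scores are not the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 batters x 4 features, labels derived artificially.
rng = np.random.default_rng(0)
X = rng.random((300, 4))
tier_names = np.array(["Bajo", "Medio", "Alto"])
y = tier_names[np.digitize(X[:, 0] + 0.3 * X[:, 3], [0.4, 0.8])]

# Assumed evaluation setup: a 75/25 split stratified by tier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Single-tree baseline (hyperparameters assumed, not taken from the source).
baseline = DecisionTreeClassifier(max_depth=8, random_state=42)
baseline.fit(X_train, y_train)

# Random Forest with the configuration described on this page.
forest = RandomForestClassifier(n_estimators=300, max_depth=8, random_state=42)
forest.fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"Decision tree accuracy: {baseline_acc:.3f}")
print(f"Random forest accuracy: {forest_acc:.3f}")
```

Scoring both models on the same held-out rows is what makes the difference between the two numbers attributable to the ensemble rather than to the split.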
Feature Importance
After training, each model's `feature_importances_` attribute returns a normalized array whose values sum to 1.0. Higher scores indicate a stronger predictive contribution.
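Reading and ranking the importances might look like the following sketch. The data and labels are synthetic, so the resulting ranking is not the project's finding. The feature names are borrowed from the performance model only to label the columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["xwoba", "hardhit_percent", "barrels_total", "hits"]

# Synthetic stand-in data; the label depends on columns 0 and 3 by construction.
rng = np.random.default_rng(0)
X = rng.random((200, len(FEATURES)))
y = (X[:, 0] + 0.5 * X[:, 3] > 0.75).astype(int)

rf = RandomForestClassifier(n_estimators=300, max_depth=8, random_state=42)
rf.fit(X, y)

# One score per input column, in the same order as the columns of X.
importances = rf.feature_importances_

# Pair each score with its feature name and sort from most to least important.
ranked = sorted(zip(FEATURES, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:>16}: {score:.3f}")

print(f"sum of importances: {importances.sum():.3f}")
```

Since the scores are normalized, each one can be read as a share of the total impurity reduction attributed to that feature across the forest.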