Each classifier is evaluated on a held-out test set representing 20% of the data, made up of batters the model never saw during training. Four standard classification metrics from scikit-learn are computed for every model: accuracy, precision, recall, and F1 score. A Decision Tree baseline trained on the same features provides a concrete reference point for measuring the improvement that comes from using a Random Forest ensemble.
Evaluation Metrics
Each metric captures a different aspect of model quality. In the context of classifying MLB batters, they can be interpreted as follows:
Accuracy
The fraction of all batters in the test set that were correctly classified. A model that correctly identifies 80 out of 100 batters has 80% accuracy.
Precision
Of all the batters the model predicted as belonging to class X, what fraction actually belonged to class X? High precision means few false positives: the model is selective and confident when it assigns a label.
Recall
Of all the batters who actually belong to class X, what fraction did the model successfully identify? High recall means few false negatives: the model catches most of the true members of each class.
F1 Score
The harmonic mean of precision and recall. It is particularly valuable when classes are imbalanced, because it penalizes models that sacrifice one metric to inflate the other.
All multi-class metrics (precision, recall, F1) are computed with average='weighted', which calculates the metric for each class and averages them weighted by the number of true instances per class. This accounts for class size differences in the three-class models.
Decision Tree Baseline Results
The Decision Tree classifier, trained on the same features and split as the performance Random Forest, reaches roughly 63% accuracy on the test set. On a balanced three-class problem, where random guessing is right about 33% of the time, that result confirms that the features carry meaningful signal. The question is how much more of that signal the Random Forest can unlock through ensemble learning.
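To make the comparison against chance concrete, the sketch below measures a chance floor and a Decision Tree on the same split. It is illustrative only: the data is synthetic (generated with make_classification, not the Statcast data), and the seed, the stratified split, and the use of DummyClassifier as the chance floor are all assumptions rather than the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the batter data (assumption): 3 perfectly balanced classes.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3,
    n_clusters_per_class=1, class_sep=2.0, flip_y=0.0, random_state=42,
)
# 80/20 split, stratified so the test set stays balanced (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Chance floor: always predict a single class, right for 1 in 3 balanced samples.
chance = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
chance_acc = accuracy_score(y_test, chance.predict(X_test))

# Single Decision Tree baseline on the same split.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree.predict(X_test))

print(f"chance floor:  {chance_acc:.3f}")
print(f"decision tree: {tree_acc:.3f}")
```

The gap between the two numbers is what indicates that the features carry signal; the absolute values on synthetic data say nothing about the real models.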
Evaluation Code
The same evaluation snippet is applied to all three models. Swap in the appropriate rf, X_test, and y_test variables for each classifier.
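A minimal sketch of such a snippet, assuming the four metrics come from sklearn.metrics with average='weighted' as described earlier. The evaluate helper, the synthetic data, the seed, and zero_division=0 are hypothetical choices made for illustration; only the names rf, X_test, and y_test, the 20% test split, and the 300 trees follow the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return accuracy plus weighted precision, recall, and F1 (hypothetical helper)."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_test, y_pred, average="weighted", zero_division=0),
    }


# Synthetic three-class data standing in for the real features (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)

scores = evaluate(rf, X_test, y_test)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

One property worth knowing when reading the output: with weighted averaging, recall always equals accuracy, because each class's recall is weighted by its share of the true labels. Precision and F1 are the metrics that add information beyond accuracy in that setting.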
Per-Model Evaluation
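A minimal sketch of unaveraged, per-class scoring on an imbalanced binary label. It assumes the scores are obtained with average=None in scikit-learn (the exact call is an assumption) and uses small hand-written, hypothetical labels in which 1 marks the minority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = minority (elite), 0 = majority (non-elite).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# average=None returns one score per class, ordered by label (index 0, then 1).
per_class_precision = precision_score(y_true, y_pred, average=None, zero_division=0)
per_class_recall = recall_score(y_true, y_pred, average=None, zero_division=0)
overall_accuracy = accuracy_score(y_true, y_pred)

print(f"accuracy:           {overall_accuracy:.3f}")
print(f"minority precision: {per_class_precision[1]:.3f}")
print(f"minority recall:    {per_class_recall[1]:.3f}")
```

In this toy case overall accuracy is 8 out of 10, while precision and recall on the minority class are both 2 out of 3, which is the kind of gap that per-class scores expose.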
For the elite classifier, precision and recall are computed without averaging to inspect the minority class directly.
Random Forest vs Decision Tree
Random Forest is expected to outperform the Decision Tree baseline across all four metrics. The ensemble mechanism reduces variance by averaging predictions across 300 independently trained trees, each built on a different bootstrap sample with a different random feature subset at each split. Where the single Decision Tree might overfit to specific players in the training data, the Random Forest’s averaged vote smooths out those individual errors — producing more reliable predictions on the held-out test batters.
| Property | Decision Tree | Random Forest |
|---|---|---|
| Interpretability | High — can be visualized as a flowchart | Lower — 300 trees cannot be read directly |
| Overfitting risk | High (especially without depth limits) | Low (ensemble averaging reduces variance) |
| Training time | Fast | Slower (300× the trees) |
| Feature importance | Available | Available — averaged across all trees |
| Generalization | Weaker | Stronger |
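The comparison above can be run side by side on one split. The sketch below is a hedged illustration on synthetic data, so its numbers are not the project's results; the seed, the stratified split, and the default hyperparameters (other than the 300 trees mentioned in the text) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (assumption), split 80/20.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted", zero_division=0),
    }
    print(f"{name}: {results[name]}")
```

Training both models on the identical split is the design choice that matters here: any difference in the scores then reflects the model, not the data each one happened to see.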
Feature Importance Visualization
After evaluating model performance, visualizing feature importances confirms which Statcast metrics are driving each classifier’s decisions. The feature_importances_ attribute returns a normalized array (values sum to 1.0).
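A minimal sketch of ranking and plotting the importances. The feature names are borrowed from this page, but the data is synthetic, so the resulting ranking is meaningless for real batters; plot_importances is a hypothetical helper, and matplotlib is assumed to be installed if it is called.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["xwoba", "hits", "barrels_total", "swing_miss_percent", "bb_percent"]

# Synthetic data with one column per feature name (assumption, not Statcast data).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=2,
    n_classes=3, random_state=42,
)
rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, rf.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for name, value in ranked:
    print(f"{name:<20} {value:.3f}")


def plot_importances(ranked, title="Feature importances"):
    """Draw a horizontal bar chart of (name, importance) pairs (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    # barh draws the first item at the bottom, so reverse to put the top feature on top.
    names = [name for name, _ in ranked][::-1]
    values = [value for _, value in ranked][::-1]
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.barh(names, values)
    ax.set_xlabel("Importance")
    ax.set_title(title)
    fig.tight_layout()
    return fig
```

Calling plot_importances(ranked) returns a matplotlib figure that can be shown or saved; the same two steps apply to each of the three classifiers by swapping in its fitted model and feature list.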
Key Findings by Model
| Model | Most Important Features |
|---|---|
| Performance (Rendimiento_labels) | xwoba, hits |
| Elite Status (elite_hitter) | xwoba, barrels_total |
| Plate Discipline (clase_disciplina_home) | swing_miss_percent, bb_percent |