This project trains three separate Random Forest classifiers, one per target variable: overall performance (Rendimiento_labels), elite status (elite_hitter), and plate discipline (clase_disciplina_home). A random forest is an ensemble of decision trees that reduces overfitting through bagging (training each tree on a random bootstrap sample of the data) and random feature selection (evaluating only a random subset of features at each split). The resulting model typically generalizes better than a single decision tree.
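
The two sources of randomness described above can be sketched by hand. The following is a minimal illustration on synthetic data, not the project's training code: the dataset from `make_classification`, the tree count of 50, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Statcast dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    class_sep=1.5, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 50
classes = np.unique(y_tr)
probas = []
for i in range(n_trees):
    # Bagging: draw a bootstrap sample (same size as the data, with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random feature selection: only sqrt(n_features) candidates per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    probas.append(tree.predict_proba(X_te))

# Ensemble prediction: average class probabilities, then pick the top class
avg_proba = np.mean(probas, axis=0)
ensemble_pred = classes[np.argmax(avg_proba, axis=1)]
ensemble_acc = np.mean(ensemble_pred == y_te)
print(f"Hand-rolled bagged ensemble accuracy: {ensemble_acc:.3f}")
```

In practice `RandomForestClassifier` performs these steps internally; the loop is only meant to show where the bootstrap sample and the per-split feature subset enter.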

Why Random Forest?

Using a Random Forest over a standalone decision tree offers several concrete advantages for this dataset:
  • Variance reduction via ensemble averaging: predictions are averaged across 300 trees, smoothing out the idiosyncratic errors of any individual tree.
  • Handles mixed numeric features: all Statcast inputs (percentages, counts, rate stats) are numeric and require no special encoding or scaling for tree-based models.
  • Built-in feature importance: rf.feature_importances_ provides a normalized importance score per feature after training, which can be read directly in baseball terms.
  • Robust to outliers: tree splits depend on the rank ordering of values rather than their magnitudes, so extreme values in metrics like barrels_total or whiffs have little influence on where splits are placed.
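
The outlier-robustness point can be checked directly: because a split depends only on how samples are ordered along a feature, a strictly increasing transform of the inputs should leave the partition of the training rows unchanged. The sketch below uses a single decision tree (the building block of the forest) on made-up integer counts; the data, the labeling rule, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical count-style features (non-negative integers), not real Statcast data
X_counts = rng.integers(0, 60, size=(400, 2)).astype(float)
y_counts = (X_counts[:, 0] + X_counts[:, 1] > 60).astype(int)

# Squaring is strictly increasing on non-negative values and stretches the upper tail
X_stretched = X_counts ** 2

tree_raw = DecisionTreeClassifier(max_depth=8, random_state=42).fit(X_counts, y_counts)
tree_stretched = DecisionTreeClassifier(max_depth=8, random_state=42).fit(X_stretched, y_counts)

# Compare predictions on the training rows under both feature scalings
same_predictions = np.array_equal(
    tree_raw.predict(X_counts), tree_stretched.predict(X_stretched)
)
print("Identical training-set predictions:", same_predictions)
```

One caveat: thresholds are placed midway between neighbouring training values, so unseen values that fall in a gap between training points can still be routed differently after a transform.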
A DecisionTreeClassifier was trained first as a baseline for the performance model. This provides a concrete reference point: any Random Forest result that exceeds the Decision Tree’s metrics represents a measurable gain from the ensemble approach.

Model Configurations

Model 1 — Overall Performance Classifier

Target: Rendimiento_labels (Bajo / Medio / Alto, i.e. low / medium / high)

This model classifies batters into three tiers of general offensive production using four core Statcast metrics that capture expected quality of contact, hard-hit rate, barrel power, and hit count.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features and target
Caracteristicas = ["xwoba", "hardhit_percent", "barrels_total", "hits"]
X = df[Caracteristicas]
y = df["Rendimiento_labels"]

# Train/test split — 80% training, 20% held-out test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
rf = RandomForestClassifier(n_estimators=300, max_depth=8, random_state=42)
rf.fit(X_train, Y_train)

Feature Set

| Feature | Description |
| --- | --- |
| xwoba | Expected wOBA: quality-of-contact adjusted offensive value |
| hardhit_percent | Percentage of batted balls hit at 95+ mph exit velocity |
| barrels_total | Count of batted-ball events in the optimal launch angle / exit velocity zone |
| hits | Total hit count for the season |

Feature Importance Finding

After training, feature importance analysis shows that xwOBA and hits are the strongest predictors of overall performance tier — expected contact quality and actual hit production together capture most of the signal.

Hyperparameter Reference

All three models share a common base configuration. The table below describes each hyperparameter and its purpose:
| Parameter | Value | Purpose |
| --- | --- | --- |
| n_estimators | 300 | Number of decision trees in the ensemble. More trees reduce variance and stabilize predictions; 300 was chosen as a balance between performance and training time. |
| max_depth | 8 | Maximum allowed depth for each individual tree. Limits tree complexity to prevent overfitting on the training set. |
| random_state | 42 | Seed for all random operations (bootstrap sampling, feature selection at each split). Makes results reproducible across runs on the same data and environment. |
| class_weight | 'balanced' | (Elite and Discipline models only) Automatically sets per-class weights inversely proportional to class frequency. Compensates for imbalanced label distributions. |
| n_jobs | -1 | (Discipline model only) Uses all available CPU cores to parallelize tree training. |
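
To make the class_weight='balanced' row concrete, scikit-learn's documented heuristic is `n_samples / (n_classes * np.bincount(y))`. The sketch below applies it to a made-up 90/10 label split (the labels are illustrative, not the project's actual class distribution) and shows how the option would be passed alongside the base configuration from the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 90 non-elite (0) vs 10 elite (1) hitters
y_imbalanced = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_imbalanced)

# Weights computed by scikit-learn's "balanced" heuristic
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imbalanced)

# The same heuristic written out by hand
manual = len(y_imbalanced) / (len(classes) * np.bincount(y_imbalanced))
print(dict(zip(classes.tolist(), weights.round(3).tolist())))

# Base configuration from the table plus class weighting (constructed, not fitted)
rf_weighted = RandomForestClassifier(
    n_estimators=300, max_depth=8, random_state=42, class_weight="balanced"
)
```

For this 90/10 split the minority class receives a weight of 5.0 and the majority class about 0.556, so each class contributes equally in total to the impurity calculations.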
The max_depth=8 setting is the primary regularization control. Without a depth limit (and with scikit-learn's other defaults), each tree keeps splitting until its leaves are pure, effectively memorizing its bootstrap sample and leading to high variance when evaluated on new batters.
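
The effect of the depth cap can be illustrated on synthetic data with deliberately noisy labels. This is a sketch under assumptions (the dataset, the `flip_y=0.2` noise level, and the variable names are invented for the example), using single trees so the memorization effect is easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 reassigns about 20% of labels at random
X_noisy, y_noisy = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y_noisy, test_size=0.2, random_state=42
)

# One tree with no depth limit, one with the cap used on this page
unbounded = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
capped = DecisionTreeClassifier(max_depth=8, random_state=42).fit(X_tr, y_tr)

for name, model in [("unbounded", unbounded), ("max_depth=8", capped)]:
    print(
        f"{name:12s} depth={model.get_depth():2d} "
        f"train={model.score(X_tr, y_tr):.3f} test={model.score(X_te, y_te):.3f}"
    )
```

The unbounded tree fits the training split perfectly, noise included. A large gap between its train and test scores is the overfitting signature that the depth cap is meant to reduce.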

Decision Tree Baseline

Before training the Random Forest for the performance model, a DecisionTreeClassifier was fitted on the same train/test split as a single-model baseline:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=6, min_samples_split=20, random_state=42)
dt.fit(X_train, Y_train)

y_pred_dt = dt.predict(X_test)
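The Decision Tree was evaluated with the same metrics as the Random Forest (accuracy, precision, recall, F1). Its results serve as the reference benchmark: a Random Forest score above these figures is a quantifiable improvement over the single-tree baseline. Because the two models also differ in depth and split settings (max_depth=6 and min_samples_split=20 versus max_depth=8), part of any gap may come from those settings rather than from ensembling alone. See the Evaluation page for the full baseline numbers.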
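
A side-by-side evaluation along these lines could look like the sketch below. It runs on synthetic three-class data rather than the project's DataFrame (the dataset, the macro averaging choice, and the `results` dictionary are assumptions; the Evaluation page remains the source for the real numbers), but reuses the hyperparameters shown on this page.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the Bajo / Medio / Alto target
X_syn, y_syn = make_classification(
    n_samples=900, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

models = {
    "decision_tree": DecisionTreeClassifier(
        max_depth=6, min_samples_split=20, random_state=42
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=300, max_depth=8, random_state=42
    ),
}

results = {}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    # Macro averaging treats the three classes equally (an assumption here)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_te, y_pred, average="macro", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})
```

Both models are scored on the same held-out split, so any difference in the printed metrics reflects the models and their settings rather than the data partition.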

Feature Importance

After training, each model’s feature_importances_ attribute returns a normalized array where all values sum to 1.0. Higher scores indicate stronger predictive contribution:
import matplotlib.pyplot as plt
import numpy as np

# Normalized importance scores from the trained forest, one per feature
importancias = rf.feature_importances_
# Sort ascending so the most important feature ends up at the top of the chart
indices = np.argsort(importancias)

plt.figure(figsize=(10, 6))
plt.title('Feature Importance - Random Forest')
plt.barh(range(len(indices)), importancias[indices], align='center')
plt.yticks(range(len(indices)), [Caracteristicas[i] for i in indices])
plt.xlabel('Importance')
plt.show()
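Feature importances in scikit-learn Random Forests are computed as the mean decrease in impurity (Gini impurity by default for classification): within each tree, the impurity reductions of every split that uses a feature are summed, weighted by the share of samples reaching that split, and the per-tree scores are then averaged across the forest and normalized. Features that are used near the top of many trees and produce cleaner splits therefore receive higher importance scores.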
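
The relationship between per-tree and forest-level importances can be verified with a short check. The sketch below uses synthetic data and a smaller forest (both assumptions for illustration) and relies on each fitted tree in `estimators_` exposing its own `feature_importances_`; averaging those and renormalizing should reproduce the forest's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the Statcast dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(
    n_estimators=50, max_depth=8, random_state=42
).fit(X_syn, y_syn)

# Each tree reports its own normalized mean-decrease-in-impurity scores
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Forest-level importance: average over trees, then renormalize to sum to 1
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()

print("forest  :", forest.feature_importances_.round(4))
print("averaged:", averaged.round(4))
```

The spread of `per_tree` along axis 0 also gives a rough sense of how stable each feature's importance is across trees.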
