Feature engineering in this project involves two distinct operations: selecting the right subsets of the 76 Statcast columns as inputs (X) for each of the three Random Forest classifiers, and constructing target labels (y) from computed thresholds applied to the raw statistical columns. Rather than feeding all 76 features into each model, each classifier is trained on a focused feature set whose variables have a strong theoretical and empirical connection to the outcome being predicted. The target labels are not sourced from a pre-existing column — they are derived by applying quantile cuts and threshold rules to wOBA and discipline metrics, turning continuous batting statistics into discrete class labels.

Feature Sets per Model

Each of the three classifiers uses a different feature set, chosen to align the input variables with the specific performance dimension being classified.
The performance classifier predicts a batter’s general offensive tier (Bajo / Medio / Alto, i.e. low / medium / high). It uses four features that together capture both the quality and quantity of offensive production:
Caracteristicas = ["xwoba", "hardhit_percent", "barrels_total", "hits"]

X = df[Caracteristicas]
Y = df["Rendimiento_labels"]
| Feature | Why it was chosen |
| --- | --- |
| xwoba | Expected weighted on-base average is the single best Statcast summary of overall offensive value, controlling for luck and park effects. It is the most direct predictor of a batter’s offensive tier. |
| hardhit_percent | The percentage of batted balls at 95+ mph exit velocity measures how often a batter makes genuinely hard contact, a consistent differentiator between average and above-average hitters. |
| barrels_total | The raw count of barreled balls (optimal exit velocity + launch angle combinations) captures power production in absolute terms, complementing the rate-based signal in xwOBA and hardhit_percent. |
| hits | Total hit count anchors the feature set in real production volume. It ensures that players with elite quality metrics but minimal plate appearances are not over-classified by rate stats alone. |

The target label Rendimiento_labels is constructed from woba (not xwoba), so the model learns to predict realized performance tiers using expected/contact-quality inputs. This separation helps the classifier generalize beyond luck-influenced outcomes.

Target Variable Construction

Each of the three models uses a different labeling strategy. The target variable is always derived programmatically from existing numeric columns — never manually annotated.

1. Performance Labels — wOBA Quantile Tiers

The Rendimiento_labels column divides all 857 batters into three approximately equal-sized groups based on their actual wOBA. Using pd.qcut with q=3 keeps the class distribution balanced, which helps classifier stability.
df["Rendimiento_labels"] = pd.qcut(
    df["woba"],
    q=3,
    labels=["Bajo", "Medio", "Alto"]
)
| Label | Meaning |
| --- | --- |
| Bajo | Bottom third of wOBA values: below-average offensive performance |
| Medio | Middle third: average to slightly above-average production |
| Alto | Top third: strong offensive producers |

Because pd.qcut places roughly one-third of batters in each bin (857 is not divisible by 3, and tied values can shift the cut points slightly), the Performance model starts with a nearly balanced three-class target. No resampling or class weighting is required for this classifier.
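
As a quick check of how balanced the tiers come out, the sketch below applies the same pd.qcut call to 857 synthetic, distinct wOBA-like values. The data and the names woba, tiers, and counts are made up for illustration and are not taken from the project; on the real dataset, tied wOBA values may shift the bin sizes slightly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 857 distinct values in a plausible wOBA range (not the real dataset).
rng = np.random.default_rng(0)
woba = pd.Series(rng.permutation(np.linspace(0.200, 0.450, 857)), name="woba")

# Same call as in the labeling step: three quantile-based tiers.
tiers = pd.qcut(woba, q=3, labels=["Bajo", "Medio", "Alto"])

# With distinct values the bin sizes differ by at most one batter.
counts = tiers.value_counts()
print(counts)
```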

2. Elite Hitter Label — Top 20% wOBA Threshold

The elite_hitter column is a binary label: 1 if the batter’s wOBA is at or above the 80th percentile across the full clean dataset, 0 otherwise.
df["elite_hitter"] = (df["woba"] >= df["woba"].quantile(0.80)).astype(int)
This produces a hard cutoff that separates the top fifth of batters by wOBA from the rest of the field.
The label is imbalanced by design, since only about the top 20% qualify. The resulting class distribution is 684 non-elite (0) vs. 173 elite (1), a roughly 4:1 ratio. See Class Distribution below for how the model handles this imbalance.
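
To see the threshold rule in isolation, here is a minimal sketch on synthetic data. The normal distribution and its parameters are assumptions chosen only to produce wOBA-like numbers, and threshold and elite_share are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real woba column.
rng = np.random.default_rng(1)
df = pd.DataFrame({"woba": rng.normal(loc=0.315, scale=0.035, size=857)})

# 80th-percentile cutoff; batters at or above it are labeled 1.
threshold = df["woba"].quantile(0.80)
df["elite_hitter"] = (df["woba"] >= threshold).astype(int)

# The share of positives should land close to 20%.
elite_share = df["elite_hitter"].mean()
print(f"threshold={threshold:.3f}, elite share={elite_share:.1%}")
```

Because the cutoff is computed from the data itself, the threshold moves whenever rows are added or removed, while the elite share stays near 20%.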

3. Plate Discipline Label — Composite Score with Fixed Bins

The discipline target is constructed in two steps. First, a composite disciplina_en_home score is computed as a weighted linear combination of walk, strikeout, and swing metrics, mixing rate columns (percentages) with raw count columns:
df["disciplina_en_home"] = (
    df["bb_percent"]        * 0.30 +
    df["takes"]             * 0.20 +
    df["pa"]                * 0.10
    - df["k_percent"]       * 0.20
    - df["swing_miss_percent"] * 0.10
    - df["whiffs"]          * 0.05
    - df["swings"]          * 0.05
)
Then, fixed bins are applied via pd.cut to assign each batter to a discipline class:
df["clase_disciplina_home"] = pd.cut(
    df["disciplina_en_home"],
    bins=[-999, 200, 800, 1200],
    labels=["Baja", "Media", "Alta"]
)
| Label | Score Range | Meaning |
| --- | --- | --- |
| Baja | ≤ 200 | Low plate discipline: aggressive, high whiff/K rates |
| Media | > 200 and ≤ 800 | Average discipline: typical major league profile |
| Alta | > 800 and ≤ 1200 | High plate discipline: selective, low whiff/K rates |

The positive weights on bb_percent and takes reward batters who work counts and lay off bad pitches. The negative weights on k_percent, swing_miss_percent, whiffs, and swings penalize free-swingers and batters who chase out of the zone. The pa term adds a mild volume bonus, so that batters with more tracked opportunities score slightly higher when all else is equal.
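
The two steps above can be sketched end to end on toy data. The two batters below are hypothetical, and discipline_score is a helper name introduced here for illustration; only the weights, bins, and labels come from the snippets above. The second half checks how pd.cut treats boundary and out-of-range scores under its default right-closed intervals.

```python
import pandas as pd

def discipline_score(df: pd.DataFrame) -> pd.Series:
    """Composite score using the weights from the snippet above (helper name is illustrative)."""
    return (
        df["bb_percent"] * 0.30
        + df["takes"] * 0.20
        + df["pa"] * 0.10
        - df["k_percent"] * 0.20
        - df["swing_miss_percent"] * 0.10
        - df["whiffs"] * 0.05
        - df["swings"] * 0.05
    )

# Two hypothetical batters: a full-time player and a small-sample player.
toy = pd.DataFrame({
    "bb_percent": [8.0, 5.0],
    "takes": [1300, 100],
    "pa": [600, 50],
    "k_percent": [22.0, 30.0],
    "swing_miss_percent": [25.0, 35.0],
    "whiffs": [250, 30],
    "swings": [1100, 90],
})
toy_scores = discipline_score(toy)  # about 248 and 11

bins = [-999, 200, 800, 1200]
labels = ["Baja", "Media", "Alta"]
toy_classes = pd.cut(toy_scores, bins=bins, labels=labels)
print(toy_classes.tolist())  # ['Media', 'Baja']

# Edge cases: intervals are right-closed, and scores outside the bins become NaN.
edge_scores = pd.Series([200.0, 200.1, 800.0, 1300.0])
edge_classes = pd.cut(edge_scores, bins=bins, labels=labels)
print(edge_classes.tolist())  # ['Baja', 'Media', 'Media', nan]
```

In these toy rows the count columns (takes, pa, swings, whiffs) contribute far more to the score than the percentage columns, so the class depends heavily on playing time. If scores above 1200 can occur in practice, an open-ended upper edge such as float("inf") is one option, though that would be a change to the binning shown above rather than a description of it.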

Train/Test Split

All three models use an 80/20 train/test split with random_state=42 for reproducibility. The Elite and Discipline models also pass stratify=y to preserve the class distribution in both splits, which is critical given the class imbalance in elite_hitter and the uneven bin sizes in clase_disciplina_home.

Performance model (balanced target, no stratification needed):
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Elite and Discipline models (imbalanced targets, stratification required):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
The Performance model uses uppercase Y for the target variable (Y = df["Rendimiento_labels"]) and omits stratify because pd.qcut already produces a balanced three-class distribution. The Elite and Discipline models use lowercase y and explicitly pass stratify=y to ensure the minority class is proportionally represented in the test set.
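
The effect of stratify=y can be checked directly by comparing class shares before and after the split. The sketch below uses random placeholder features and a synthetic label with the 684/173 split quoted for elite_hitter; none of it is the project's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: random features and a label with the 684/173 class split.
rng = np.random.default_rng(42)
y = np.array([0] * 684 + [1] * 173)
X = rng.normal(size=(len(y), 4))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, the positive share is nearly identical in every subset.
print(f"overall elite share: {y.mean():.3f}")
print(f"train elite share:   {y_train.mean():.3f}")
print(f"test elite share:    {y_test.mean():.3f}")
```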

Class Distribution

Understanding the class balance in each target variable is critical before training, as imbalanced classes can cause a classifier to ignore the minority class entirely.

Performance Labels (balanced by construction)

Rendimiento_labels
Bajo      ~286
Medio     ~286
Alto      ~285
Three roughly equal groups, produced automatically by pd.qcut.

Elite Hitter (imbalanced)

elite_hitter
0    684    (non-elite — ~80%)
1    173    (elite     — ~20%)
With only 173 elite batters against 684 non-elite, a classifier can reach about 80% accuracy by predicting “non-elite” for every batter, so without corrective action it tends to neglect the elite class. This is addressed by training the Random Forest with class_weight='balanced':
from sklearn.ensemble import RandomForestClassifier

rf_elite = RandomForestClassifier(
    n_estimators=300,
    max_depth=8,
    class_weight="balanced",
    random_state=42
)
rf_elite.fit(X_train, y_train)
The balanced setting weights each class inversely to its frequency, so the minority class (elite hitters) receives proportionally more influence during tree construction, compensating for the 4:1 ratio.
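
For reference, scikit-learn computes the balanced weights as n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic for the 684/173 split using compute_class_weight from sklearn.utils.class_weight; the label array is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic label array with the class counts reported for elite_hitter.
y = np.array([0] * 684 + [1] * 173)
classes = np.array([0, 1])

# Weights as scikit-learn derives them for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same formula written out by hand: n_samples / (n_classes * count per class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The elite class ends up weighted roughly four times as heavily as the non-elite class (about 2.48 vs. 0.63), mirroring the inverse of the class ratio.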

Plate Discipline Classes (highly skewed)

clase_disciplina_home
Baja      465    (~54%)
Media     361    (~42%)
Alta       31    ( ~4%)
The Alta discipline class contains only 31 batters out of 857 — under 4% of the dataset. This extreme skew means the discipline classifier faces the most challenging class imbalance of the three models. Using class_weight='balanced' and evaluating with macro-averaged precision/recall (rather than accuracy alone) is essential to get a meaningful picture of model performance on the Alta class.
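
To illustrate why accuracy alone can mislead here, the sketch below scores a hypothetical set of predictions that is perfect on Baja and Media but never predicts Alta. The predictions are invented for the example and do not represent the project's model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# True labels with the class counts reported for clase_disciplina_home.
y_true = np.array(["Baja"] * 465 + ["Media"] * 361 + ["Alta"] * 31)

# Hypothetical predictions: perfect on Baja and Media, every Alta batter called Media.
y_pred = np.array(["Baja"] * 465 + ["Media"] * 361 + ["Media"] * 31)

accuracy = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"accuracy:     {accuracy:.3f}")
print(f"macro recall: {macro_recall:.3f}")
```

Accuracy stays above 96% even though no Alta batter is identified, while macro recall falls to about 0.67 because each class counts equally in the average.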
