Feature engineering in this project involves two distinct operations: selecting the right subsets of the 76 Statcast columns as inputs (X) for each of the three Random Forest classifiers, and constructing target labels (y) from computed thresholds applied to the raw statistical columns. Rather than feeding all 76 features into each model, each classifier is trained on a focused feature set whose variables have a strong theoretical and empirical connection to the outcome being predicted. The target labels are not sourced from a pre-existing column; they are derived by applying quantile cuts and threshold rules to wOBA and discipline metrics, turning continuous batting statistics into discrete class labels.
Feature Sets per Model
Each of the three classifiers uses a different feature set, chosen to align the input variables with the specific performance dimension being classified.
- Performance Model
- Elite Model
- Plate Discipline Model
The performance classifier predicts a batter’s general offensive tier (Bajo / Medio / Alto). It uses four features that together capture both the quality and quantity of offensive production:
| Feature | Why it was chosen |
|---|---|
| xwoba | Expected weighted on-base average is the single best Statcast summary of overall offensive value, controlling for luck and park effects. It is the most direct predictor of a batter’s offensive tier. |
| hardhit_percent | The percentage of batted balls at 95+ mph exit velocity measures how often a batter makes genuinely hard contact, a consistent differentiator between average and above-average hitters. |
| barrels_total | The raw count of barreled balls (optimal exit velocity + launch angle combinations) captures power production in absolute terms, complementing the rate-based signal in xwOBA and hardhit_percent. |
| hits | Total hit count anchors the feature set in real production volume. It ensures that players with elite quality metrics but minimal plate appearances are not over-classified by rate stats alone. |
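
The four columns in the table can be pulled out as the model input with a plain column selection. The sketch below is illustrative only: the rows are invented toy values, and the constant name `PERFORMANCE_FEATURES` is a hypothetical helper, not something taken from the project code.

```python
import pandas as pd

# Toy rows standing in for the Statcast batter table (values are invented).
df = pd.DataFrame({
    "xwoba": [0.310, 0.365, 0.290, 0.402],
    "hardhit_percent": [38.5, 45.1, 33.0, 52.3],
    "barrels_total": [12, 25, 6, 41],
    "hits": [98, 140, 77, 165],
    "woba": [0.305, 0.371, 0.284, 0.399],  # not a feature; only used to build the label
})

# Hypothetical constant naming the four Performance model features.
PERFORMANCE_FEATURES = ["xwoba", "hardhit_percent", "barrels_total", "hits"]

# Keep only the focused feature set as the input matrix X.
X = df[PERFORMANCE_FEATURES]
print(X.shape)
```

Keeping the feature list in one named constant makes it harder for a label source such as woba to slip into the inputs by accident.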
The target label Rendimiento_labels is constructed from woba (not xwoba), so the model learns to predict realized performance tiers using expected/contact-quality inputs. This separation helps the classifier generalize beyond luck-influenced outcomes.
Target Variable Construction
Each of the three models uses a different labeling strategy. The target variable is always derived programmatically from existing numeric columns and is never manually annotated.
1. Performance Labels: wOBA Quantile Tiers
The Rendimiento_labels column divides all 857 batters into three near-equal groups based on their actual wOBA. Using pd.qcut with q=3 keeps the class distribution balanced, which is important for classifier stability.
| Label | Meaning |
|---|---|
| Bajo | Bottom third of wOBA values: below-average offensive performance |
| Medio | Middle third: average to slightly above-average production |
| Alto | Top third: strong offensive producers |
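
A minimal sketch of the quantile labeling, assuming the wOBA values live in a column named woba. The data here is synthetic (857 random values), so the printed counts illustrate the mechanism rather than the project's real distribution.

```python
import numpy as np
import pandas as pd

# Synthetic wOBA values for 857 batters (invented, not the real dataset).
rng = np.random.default_rng(42)
df = pd.DataFrame({"woba": rng.normal(loc=0.315, scale=0.035, size=857)})

# q=3 cuts at the 1/3 and 2/3 quantiles; labels are assigned low to high.
df["Rendimiento_labels"] = pd.qcut(
    df["woba"], q=3, labels=["Bajo", "Medio", "Alto"]
)

counts = df["Rendimiento_labels"].value_counts()
print(counts)
```

With distinct values the three bins differ by at most one batter. On real data, tied wOBA values at a quantile edge can shift the counts slightly.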
Because pd.qcut places roughly one-third of batters in each bin (857 does not divide evenly by three, so bin sizes can differ by one, and tied wOBA values can shift them slightly), the Performance model starts with an essentially balanced three-class target. No resampling or class weighting is required for this classifier.
2. Elite Hitter Label: Top 20% wOBA Threshold
The elite_hitter column is a binary label: 1 if the batter’s wOBA is at or above the 80th percentile across the full clean dataset, 0 otherwise.
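
The threshold rule can be sketched as follows, assuming the cleaned data is in a DataFrame with a woba column. The values are synthetic and the variable name `threshold` is illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic wOBA values for 857 batters (invented, not the real dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({"woba": rng.normal(loc=0.315, scale=0.035, size=857)})

# 80th percentile of wOBA over the whole table.
threshold = df["woba"].quantile(0.80)

# 1 when wOBA is at or above the threshold, 0 otherwise.
df["elite_hitter"] = (df["woba"] >= threshold).astype(int)

elite_share = df["elite_hitter"].mean()
print(f"threshold={threshold:.3f}, elite share={elite_share:.3f}")
```

The elite share comes out close to 20%, which is the source of the class imbalance handled later with stratification and class weighting.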
3. Plate Discipline Label: Composite Score with Fixed Bins
The discipline target is constructed in two steps. First, a composite disciplina_en_home score is computed as a weighted linear combination of swing behavior metrics. Second, that score is passed to pd.cut with fixed bin edges to assign each batter to a discipline class:
| Label | Score Range | Meaning |
|---|---|---|
| Baja | < 200 | Low plate discipline: aggressive, high whiff/K rates |
| Media | 200 – 800 | Average discipline: typical major league profile |
| Alta | > 800 | High plate discipline: selective, low whiff/K rates |
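
The binning step in the table can be sketched with pd.cut. This example starts from already-computed scores (invented values) and does not reproduce the composite formula. How scores exactly equal to 200 or 800 are assigned is an assumption here: with `right=False`, 200 falls in Media and 800 falls in Alta.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-computed composite scores (invented values).
df = pd.DataFrame(
    {"disciplina_en_home": [150.0, 199.9, 200.0, 512.3, 800.0, 950.4]}
)

# Fixed edges at 200 and 800; infinite outer edges catch every score.
bins = [-np.inf, 200, 800, np.inf]

# right=False makes each bin closed on the left: [-inf, 200), [200, 800), [800, inf).
# This edge handling is an assumption, not confirmed by the source.
df["clase_disciplina_home"] = pd.cut(
    df["disciplina_en_home"],
    bins=bins,
    labels=["Baja", "Media", "Alta"],
    right=False,
)
print(df)
```

Unlike pd.qcut, fixed edges do not guarantee balanced classes, so the resulting bin sizes depend entirely on how the scores are distributed.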
Train/Test Split
All three models use an 80/20 train/test split with random_state=42 for reproducibility. The Elite and Discipline models also pass stratify=y to preserve the class distribution in both splits, which is critical given the class imbalance in elite_hitter and the uneven bin sizes in clase_disciplina_home.
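
A minimal sketch of the stratified variant, using scikit-learn's train_test_split. The feature columns (`feature_a` to `feature_d`) and the data are placeholders invented for illustration; only test_size, random_state=42, and stratify=y follow the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder features and an imbalanced binary target (invented data).
rng = np.random.default_rng(1)
n = 857
X = pd.DataFrame(
    rng.normal(size=(n, 4)),
    columns=["feature_a", "feature_b", "feature_c", "feature_d"],  # hypothetical names
)
y = pd.Series((rng.random(n) < 0.2).astype(int), name="elite_hitter")

# 80/20 split; stratify=y keeps the elite share similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Dropping the stratify argument gives the plain random split, which is adequate when the target is already balanced.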
The Performance model uses uppercase Y for the target variable (Y = df["Rendimiento_labels"]) and omits stratify because pd.qcut already produces a balanced three-class distribution. The Elite and Discipline models use lowercase y and explicitly pass stratify=y to ensure the minority class is proportionally represented in the test set.
Class Distribution
Understanding the class balance in each target variable is critical before training, as imbalanced classes can cause a classifier to ignore the minority class entirely.
Performance Labels (balanced by construction)
Each of the three tiers holds roughly one-third of the 857 batters, as produced by pd.qcut.
Elite Hitter (imbalanced)
Because only about 20% of batters are labeled elite, the Elite model is trained with class_weight='balanced'. The balanced setting automatically adjusts class weights so that the minority class (elite hitters) receives proportionally more influence during tree construction, compensating for the roughly 4:1 ratio.
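
To make the weighting concrete, the sketch below computes the weights scikit-learn derives in balanced mode, `n_samples / (n_classes * count_per_class)`, for an exact 80/20 target, then fits a Random Forest with the same setting. The data is invented, and `random_state=42` on the classifier is an assumption borrowed from the split settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Invented target with an exact 4:1 ratio (80 non-elite, 20 elite).
y = np.array([0] * 80 + [1] * 20)

# Balanced weights: 100 / (2 * 80) = 0.625 and 100 / (2 * 20) = 2.5.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
class_weights = dict(zip([0, 1], weights))
print(class_weights)

# Placeholder features; the classifier applies the same weighting internally.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X, y)
```

Each elite sample counts four times as much as a non-elite one (2.5 / 0.625), which offsets the 4:1 imbalance without resampling the data.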