Three classification labels are constructed directly from the raw Statcast data. Rather than predicting a numeric value, each target frames batter performance as a classification problem, grouping players into discrete, meaningful categories based on real offensive metrics. This approach allows the Random Forest models to learn which statistical signatures distinguish performance tiers, elite ability, and plate decision-making quality from one another.
- Overall Performance
- Elite Status
- Plate Discipline
Rendimiento_labels — Overall Performance
The first target, Rendimiento_labels, captures a batter’s general offensive production by binning their wOBA (Weighted On-Base Average) into three equal quantile groups: Bajo (Low), Medio (Medium), and Alto (High).
Why wOBA?
wOBA is a comprehensive offensive metric that weights each offensive event by its actual run value: a walk is worth less than a single, which is worth less than a double, and so on. This makes it a more accurate measure of overall offensive contribution than simpler stats like batting average or on-base percentage.
Label Construction
Using pd.qcut with q=3 ensures that each class receives approximately the same number of players, producing balanced training labels.
Classes
| Label | Meaning | Quantile Range |
|---|---|---|
| Bajo | Low performance | Bottom third of wOBA |
| Medio | Medium performance | Middle third of wOBA |
| Alto | High performance | Top third of wOBA |
Quantile-based splitting yields near-equal class sizes by design. This avoids the class imbalance issues that can arise when using fixed thresholds on skewed distributions.
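The label construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `batters` DataFrame, the `woba` column name, and the synthetic values are all assumptions made for the example. Only `pd.qcut` with `q=3` and the three class names come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical batter-level table; the real project derives wOBA from Statcast data.
rng = np.random.default_rng(42)
batters = pd.DataFrame({
    "player_id": range(300),
    "woba": rng.normal(loc=0.320, scale=0.035, size=300),
})

# q=3 places the bin edges at the 1/3 and 2/3 quantiles of wOBA,
# so each class receives roughly a third of the players.
batters["Rendimiento_labels"] = pd.qcut(
    batters["woba"], q=3, labels=["Bajo", "Medio", "Alto"]
)

# Class sizes, listed in category order (Bajo, Medio, Alto).
print(batters["Rendimiento_labels"].value_counts().sort_index())
```

Because the bin edges are computed from the data itself, they move with the sample: the same wOBA value could land in a different class if the batter population changes.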
Why These Three Targets?
Each target variable measures a fundamentally different dimension of batter performance. Together, they provide a multi-angle picture of what it means to be an effective MLB hitter:
| Target | Dimension | Question Answered |
|---|---|---|
| Rendimiento_labels | Volume Production | How much offensive value does this batter produce overall? |
| elite_hitter | Peak Performance | Does this batter belong among the very best in the league? |
| plate_discipline | Process Quality | How well does this batter make decisions at the plate? |
These labels can disagree for the same player: a batter might be Medio on overall performance, 0 on elite status, but Alta on plate discipline. A slugger might be Alto and elite, but only Media on discipline if they chase breaking balls. Modeling these targets separately lets the project uncover which Statcast features drive each dimension.
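One way to picture "modeling these targets separately" is to fit one Random Forest per target and compare feature importances side by side. The sketch below is only an illustration under heavy assumptions: the feature names (`exit_velocity`, `launch_angle`, `chase_rate`), the synthetic data, the rules used to generate each target, and the `Baja` class name are all invented for the example and are not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Hypothetical Statcast-style features (names are illustrative only).
X = pd.DataFrame({
    "exit_velocity": rng.normal(89, 3, n),
    "launch_angle": rng.normal(12, 5, n),
    "chase_rate": rng.uniform(0.15, 0.40, n),
})

# Synthetic targets, each wired to known features so the result is easy to check.
targets = pd.DataFrame({
    "Rendimiento_labels": pd.qcut(
        X["exit_velocity"] + 0.2 * X["launch_angle"],
        q=3, labels=["Bajo", "Medio", "Alto"],
    ).astype(str),
    "elite_hitter": (
        X["exit_velocity"] >= X["exit_velocity"].quantile(0.80)
    ).astype(int),
    "plate_discipline": pd.qcut(
        -X["chase_rate"], q=3, labels=["Baja", "Media", "Alta"]
    ).astype(str),
})

# Fit one independent classifier per target and collect its importances.
importances = {}
for name in targets.columns:
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, targets[name])
    importances[name] = pd.Series(model.feature_importances_, index=X.columns)

# Rows are features, columns are targets.
importance_table = pd.DataFrame(importances)
print(importance_table.round(3))
```

Each column of `importance_table` answers a different question, so a feature that matters little for one target can dominate another.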
Label Distribution Summary
Rendimiento_labels is intentionally balanced (equal quantile splits), while elite_hitter is intentionally imbalanced (top 20% threshold). The plate discipline label, clase_disciplina_home, uses fixed score bins, so its distribution reflects the natural spread of discipline scores across the batter population.
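The three distribution shapes described here can be compared with a short sketch. Several details are assumptions made for illustration, since this section does not specify them: the metric behind the top 20% threshold (wOBA is used as a stand-in), the `discipline_score` column, its 0 to 100 scale, the bin edges, and the `Baja` class name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "woba": rng.normal(0.320, 0.035, n),
    # Hypothetical discipline score on an assumed 0-100 scale.
    "discipline_score": rng.normal(50, 15, n).clip(0, 100),
})

# Balanced by construction: three quantile bins.
df["Rendimiento_labels"] = pd.qcut(
    df["woba"], q=3, labels=["Bajo", "Medio", "Alto"]
)

# Imbalanced by construction: flag the top 20% of an assumed metric.
df["elite_hitter"] = (df["woba"] >= df["woba"].quantile(0.80)).astype(int)

# Fixed bins (edges are assumptions): class shares depend on the data.
df["clase_disciplina_home"] = pd.cut(
    df["discipline_score"], bins=[0, 40, 70, 100],
    labels=["Baja", "Media", "Alta"], include_lowest=True,
)

# Share of players in each class, per label.
for col in ["Rendimiento_labels", "elite_hitter", "clase_disciplina_home"]:
    print(df[col].value_counts(normalize=True).sort_index().round(3), "\n")
```

Under these assumptions, the first label splits close to one third each, the second is about 20% positive, and the third takes whatever shape the score distribution and the chosen edges produce.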