Three classification labels are constructed directly from the raw Statcast data. Rather than predicting a numeric value, each target frames batter performance as a classification problem — grouping players into discrete, meaningful categories based on real offensive metrics. This approach allows the Random Forest models to learn which statistical signatures distinguish performance tiers, elite ability, and plate decision-making quality from one another.

Rendimiento_labels — Overall Performance

The first target, Rendimiento_labels, captures a batter’s general offensive production by binning their wOBA (Weighted On-Base Average) into three equal quantile groups: Bajo (Low), Medio (Medium), and Alto (High).

Why wOBA?

wOBA is a comprehensive offensive metric that weights each way of reaching base by its actual run value: a walk is worth less than a single, which is worth less than a double, and so on. This makes it a more accurate measure of overall offensive contribution than simpler stats such as batting average or on-base percentage, which count every hit or every time on base equally.
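As a rough sketch of how such a weighted metric is computed, the function below applies a set of linear weights to counting stats and divides by the standard wOBA denominator (AB + BB - IBB + SF + HBP). The helper name `woba_from_counts` and the weight values are illustrative assumptions, not values taken from this project. Published weights are recalculated every season, and Statcast data already ships a precomputed `woba` column.

```python
# Illustrative linear weights (assumed values, NOT taken from this project).
# Real wOBA weights are recalculated every season.
ILLUSTRATIVE_WEIGHTS = {
    "ubb": 0.69,     # unintentional walk
    "hbp": 0.72,     # hit by pitch
    "single": 0.88,
    "double": 1.25,
    "triple": 1.58,
    "hr": 2.03,
}


def woba_from_counts(ab, bb, ibb, hbp, sf, singles, doubles, triples, hr,
                     weights=ILLUSTRATIVE_WEIGHTS):
    """Return wOBA for one batter from raw counting stats (hypothetical helper)."""
    ubb = bb - ibb  # intentional walks earn no credit
    numerator = (
        weights["ubb"] * ubb
        + weights["hbp"] * hbp
        + weights["single"] * singles
        + weights["double"] * doubles
        + weights["triple"] * triples
        + weights["hr"] * hr
    )
    denominator = ab + bb - ibb + sf + hbp
    # Guard against batters with no qualifying plate appearances
    return numerator / denominator if denominator else float("nan")


# A made-up stat line, used only to exercise the function
example = woba_from_counts(ab=500, bb=60, ibb=5, hbp=5, sf=5,
                           singles=90, doubles=30, triples=3, hr=25)
print(round(example, 3))
```

Because each event carries its own weight, two batters with the same on-base percentage can end up with noticeably different wOBA values.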

Label Construction

Using pd.qcut with q=3 ensures that each class receives approximately the same number of players, producing balanced training labels:
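import pandas as pd

# df is the batter-level Statcast DataFrame; "woba" holds each batter's wOBA
df["Rendimiento_labels"] = pd.qcut(
    df["woba"],
    q=3,  # three quantile bins (terciles)
    labels=["Bajo", "Medio", "Alto"]  # ordered from lowest to highest wOBA
)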

Classes

| Label | Meaning | Quantile Range |
| --- | --- | --- |
| Bajo | Low performance | Bottom third of wOBA |
| Medio | Medium performance | Middle third of wOBA |
| Alto | High performance | Top third of wOBA |

Quantile-based splitting yields near-equal class sizes by design, which avoids the class imbalance that fixed thresholds can produce on skewed distributions. Exact equality is not guaranteed: players tied at a bin edge all fall into the same class.
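The balancing behavior can be checked on synthetic data. The sketch below is a minimal example, assuming normally distributed wOBA-like values (the mean, spread, and the `demo` DataFrame are made up for illustration). It uses `retbins=True` to expose the bin edges that `pd.qcut` chose.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic wOBA-like values (assumption: not real Statcast data).
# Rounding to 3 decimals mimics how wOBA is usually reported and creates ties.
demo = pd.DataFrame({"woba": rng.normal(0.320, 0.035, size=857).round(3)})

labels, edges = pd.qcut(
    demo["woba"],
    q=3,
    labels=["Bajo", "Medio", "Alto"],
    retbins=True,  # also return the numeric bin edges
)
demo["Rendimiento_labels"] = labels

# Class sizes should be close to 857 / 3, give or take ties at the edges
print(demo["Rendimiento_labels"].value_counts().sort_index())

# Four edges: minimum, the two tercile cut points, and maximum
print(edges)
```

Inspecting `edges` is a quick way to see which wOBA values separate Bajo, Medio, and Alto in a given dataset.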

Why These Three Targets?

Each target variable measures a fundamentally different dimension of batter performance. Together, they provide a multi-angle picture of what it means to be an effective MLB hitter:

| Target | Dimension | Question Answered |
| --- | --- | --- |
| Rendimiento_labels | Volume Production | How much offensive value does this batter produce overall? |
| elite_hitter | Peak Performance | Does this batter belong among the very best in the league? |
| plate_discipline | Process Quality | How well does this batter make decisions at the plate? |

A batter can score differently across all three dimensions. A high-contact, low-power hitter might rank Medio on overall performance, 0 on elite status, but Alta on plate discipline. A slugger might be Alto and elite, but only Media on discipline if they chase breaking balls. Modeling these targets separately lets the project uncover which Statcast features drive each dimension.
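One way to see how far the targets diverge is to cross-tabulate them. The sketch below uses a six-row toy DataFrame with invented labels (not project data) and assumes the discipline label lives in the `clase_disciplina_home` column, as in the distribution summary on this page.

```python
import pandas as pd

# Toy labels for six hypothetical batters (invented for illustration)
toy = pd.DataFrame({
    "Rendimiento_labels": ["Medio", "Alto", "Alto", "Bajo", "Medio", "Alto"],
    "elite_hitter": [0, 1, 1, 0, 0, 0],
    "clase_disciplina_home": ["Alta", "Media", "Alta", "Baja", "Media", "Baja"],
})

# Rows: overall performance tier; columns: plate discipline class
perf_vs_discipline = pd.crosstab(
    toy["Rendimiento_labels"], toy["clase_disciplina_home"]
)
print(perf_vs_discipline)

# Rows: overall performance tier; columns: elite flag
perf_vs_elite = pd.crosstab(toy["Rendimiento_labels"], toy["elite_hitter"])
print(perf_vs_elite)
```

If the targets were redundant, each row of the first table would concentrate in a single column. Counts spread across columns indicate that the labels carry different information.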

Label Distribution Summary

# Overall Performance
print(df["Rendimiento_labels"].value_counts())
# Bajo     ~286
# Medio    ~286
# Alto     ~285

# Elite Status
print(df["elite_hitter"].value_counts())
# 0    684
# 1    173

# Plate Discipline
print(df["clase_disciplina_home"].value_counts())
# Media    (majority)
# Baja
# Alta
Rendimiento_labels is intentionally balanced (equal quantile splits), while elite_hitter is intentionally imbalanced (top 20% threshold). clase_disciplina_home uses fixed score bins, so its distribution reflects the natural spread of discipline scores across the batter population.
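The difference between quantile bins and fixed bins can be illustrated on synthetic scores. This is a minimal sketch under stated assumptions: the 0 to 100 score scale, the normal distribution, and the cut points at 40 and 60 are all hypothetical, since the project's actual discipline thresholds are not given on this page.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical discipline scores on a 0-100 scale, clustered around the middle
scores = pd.Series(rng.normal(50, 12, size=857).clip(0, 100))

# Fixed edges (assumed: 40 and 60): class sizes follow the data's shape
fixed = pd.cut(
    scores,
    bins=[0, 40, 60, 100],
    labels=["Baja", "Media", "Alta"],
    include_lowest=True,  # keep a score of exactly 0 inside the first bin
)

# Quantile edges: class sizes are equalized regardless of shape
quant = pd.qcut(scores, q=3, labels=["Baja", "Media", "Alta"])

print(fixed.value_counts().sort_index())
print(quant.value_counts().sort_index())
```

With scores concentrated near the center, the fixed bins put most batters in Media, while the quantile bins split the same scores into near-equal thirds.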
