This dataset forms the foundation of the MLB Performance Analytics project. It captures 911 MLB batters across 76 Statcast features, aggregated over the 2023, 2024, and 2025 regular seasons. Each row represents one player’s cumulative performance profile, blending traditional counting stats with modern Statcast-derived metrics such as expected weighted on-base average (xwOBA), exit velocity, barrel rates, and granular swing mechanics. The breadth of features makes it well-suited for multi-dimensional classification of batter performance using Random Forest models.

Source

The data was retrieved from Baseball Savant Statcast Search — the official MLB platform for granular pitch-by-pitch data. The search was configured to pull regular season data for the 2023, 2024, and 2025 seasons, grouped by batter name (group_by=name) so that each row aggregates all plate appearances for a given player across those seasons.
The full Statcast search URL used to export the data is available in the project notebook under "Origen del dataset" (Spanish for "Dataset origin"). It includes a filter for regular season games (hfGT=R) and covers all three seasons via hfSea=2025|2024|2023.
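
The three query parameters named above can be assembled programmatically. The following is a minimal sketch using only the standard library. The base endpoint `BASE_URL` is an assumption, and the real export URL in the notebook carries additional filters that are not reproduced here.

```python
from urllib.parse import urlencode

# Assumed base endpoint; the exact export URL lives in the project notebook.
BASE_URL = "https://baseballsavant.mlb.com/statcast_search/csv"

# Only the three parameters named in the text; the real URL has more.
params = {
    "hfGT": "R",                 # regular season games
    "hfSea": "2025|2024|2023",   # the three aggregated seasons
    "group_by": "name",          # one row per batter
}

query = urlencode(params)  # percent-encodes each "|" separator as %7C
url = f"{BASE_URL}?{query}"
print(url)
```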

Dataset Shape

The raw CSV contains 911 rows (one per batter) and 76 columns (features). After dropping all rows with any missing values, the working dataset is reduced to 857 rows.
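import pandas as pd

# Path to the raw Statcast export ("ruta" is Spanish for "path")
ruta = "Data/MLBDATA.csv"
df = pd.read_csv(ruta)

# shape is (rows, columns): one row per batter, one column per feature
print(df.shape)
# (911, 76)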
The output of df.info() confirms the column types and null counts across all 76 features:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 76 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   pitches                                   911 non-null    int64
 1   player_id                                 911 non-null    int64
 2   player_name                               911 non-null    object
 3   total_pitches                             911 non-null    int64
 4   pitch_percent                             911 non-null    float64
 5   ba                                        911 non-null    float64
 6   iso                                       911 non-null    float64
 7   babip                                     898 non-null    float64
 8   slg                                       911 non-null    float64
 9   woba                                      911 non-null    float64
10   xwoba                                     911 non-null    float64
11   xba                                       898 non-null    float64
12   hits                                      911 non-null    int64
13   abs                                       911 non-null    int64
14   launch_speed                              898 non-null    float64
15   launch_angle                              898 non-null    float64
16   spin_rate                                 911 non-null    int64
17   velocity                                  911 non-null    float64
18   effective_speed                           911 non-null    float64
19   whiffs                                    911 non-null    int64
20   swings                                    911 non-null    int64
21   takes                                     911 non-null    int64
22   eff_min_vel                               911 non-null    float64
23   release_extension                         911 non-null    float64
24   pos3_int_start_distance                   911 non-null    int64
25   pos4_int_start_distance                   911 non-null    int64
26   pos5_int_start_distance                   911 non-null    int64
27   pos6_int_start_distance                   911 non-null    int64
28   pos7_int_start_distance                   911 non-null    int64
29   pos8_int_start_distance                   911 non-null    int64
30   pos9_int_start_distance                   911 non-null    int64
31   pitcher_run_exp                           911 non-null    float64
32   run_exp                                   911 non-null    float64
33   bat_speed                                 865 non-null    float64
34   swing_length                              865 non-null    float64
35   pa                                        911 non-null    int64
36   bip                                       911 non-null    int64
37   singles                                   911 non-null    int64
38   doubles                                   911 non-null    int64
39   triples                                   911 non-null    int64
40   hrs                                       911 non-null    int64
41   so                                        911 non-null    int64
42   k_percent                                 911 non-null    float64
43   bb                                        911 non-null    int64
44   bb_percent                                911 non-null    float64
45   api_break_z_with_gravity                  911 non-null    float64
46   api_break_z_induced                       911 non-null    float64
47   api_break_x_arm                           911 non-null    float64
48   api_break_x_batter_in                     911 non-null    float64
49   hyper_speed                               906 non-null    float64
50   bbdist                                    906 non-null    float64
51   hardhit_percent                           898 non-null    float64
52   barrels_per_bbe_percent                   898 non-null    float64
53   barrels_per_pa_percent                    898 non-null    float64
54   release_pos_z                             911 non-null    float64
55   release_pos_x                             911 non-null    float64
56   plate_x                                   911 non-null    float64
57   plate_z                                   911 non-null    float64
58   obp                                       911 non-null    float64
59   barrels_total                             898 non-null    float64
60   batter_run_value_per_100                  911 non-null    float64
61   xobp                                      899 non-null    float64
62   xslg                                      898 non-null    float64
63   pitcher_run_value_per_100                 911 non-null    float64
64   xbadiff                                   898 non-null    float64
65   xobpdiff                                  899 non-null    float64
66   xslgdiff                                  898 non-null    float64
67   wobadiff                                  911 non-null    float64
68   swing_miss_percent                        908 non-null    float64
69   arm_angle                                 911 non-null    float64
70   attack_angle                              865 non-null    float64
71   attack_direction                          865 non-null    float64
72   swing_path_tilt                           865 non-null    float64
73   rate_ideal_attack_angle                   865 non-null    float64
74   intercept_ball_minus_batter_pos_x_inches  865 non-null    float64
75   intercept_ball_minus_batter_pos_y_inches  865 non-null    float64
dtypes: float64(51), int64(24), object(1)
memory usage: 541.0+ KB
The dataset is composed of 51 float64 columns, 24 int64 columns, and 1 object column (player_name). Most numeric columns are complete across all 911 rows; missing values are concentrated in the bat-tracking swing metrics (865 non-null each) and, to a lesser extent, in the batted-ball and expected-stat columns (mostly 898 non-null).

Feature Categories

The 76 columns span seven distinct categories of baseball metrics. Each category is introduced below, followed by its full column list with a brief description of every column.

These two columns uniquely identify each batter in the dataset. They are excluded from model training features but are used for filtering, display, and EDA.

| Column | Description |
| --- | --- |
| player_id | Unique MLB/Statcast numeric identifier for the batter |
| player_name | Player’s full name in Last, First format (e.g., Judge, Aaron) |

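
Separating the identifier columns from the model inputs could look like the sketch below. The tiny frame stands in for the real CSV, its names and values are made up, and `ID_COLS` is a hypothetical helper name rather than something taken from the project notebook.

```python
import pandas as pd

# Synthetic stand-in for Data/MLBDATA.csv (values are illustrative only).
df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "player_name": ["Doe, John", "Roe, Rick", "Poe, Pat"],
    "woba": [0.310, 0.345, 0.290],
    "xwoba": [0.320, 0.340, 0.300],
})

ID_COLS = ["player_id", "player_name"]  # hypothetical constant name

ids = df[ID_COLS]                     # kept for filtering, display, and EDA
features = df.drop(columns=ID_COLS)   # everything else feeds the model

print(features.columns.tolist())
```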
Classic counting and rate stats that have been core to baseball analysis for over a century. These form the baseline layer of each batter’s profile.

| Column | Description |
| --- | --- |
| ba | Batting average (H / AB) |
| iso | Isolated power (SLG − BA); measures raw extra-base power |
| babip | Batting average on balls in play; indicator of luck and defense |
| slg | Slugging percentage (total bases / AB) |
| obp | On-base percentage |
| hits | Total hits across the aggregated seasons |
| abs | Total at-bats |
| singles | Total singles |
| doubles | Total doubles |
| triples | Total triples |
| hrs | Total home runs |
| so | Total strikeouts |
| bb | Total walks (bases on balls) |
| k_percent | Strikeout rate as a percentage of plate appearances |
| bb_percent | Walk rate as a percentage of plate appearances |
| pa | Total plate appearances |
| bip | Balls put in play |

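
The rate stats in this table follow directly from the counting stats. As a rough sketch, the formulas given in the descriptions (H / AB, total bases / AB, SLG − BA, and events per plate appearance) can be written as below. The stat line is hypothetical, and the 0 to 100 scale for the percentage columns is an assumption.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """BA = H / AB."""
    return hits / at_bats


def slugging(singles: int, doubles: int, triples: int, hrs: int, at_bats: int) -> float:
    """SLG = total bases / AB."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hrs
    return total_bases / at_bats


def rate_percent(events: int, plate_appearances: int) -> float:
    """Events as a percentage of plate appearances (K% and BB%)."""
    return 100 * events / plate_appearances


# Hypothetical stat line, not taken from the dataset.
line = {"singles": 100, "doubles": 30, "triples": 2, "hrs": 18,
        "abs": 500, "pa": 560, "so": 120, "bb": 50}

hits = line["singles"] + line["doubles"] + line["triples"] + line["hrs"]
ba = batting_average(hits, line["abs"])
slg = slugging(line["singles"], line["doubles"], line["triples"],
               line["hrs"], line["abs"])
iso = slg - ba  # isolated power
k_percent = rate_percent(line["so"], line["pa"])
bb_percent = rate_percent(line["bb"], line["pa"])

print(f"BA={ba:.3f} SLG={slg:.3f} ISO={iso:.3f} "
      f"K%={k_percent:.1f} BB%={bb_percent:.1f}")
```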
Statcast-derived expected and weighted metrics that capture the true offensive value of a batter, adjusting for luck, defense, and park factors.

| Column | Description |
| --- | --- |
| woba | Weighted on-base average; single-number offensive value metric |
| xwoba | Expected wOBA based on contact quality (exit velocity + launch angle) |
| xba | Expected batting average |
| xobp | Expected on-base percentage |
| xslg | Expected slugging percentage |
| wobadiff | Difference between actual wOBA and xwOBA (luck/defense signal) |
| xbadiff | Difference between actual BA and xBA |
| xobpdiff | Difference between actual OBP and xOBP |
| xslgdiff | Difference between actual SLG and xSLG |
| batter_run_value_per_100 | Run value generated by the batter per 100 pitches |
| pitcher_run_value_per_100 | Run value against the pitchers faced per 100 pitches |
| barrels_total | Total number of barreled balls (optimal exit velocity + angle) |
| barrels_per_bbe_percent | Barrel rate per batted ball event (%) |
| barrels_per_pa_percent | Barrel rate per plate appearance (%) |

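
To make the "diff" columns concrete, the sketch below recomputes a wOBA gap for three batters using the wOBA and xWOBA figures shown in the Sample Records preview. The sign convention (actual minus expected) is an assumption, and because the inputs are rounded to three decimals the results may not match the dataset's own wobadiff column exactly.

```python
# (player_name, woba, xwoba) as printed in the Sample Records preview.
rows = [
    ("Olson, Matt", 0.373, 0.366),
    ("Schwarber, Kyle", 0.369, 0.383),
    ("Soto, Juan", 0.402, 0.433),
]

# Assumed convention: actual minus expected.
# Positive -> results outran contact quality; negative -> they lagged it.
diffs = {name: round(woba - xwoba, 3) for name, woba, xwoba in rows}

for name, diff in diffs.items():
    print(f"{name:<16} {diff:+.3f}")
```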
Statcast measurements of how hard and at what angle the batter makes contact: the physical output of each swing that connects.

| Column | Description |
| --- | --- |
| launch_speed | Average exit velocity (mph) on batted balls |
| launch_angle | Average launch angle (degrees) on batted balls |
| hardhit_percent | Percentage of batted balls hit at 95+ mph exit velocity |
| hyper_speed | Extreme exit velocity metric for the hardest-hit balls |
| bbdist | Average batted ball distance (feet) |
| bat_speed | Average bat speed at contact (mph) |
| swing_length | Average path length of the bat through the swing zone (feet) |

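
As an illustration of how launch_speed and hardhit_percent relate, the sketch below derives both from a list of per-ball exit velocities. The velocities are invented, and the dataset itself only ships the aggregated values, so this is a sketch of the definition rather than the pipeline that produced the columns.

```python
HARD_HIT_MPH = 95.0  # threshold from the hardhit_percent description


def hardhit_percent(exit_velocities: list[float]) -> float:
    """Share of batted balls at or above 95 mph, as a percentage."""
    hard = sum(1 for ev in exit_velocities if ev >= HARD_HIT_MPH)
    return 100 * hard / len(exit_velocities)


# Invented exit velocities (mph) for one batter's batted balls.
evs = [88.1, 101.4, 95.0, 72.3, 107.9, 93.6, 98.2, 84.0]

avg_launch_speed = sum(evs) / len(evs)
hard_pct = hardhit_percent(evs)

print(f"launch_speed={avg_launch_speed:.2f} mph, hardhit_percent={hard_pct:.1f}")
```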
Metrics describing the characteristics of pitches seen by each batter: velocity, movement, spin, and release point data aggregated across all pitches faced.

| Column | Description |
| --- | --- |
| pitches | Total pitches seen |
| total_pitches | Total pitches in the broader query window |
| pitch_percent | Percentage of total pitches tracked with full Statcast data |
| velocity | Average pitch velocity faced (mph) |
| effective_speed | Average perceived velocity (adjusted for extension) |
| spin_rate | Average spin rate of pitches faced (RPM) |
| eff_min_vel | Effective velocity differential from minimum pitch velocity |
| release_extension | Average pitcher extension toward home plate (feet) |
| release_pos_z | Average vertical release position of pitches faced (feet) |
| release_pos_x | Average horizontal release position of pitches faced (feet) |
| api_break_z_with_gravity | Total vertical break including gravity (inches) |
| api_break_z_induced | Vertical break induced by spin only (inches) |
| api_break_x_arm | Horizontal break from arm side (inches) |
| api_break_x_batter_in | Horizontal break toward batter (inches) |

Where the pitch crosses the plate and how the batter responds, covering swing decisions, miss rates, and fine-grained biomechanical swing path measurements.

| Column | Description |
| --- | --- |
| plate_x | Average horizontal pitch location at plate crossing (feet from center) |
| plate_z | Average vertical pitch location at plate crossing (feet from ground) |
| swings | Total swings taken |
| whiffs | Total swings and misses |
| takes | Total pitches taken (not swung at) |
| swing_miss_percent | Whiff rate: whiffs / swings (%) |
| attack_angle | Average upward angle of the bat through the hitting zone (degrees) |
| attack_direction | Lateral direction of the swing path (degrees) |
| swing_path_tilt | Tilt of the overall swing plane (degrees) |
| rate_ideal_attack_angle | Proportion of swings with an optimal attack angle |
| arm_angle | Average arm angle of pitchers faced (degrees) |
| intercept_ball_minus_batter_pos_x_inches | Horizontal offset between ball path and batter position at contact (inches) |
| intercept_ball_minus_batter_pos_y_inches | Vertical offset between ball path and batter position at contact (inches) |
| run_exp | Run expectancy added by the batter’s plate appearances |
| pitcher_run_exp | Run expectancy from the pitcher’s perspective |

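
The whiff-rate definition above (whiffs / swings) is undefined for a batter with no recorded swings. The sketch below returns `None` in that case. Whether that is how the three missing swing_miss_percent values arise is a guess, not something the source states, and the counts are hypothetical.

```python
from typing import Optional


def swing_miss_percent(whiffs: int, swings: int) -> Optional[float]:
    """Whiff rate as a percentage, or None when there are no swings."""
    if swings == 0:
        return None  # rate is undefined without at least one swing
    return 100 * whiffs / swings


# Hypothetical counts.
regular = swing_miss_percent(whiffs=30, swings=120)
no_swings = swing_miss_percent(whiffs=0, swings=0)

print(regular, no_swings)
```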
Starting distances (in feet) of each fielder (positions 3–9) at the moment the pitch was delivered, averaged across all tracked plate appearances. These reflect the defensive alignments batters typically face.

| Column | Description |
| --- | --- |
| pos3_int_start_distance | First baseman starting distance (feet) |
| pos4_int_start_distance | Second baseman starting distance (feet) |
| pos5_int_start_distance | Third baseman starting distance (feet) |
| pos6_int_start_distance | Shortstop starting distance (feet) |
| pos7_int_start_distance | Left fielder starting distance (feet) |
| pos8_int_start_distance | Center fielder starting distance (feet) |
| pos9_int_start_distance | Right fielder starting distance (feet) |

Missing Values

Of the 911 raw rows, 54 were dropped to produce the clean 857-row working dataset. The null values are concentrated in Statcast biomechanics fields that require a minimum sample threshold to compute reliably. Every column with at least one missing value is listed below:

| Column | Missing rows |
| --- | --- |
| bat_speed | 46 |
| swing_length | 46 |
| attack_angle | 46 |
| attack_direction | 46 |
| swing_path_tilt | 46 |
| rate_ideal_attack_angle | 46 |
| intercept_ball_minus_batter_pos_x_inches | 46 |
| intercept_ball_minus_batter_pos_y_inches | 46 |
| babip | 13 |
| xba | 13 |
| launch_speed | 13 |
| launch_angle | 13 |
| hardhit_percent | 13 |
| barrels_per_bbe_percent | 13 |
| barrels_per_pa_percent | 13 |
| barrels_total | 13 |
| xslg | 13 |
| xbadiff | 13 |
| xslgdiff | 13 |
| xobp | 12 |
| xobpdiff | 12 |
| hyper_speed | 5 |
| bbdist | 5 |
| swing_miss_percent | 3 |

Null rows are removed in a single operation using pandas dropna:
df.dropna(inplace=True)

print(df.shape)
# (857, 76)
The 46-row overlap in swing mechanics nulls (bat_speed, swing_length, attack_angle, etc.) means most of the dropped rows are missing an entire block of biomechanical measurements. These players typically had very few tracked plate appearances and would not provide reliable signal for the classifiers.
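
Per-column null counts cannot simply be added to get the number of dropped rows, because one row can be missing several fields at once (46 + 13 already exceeds the 54 rows dropped). A minimal sketch of auditing this before calling dropna is shown below. The five-row frame is synthetic and only mimics the pattern of overlapping nulls.

```python
import numpy as np
import pandas as pd

# Synthetic frame: rows B and D lack the swing block, rows C and D lack babip.
df = pd.DataFrame({
    "player_name": ["A", "B", "C", "D", "E"],
    "woba": [0.310, 0.345, 0.290, 0.400, 0.275],
    "bat_speed": [71.2, np.nan, 69.8, np.nan, 70.5],
    "swing_length": [7.3, np.nan, 7.1, np.nan, 7.4],
    "babip": [0.300, 0.310, np.nan, np.nan, 0.280],
})

# Nulls per column, keeping only columns that have any.
null_counts = df.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# Rows with at least one null: this is what dropna() will remove.
rows_with_any_null = int(df.isna().any(axis=1).sum())

# Non-inplace variant keeps the raw frame available for comparison.
clean = df.dropna()

print(null_counts)
print(rows_with_any_null, clean.shape)
```

Here the column counts sum to 6 while only 3 rows are dropped, which is the same overlap effect seen in the real dataset.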

Sample Records

Loading the dataset and previewing the first few rows is straightforward with pandas:
import pandas as pd

df = pd.read_csv("Data/MLBDATA.csv")
df.dropna(inplace=True)

# Preview key columns for the first five batters
cols_preview = ["player_name", "woba", "xwoba", "hardhit_percent", "barrels_total", "hits"]
print(df[cols_preview].head())
Example output (first five rows of the clean dataset):
         player_name   woba  xwoba  hardhit_percent  barrels_total  hits
0        Olson, Matt  0.373  0.366        52.219714          192.0   490
1    Schwarber, Kyle  0.369  0.383        54.763949          206.0   402
2         Soto, Juan  0.402  0.433        55.925926          230.0   474
3  Lindor, Francisco  0.353  0.357        45.344130          162.0   494
4   Arozarena, Randy  0.333  0.332        47.642680          130.0   406

Top Players by xwOBA

xwOBA (expected weighted on-base average) is the dataset’s primary quality-of-contact metric, and it anchors the Performance and Elite classification models. Among batters with a substantial sample size, the top five by xwOBA are some of the most dominant hitters in the sport across the 2023–2025 seasons:

| Player | xwOBA |
| --- | --- |
| Aaron Judge | 0.469 |
| Shohei Ohtani | 0.433 |
| Juan Soto | 0.433 |
| Ronald Acuña Jr. | 0.424 |
| Yordan Alvarez | 0.419 |

Aaron Judge’s xwOBA of 0.469 is the highest in the dataset among players with a substantial sample size (457 hits over the aggregated seasons). A handful of players at the very top of the xwOBA leaderboard — such as Jeter Downs (0.523) and Zack Collins (0.519) — have extremely small sample sizes (1–2 hits), making their figures statistically unreliable. The classifier models are applied to the full 857-batter post-clean dataset, so sample-size effects are handled implicitly through the wOBA quantile splits used for labeling.
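
One way to keep tiny samples off a leaderboard is an explicit minimum-sample filter, sketched below. The named rows reuse xwOBA and hit totals quoted on this page. The two "Small sample" rows are hypothetical stand-ins that borrow the 0.523 and 0.519 figures, and the `MIN_HITS` threshold is an arbitrary assumption, not a value used by the project.

```python
import pandas as pd

df = pd.DataFrame({
    "player_name": ["Judge, Aaron", "Soto, Juan", "Schwarber, Kyle",
                    "Olson, Matt", "Small sample A", "Small sample B"],
    "xwoba": [0.469, 0.433, 0.383, 0.366, 0.523, 0.519],
    "hits": [457, 474, 402, 490, 1, 2],  # last two are hypothetical
})

MIN_HITS = 100  # hypothetical cutoff for a "substantial" sample

# Unfiltered: the small-sample rows float to the top.
raw_top = df.nlargest(3, "xwoba")

# Filtered: only batters meeting the cutoff are ranked.
qualified = df[df["hits"] >= MIN_HITS]
top = qualified.nlargest(3, "xwoba")

print(raw_top[["player_name", "xwoba", "hits"]])
print(top[["player_name", "xwoba", "hits"]])
```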
