This dataset forms the foundation of the MLB Performance Analytics project. It captures 911 MLB batters across 76 Statcast features, aggregated over the 2023, 2024, and 2025 regular seasons. Each row represents one player’s cumulative performance profile, blending traditional counting stats with modern Statcast-derived metrics such as expected weighted on-base average (xwOBA), exit velocity, barrel rates, and granular swing mechanics. The breadth of features makes it well-suited for multi-dimensional classification of batter performance using Random Forest models.
Source
The data was retrieved from Baseball Savant Statcast Search — the official MLB platform for granular pitch-by-pitch data. The search was configured to pull regular season data for the 2023, 2024, and 2025 seasons, grouped by batter name (group_by=name) so that each row aggregates all plate appearances for a given player across those seasons.
Dataset Shape
The raw CSV contains 911 rows (one per batter) and 76 columns (features). After dropping all rows with any missing values, the working dataset is reduced to 857 rows.
df.info() confirms the column types and null counts across all 76 features.
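A minimal sketch of that inspection step, assuming the export is loaded into a pandas DataFrame. The file name in the comment is a placeholder rather than the project's actual path, and the three-row frame is made-up data that stands in for the real CSV so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical usage with the real export (the file name is a placeholder):
# df = pd.read_csv("savant_batters.csv")

# Tiny stand-in frame with the same kinds of columns, so the sketch is runnable.
df = pd.DataFrame(
    {
        "player_name": ["Batter, A", "Batter, B", "Batter, C"],
        "pa": [600, 450, 12],
        "xwoba": [0.350, 0.310, 0.290],
        "bat_speed": [72.1, 70.4, None],  # one missing swing-tracking value
    }
)

print(df.shape)  # (rows, columns)

# df.info() prints dtypes and non-null counts; writing to a buffer keeps the text reusable.
buffer = io.StringIO()
df.info(buf=buffer)
print(buffer.getvalue())

# Compact alternatives: number of columns per dtype, and nulls per column.
dtype_counts = df.dtypes.value_counts()
null_counts = df.isna().sum()
print(dtype_counts)
print(null_counts[null_counts > 0])
```

On the real file, the same calls would be expected to report the 911 x 76 shape and the dtype breakdown described in this section.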
The dataset is composed of 51 float64 columns, 24 int64 columns, and 1 object column (player_name). The majority of numeric columns are complete across all 911 rows; missing values are concentrated in Statcast-era batting mechanics metrics.
Feature Categories
The 76 columns span seven distinct categories of baseball metrics. Each category below lists its columns with a brief description.
Player Identification
These two columns uniquely identify each batter in the dataset. They are excluded from model training features but are used for filtering, display, and EDA.
| Column | Description |
|---|---|
| player_id | Unique MLB/Statcast numeric identifier for the batter |
| player_name | Player’s full name in Last, First format (e.g., Judge, Aaron) |
Traditional Batting Stats
Classic counting and rate stats that have been core to baseball analysis for over a century. These form the baseline layer of each batter’s profile.
| Column | Description |
|---|---|
| ba | Batting average (H / AB) |
| iso | Isolated power (SLG − BA); measures raw extra-base power |
| babip | Batting average on balls in play; indicator of luck and defense |
| slg | Slugging percentage (total bases / AB) |
| obp | On-base percentage |
| hits | Total hits across the aggregated seasons |
| abs | Total at-bats |
| singles | Total singles |
| doubles | Total doubles |
| triples | Total triples |
| hrs | Total home runs |
| so | Total strikeouts |
| bb | Total walks (bases on balls) |
| k_percent | Strikeout rate as a percentage of plate appearances |
| bb_percent | Walk rate as a percentage of plate appearances |
| pa | Total plate appearances |
| bip | Balls put in play |
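The rate stats in this table follow from the counting stats by the standard definitions. The sketch below recomputes them for a made-up stat line (not a real player). The function name and the 0 to 100 scaling of the percentage columns are assumptions for illustration; the dataset's own scaling should be checked before comparing values.

```python
def traditional_rates(singles, doubles, triples, hrs, abs_, so, bb, pa):
    """Derive rate stats from counting stats using the standard definitions."""
    hits = singles + doubles + triples + hrs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hrs
    ba = hits / abs_          # batting average: H / AB
    slg = total_bases / abs_  # slugging: total bases / AB
    return {
        "ba": ba,
        "slg": slg,
        "iso": slg - ba,              # isolated power: SLG - BA
        "k_percent": 100 * so / pa,   # assumed to be on a 0-100 scale
        "bb_percent": 100 * bb / pa,  # assumed to be on a 0-100 scale
    }


# Made-up counting line used only to exercise the formulas.
rates = traditional_rates(
    singles=90, doubles=30, triples=2, hrs=28, abs_=500, so=120, bb=60, pa=570
)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Recomputing the rates this way is a quick consistency check on an aggregated export, since each rate should agree with the counting columns in the same row up to rounding.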
Advanced Weighted Metrics
Weighted and Statcast-derived expected metrics that estimate a batter’s offensive value. The expected (x-prefixed) versions are based on quality of contact rather than outcomes, which reduces the influence of luck and defense.
| Column | Description |
|---|---|
| woba | Weighted on-base average; single-number offensive value metric |
| xwoba | Expected wOBA based on contact quality (exit velocity + launch angle) |
| xba | Expected batting average |
| xobp | Expected on-base percentage |
| xslg | Expected slugging percentage |
| wobadiff | Difference between actual wOBA and xwOBA (luck/defense signal) |
| xbadiff | Difference between actual BA and xBA |
| xobpdiff | Difference between actual OBP and xOBP |
| xslgdiff | Difference between actual SLG and xSLG |
| batter_run_value_per_100 | Run value generated by the batter per 100 pitches |
| pitcher_run_value_per_100 | Run value against the pitchers faced per 100 pitches |
| barrels_total | Total number of barreled balls (optimal exit velocity + angle) |
| barrels_per_bbe_percent | Barrel rate per batted ball event (%) |
| barrels_per_pa_percent | Barrel rate per plate appearance (%) |
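The four diff columns pair each actual stat with its expected counterpart. The sketch below shows one way to recompute them for a single row. The sign convention (actual minus expected) is an assumption that the source does not confirm, and the helper name and sample values are made up for illustration.

```python
# Maps each diff column to its (actual, expected) source columns.
GAP_COLUMNS = {
    "wobadiff": ("woba", "xwoba"),
    "xbadiff": ("ba", "xba"),
    "xobpdiff": ("obp", "xobp"),
    "xslgdiff": ("slg", "xslg"),
}


def expected_gaps(row):
    """Return actual-minus-expected gaps, rounded to three decimals."""
    # Assumption: a positive value means the batter outperformed the expected stat.
    return {
        gap: round(row[actual] - row[expected], 3)
        for gap, (actual, expected) in GAP_COLUMNS.items()
    }


# Made-up row, not taken from the dataset.
row = {
    "woba": 0.360, "xwoba": 0.340,
    "ba": 0.280, "xba": 0.265,
    "obp": 0.350, "xobp": 0.345,
    "slg": 0.480, "xslg": 0.500,
}
gaps = expected_gaps(row)
print(gaps)
```

Comparing a recomputed gap with the stored column on a few rows is a simple way to confirm which sign convention the export actually uses.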
Contact Quality / Exit Velocity
Statcast measurements of how hard and at what angle the batter makes contact, together with the bat speed and swing length behind that contact.
| Column | Description |
|---|---|
| launch_speed | Average exit velocity (mph) on batted balls |
| launch_angle | Average launch angle (degrees) on batted balls |
| hardhit_percent | Percentage of batted balls hit at 95+ mph exit velocity |
| hyper_speed | Extreme exit velocity metric for the hardest-hit balls |
| bbdist | Average batted ball distance (feet) |
| bat_speed | Average bat speed at contact (mph) |
| swing_length | Average path length of the bat through the swing zone (feet) |
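To make the aggregation concrete, the sketch below derives an average exit velocity and a hard-hit rate from a list of per-ball exit velocities, using the 95 mph threshold stated for hardhit_percent. The function name and the sample values are illustrative; the dataset itself only contains the already-aggregated columns.

```python
HARD_HIT_MPH = 95.0  # threshold stated in the hardhit_percent description


def contact_summary(exit_velocities):
    """Average exit velocity and hard-hit rate (%) for a list of batted balls."""
    n = len(exit_velocities)
    if n == 0:
        # No batted balls: both metrics are undefined for this batter.
        return None
    hard_hit = sum(1 for ev in exit_velocities if ev >= HARD_HIT_MPH)
    return {
        "launch_speed": sum(exit_velocities) / n,
        "hardhit_percent": 100 * hard_hit / n,
    }


# Made-up exit velocities in mph.
sample = [101.2, 88.5, 95.0, 76.3, 108.9, 92.4, 99.1, 84.0]
summary = contact_summary(sample)
print(summary)
```

A batter with no batted balls has no defined value for either metric, which is one plausible reason these columns can be null for very small samples.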
Pitch Tracking
Metrics describing the characteristics of pitches seen by each batter — velocity, movement, spin, and release point data aggregated across all pitches faced.
| Column | Description |
|---|---|
| pitches | Total pitches seen |
| total_pitches | Total pitches in the broader query window |
| pitch_percent | Percentage of total pitches tracked with full Statcast data |
| velocity | Average pitch velocity faced (mph) |
| effective_speed | Average perceived velocity (adjusted for extension) |
| spin_rate | Average spin rate of pitches faced (RPM) |
| eff_min_vel | Effective velocity differential from minimum pitch velocity |
| release_extension | Average pitcher extension toward home plate (feet) |
| release_pos_z | Average vertical release position of pitches faced (feet) |
| release_pos_x | Average horizontal release position of pitches faced (feet) |
| api_break_z_with_gravity | Total vertical break including gravity (inches) |
| api_break_z_induced | Vertical break induced by spin only (inches) |
| api_break_x_arm | Horizontal break from arm side (inches) |
| api_break_x_batter_in | Horizontal break toward batter (inches) |
Plate Tracking / Swing Mechanics
Where the pitch crosses the plate and how the batter responds — covering swing decisions, miss rates, and fine-grained biomechanical swing path measurements.
| Column | Description |
|---|---|
| plate_x | Average horizontal pitch location at plate crossing (feet from center) |
| plate_z | Average vertical pitch location at plate crossing (feet from ground) |
| swings | Total swings taken |
| whiffs | Total swings and misses |
| takes | Total pitches taken (not swung at) |
| swing_miss_percent | Whiff rate: whiffs / swings (%) |
| attack_angle | Average upward angle of the bat through the hitting zone (degrees) |
| attack_direction | Lateral direction of the swing path (degrees) |
| swing_path_tilt | Tilt of the overall swing plane (degrees) |
| rate_ideal_attack_angle | Proportion of swings with an optimal attack angle |
| arm_angle | Average arm angle of pitchers faced (degrees) |
| intercept_ball_minus_batter_pos_x_inches | Horizontal (side-to-side) offset between ball path and batter position at contact (inches) |
| intercept_ball_minus_batter_pos_y_inches | Depth (front-to-back, along the pitcher-to-plate axis) offset between ball path and batter position at contact (inches) |
| run_exp | Run expectancy added by the batter’s plate appearances |
| pitcher_run_exp | Run expectancy from the pitcher’s perspective |
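The whiff rate in this table is a simple ratio of two counting columns. The sketch below recomputes it and adds a swing rate purely for illustration. The swing rate is not a dataset column, and treating swings plus takes as the full pitch count is an assumption, as are the function name and sample numbers.

```python
def swing_rates(swings, whiffs, takes):
    """Whiff rate (%) as defined for swing_miss_percent, plus an illustrative swing rate."""
    # swing_miss_percent = whiffs / swings; undefined when the batter never swung.
    whiff_rate = 100 * whiffs / swings if swings else None

    # Assumption: every pitch seen is either swung at or taken.
    seen = swings + takes
    swing_rate = 100 * swings / seen if seen else None

    return {"swing_miss_percent": whiff_rate, "swing_rate": swing_rate}


# Made-up totals for a regular and for a batter who never swung.
regular = swing_rates(swings=1000, whiffs=250, takes=1200)
no_swings = swing_rates(swings=0, whiffs=0, takes=5)
print(regular)
print(no_swings)
```

The zero-swing case shows why a ratio column like swing_miss_percent can be null for a few rows even when the counting columns are complete.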
Fielder Positioning
Starting distances (in feet) of each fielder (positions 3–9) at the moment the pitch was delivered, averaged across all tracked plate appearances. These reflect the defensive alignments each batter typically faces.
| Column | Description |
|---|---|
| pos3_int_start_distance | First baseman starting distance (feet) |
| pos4_int_start_distance | Second baseman starting distance (feet) |
| pos5_int_start_distance | Third baseman starting distance (feet) |
| pos6_int_start_distance | Shortstop starting distance (feet) |
| pos7_int_start_distance | Left fielder starting distance (feet) |
| pos8_int_start_distance | Center fielder starting distance (feet) |
| pos9_int_start_distance | Right fielder starting distance (feet) |
Missing Values
Of the 911 raw rows, 54 rows were dropped to produce the clean 857-row working dataset. The null values are concentrated in Statcast biomechanics fields that require a minimum sample threshold to compute reliably. The columns with the most missing values are:
| Column | Missing Rows |
|---|---|
| bat_speed | 46 |
| swing_length | 46 |
| attack_angle | 46 |
| attack_direction | 46 |
| swing_path_tilt | 46 |
| rate_ideal_attack_angle | 46 |
| intercept_ball_minus_batter_pos_x_inches | 46 |
| intercept_ball_minus_batter_pos_y_inches | 46 |
| babip | 13 |
| xba | 13 |
| launch_speed | 13 |
| launch_angle | 13 |
| hardhit_percent | 13 |
| barrels_per_bbe_percent | 13 |
| barrels_per_pa_percent | 13 |
| barrels_total | 13 |
| xslg | 13 |
| xbadiff | 13 |
| xslgdiff | 13 |
| xobp | 12 |
| xobpdiff | 12 |
| hyper_speed | 5 |
| bbdist | 5 |
| swing_miss_percent | 3 |
All rows with at least one missing value are removed with dropna.
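A minimal sketch of the cleaning step on a small made-up frame, assuming pandas and the default dropna behavior (drop a row if any column is null). It also shows why the number of dropped rows can be smaller than the sum of the per-column counts: one row is often missing several fields at once.

```python
import pandas as pd

# Made-up stand-in for the raw export; None marks a missing Statcast value.
df = pd.DataFrame(
    {
        "player_name": ["Batter, A", "Batter, B", "Batter, C", "Batter, D"],
        "xwoba": [0.350, 0.310, None, 0.290],
        "bat_speed": [72.1, None, None, 70.4],
    }
)

# Per-column null counts, largest first (same idea as the table above).
missing_per_column = df.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows with at least one null; Batter C counts once despite two missing fields.
rows_with_any_null = int(df.isna().any(axis=1).sum())
print(rows_with_any_null)

# Default how="any": drop every row that has any missing value.
clean = df.dropna().reset_index(drop=True)
print(len(df), "->", len(clean))
```

In this toy frame the per-column counts sum to three while only two rows are dropped, which mirrors how 46 and 13 missing values per column can still leave only 54 dropped rows in the real data.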
Sample Records
Loading the dataset and previewing the first few rows is straightforward with pandas, for example with read_csv followed by head.
Top Players by xwOBA
xwOBA (expected weighted on-base average) is the dataset’s primary quality-of-contact metric, and it anchors the Performance and Elite classification models. Among batters with a substantial sample size, the top five by xwOBA represent some of the most dominant hitters in the sport across the 2023–2025 seasons:
| Player | xwOBA |
|---|---|
| Aaron Judge | 0.469 |
| Shohei Ohtani | 0.433 |
| Juan Soto | 0.433 |
| Ronald Acuña Jr. | 0.424 |
| Yordan Alvarez | 0.419 |
Aaron Judge’s xwOBA of 0.469 is the highest in the dataset among players with a substantial sample size (457 hits over the aggregated seasons). A handful of players at the very top of the xwOBA leaderboard — such as Jeter Downs (0.523) and Zack Collins (0.519) — have extremely small sample sizes (1–2 hits), making their figures statistically unreliable. The classifier models are applied to the full 857-batter post-clean dataset, so sample-size effects are handled implicitly through the wOBA quantile splits used for labeling.
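The small-sample issue described above can be seen by ranking with and without a minimum-sample filter. The sketch below assumes pandas and uses plate appearances (pa) as the sample-size measure. The helper name, the cutoff of 100, and the four rows are illustrative assumptions, not the project's actual filter or data.

```python
import pandas as pd


def top_by_xwoba(df, n=5, min_pa=0):
    """Return the n highest-xwOBA batters with at least min_pa plate appearances."""
    eligible = df[df["pa"] >= min_pa]
    return eligible.nlargest(n, "xwoba")[["player_name", "pa", "xwoba"]]


# Made-up rows: one brief call-up with an inflated xwOBA on very few plate appearances.
df = pd.DataFrame(
    {
        "player_name": ["Regular, A", "Regular, B", "Callup, C", "Regular, D"],
        "pa": [650, 580, 6, 410],
        "xwoba": [0.410, 0.365, 0.520, 0.330],
    }
)

unfiltered = top_by_xwoba(df, n=2)
filtered = top_by_xwoba(df, n=2, min_pa=100)  # 100 is an arbitrary illustrative cutoff
print(unfiltered)
print(filtered)
```

Without the filter the six-plate-appearance row leads the ranking; with it, the leaderboard contains only batters with enough playing time for the metric to be meaningful.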