MLB Performance Analytics is a machine learning project that applies Random Forest classifiers to MLB Statcast data to identify batting patterns and classify player performance. Built on data from the 2023–2025 regular seasons, it covers 911 batters described by 76 Statcast features and answers three core questions: how good is this batter overall, are they elite, and how disciplined are they at the plate?

Introduction

Understand the project goals, methodology, and the three classification targets

Quickstart

Set up your environment and run the notebook in a few steps

Dataset

Explore the 76-feature Statcast dataset sourced from Baseball Savant

Feature Engineering

See how raw Statcast metrics are transformed into model-ready features

Target Variables

Learn how the three classification labels are derived from wOBA and plate metrics

Random Forest Models

Review model architecture, hyperparameters, and training setup

Model Evaluation

Compare accuracy, precision, recall, and F1 scores for each classifier

Results

Explore feature importance findings and top-ranked players per category

What This Project Covers

1. Load and Clean Statcast Data

Import the MLBDATA.csv dataset containing 911 batters and 76 Statcast features. Drop rows with missing values, which leaves 857 clean records.

2. Engineer Three Target Variables

Create Rendimiento_labels (Low/Medium/High performance via wOBA quantiles), elite_hitter (top 20% wOBA), and plate_discipline (strike-zone decision quality).

3. Train Random Forest Classifiers

Fit a separate RandomForestClassifier for each target, using 300 trees and a maximum depth of 8, with class balancing for the elite model.

4. Analyze Feature Importance

Identify which Statcast metrics most strongly predict each classification outcome. xwOBA and hits dominate the performance model; xwOBA and barrels drive the elite model.
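
The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not the project's notebook: it runs on a small synthetic table instead of MLBDATA.csv, and the feature column names (`woba`, `xwoba`, `barrel`, `hit`), the tercile split for Rendimiento_labels, the 80th-percentile cutoff for elite_hitter, and `random_state=42` are all assumptions. The plate_discipline target is left out because this page does not state the rule used to derive it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for MLBDATA.csv. Column names are assumptions,
# not the real dataset schema.
rng = np.random.default_rng(0)
n = 400
xwoba = rng.normal(0.315, 0.035, n)
df = pd.DataFrame({
    "xwoba": xwoba,
    "barrel": rng.poisson(12, n).astype(float),
    "hit": rng.poisson(90, n).astype(float),
    "woba": xwoba + rng.normal(0.0, 0.015, n),
})
# Blank out 20 values so the cleaning step has something to remove.
df.loc[rng.choice(n, 20, replace=False), "barrel"] = np.nan

# Step 1: drop rows with missing values.
clean = df.dropna().reset_index(drop=True)

# Step 2: derive two of the three targets from wOBA.
# Terciles for Low/Medium/High are an assumption; the page only says "quantiles".
clean["Rendimiento_labels"] = pd.qcut(
    clean["woba"], q=3, labels=["Low", "Medium", "High"]
).astype(str)
# Top 20% by wOBA, read here as "at or above the 80th percentile".
clean["elite_hitter"] = (clean["woba"] >= clean["woba"].quantile(0.80)).astype(int)

# Step 3: one classifier per target, 300 trees, max depth 8.
features = ["xwoba", "barrel", "hit"]
X = clean[features]

perf_model = RandomForestClassifier(
    n_estimators=300, max_depth=8, random_state=42
)
perf_model.fit(X, clean["Rendimiento_labels"])

# The elite class is rare, so this model uses balanced class weights.
elite_model = RandomForestClassifier(
    n_estimators=300, max_depth=8, class_weight="balanced", random_state=42
)
elite_model.fit(X, clean["elite_hitter"])

# Step 4: rank features by impurity-based importance.
importances = pd.Series(
    elite_model.feature_importances_, index=features
).sort_values(ascending=False)
print(importances)
```

To try this on the real data, replace the synthetic `df` with `pd.read_csv("MLBDATA.csv")` and set `features` to the actual column names. The wOBA column should stay out of `features`, since both targets are derived from it.
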
This project was developed as an academic capstone for Module 6 of a Data Science & Machine Learning program. All data comes from MLB Baseball Savant and covers the 2023–2025 regular seasons.