MLB Performance Analytics is a machine learning project that applies Random Forest classifiers to MLB Statcast data to identify batting patterns and classify player performance. Built on data from the 2023–2025 regular seasons, it covers 911 batters across 76 Statcast features and answers three core questions: how good is this batter overall, are they elite, and how disciplined are they at the plate?
Introduction
Understand the project goals, methodology, and the three classification targets
Quickstart
Set up your environment and run the notebook in a few steps
Dataset
Explore the 76-feature Statcast dataset sourced from Baseball Savant
Feature Engineering
See how raw Statcast metrics are transformed into model-ready features
Target Variables
Learn how the three classification labels are derived from wOBA and plate metrics
Random Forest Models
Review model architecture, hyperparameters, and training setup
Model Evaluation
Accuracy, precision, recall, and F1 scores for each classifier
Results
Explore feature importance findings and top-ranked players per category
What This Project Covers
Load and Clean Statcast Data
Import the MLBDATA.csv dataset containing 911 batters and 76 Statcast features. Drop rows with missing values (reducing to 857 clean records).
Engineer Three Target Variables
Create Rendimiento_labels (Low/Medium/High performance via wOBA quantiles), elite_hitter (top 20% wOBA), and plate_discipline (strike-zone decision quality).
Train Random Forest Classifiers
Fit a separate RandomForestClassifier for each target: 300 trees, max depth 8, with class balancing for the elite model.
This project was developed as an academic capstone for Módulo 6 of a Data Science & Machine Learning program. All data comes from MLB Baseball Savant and covers the 2023–2025 regular seasons.
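The three steps listed under What This Project Covers can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: a small synthetic table stands in for MLBDATA.csv, the column names (woba, exit_velocity_avg, k_percent, bb_percent) are hypothetical, the Low/Medium/High split is assumed to be tertiles, and random_state is an arbitrary choice. The plate_discipline target is left out because its exact formula is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset. With the actual file this would be
# something like pd.read_csv("MLBDATA.csv"). Column names are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "woba": rng.normal(0.320, 0.035, n),
    "exit_velocity_avg": rng.normal(88.5, 2.5, n),
    "k_percent": rng.normal(22.0, 5.0, n),
    "bb_percent": rng.normal(8.5, 2.5, n),
})
# Blank out 12 distinct rows to mimic missing values in the raw data.
missing_rows = rng.choice(n, size=12, replace=False)
df.loc[missing_rows, "exit_velocity_avg"] = np.nan

# Step 1: drop every row that has at least one missing value.
clean = df.dropna().reset_index(drop=True)

# Step 2: derive targets from wOBA.
# Assumption: three equal-sized quantile bins (tertiles) for Low/Medium/High.
clean["Rendimiento_labels"] = pd.qcut(
    clean["woba"], q=3, labels=["Low", "Medium", "High"]
).astype(str)
# Elite hitters: wOBA at or above the 80th percentile (the top 20%).
elite_cutoff = clean["woba"].quantile(0.80)
clean["elite_hitter"] = (clean["woba"] >= elite_cutoff).astype(int)

# Step 3: one Random Forest per target, 300 trees, max depth 8.
# wOBA is excluded from the features because both labels are derived from it.
feature_cols = ["exit_velocity_avg", "k_percent", "bb_percent"]
X = clean[feature_cols]

perf_model = RandomForestClassifier(
    n_estimators=300, max_depth=8, random_state=42
)
perf_model.fit(X, clean["Rendimiento_labels"])

# class_weight="balanced" reweights classes inversely to their frequency,
# which helps with the roughly 80/20 split of the elite label.
elite_model = RandomForestClassifier(
    n_estimators=300, max_depth=8, class_weight="balanced", random_state=42
)
elite_model.fit(X, clean["elite_hitter"])

print(len(clean), "clean rows")
print(clean["Rendimiento_labels"].value_counts().to_dict())
print("elite share:", round(clean["elite_hitter"].mean(), 3))
```

Because the synthetic features are drawn independently of wOBA, the fitted models carry no real signal; the sketch only shows the shape of the pipeline and where each stated hyperparameter goes.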