This guide walks you through everything you need to go from a fresh clone to a fully running classification pipeline. By the end, you will have loaded the 911-batter Statcast dataset, engineered the three target variables, trained all three Random Forest classifiers, evaluated their performance, and generated feature importance charts, all inside a single Jupyter notebook.
Steps
Set Up a Python Environment
Create an isolated environment to avoid dependency conflicts. Use either venv, which ships with Python (python -m venv), or conda (conda create).
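Once the environment is activated, a short check like the one below can confirm that Python is really running inside it. This is a minimal sketch: the helper name active_environment is hypothetical, and the conda branch assumes the environment was activated in the usual way, which sets the CONDA_DEFAULT_ENV variable.

```python
import os
import sys


def active_environment() -> str:
    """Return a short label for the kind of environment this interpreter runs in."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    if sys.prefix != sys.base_prefix:
        return "venv"
    # conda exports CONDA_DEFAULT_ENV when an environment is activated.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env and conda_env != "base":
        return "conda"
    return "global"


print(f"Interpreter: {sys.executable}")
print(f"Environment: {active_environment()}")
```

If the label comes back as global, the packages in the next step would be installed into the system-wide interpreter rather than the isolated environment.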
Install Dependencies
The notebook imports pandas, numpy, matplotlib, seaborn, and scikit-learn. Install them all at once, along with Jupyter, for example with pip install pandas numpy matplotlib seaborn scikit-learn jupyter.
Configure the Data Path
The notebook contains a hardcoded Windows absolute path that must be updated before the notebook will run on any other machine.
The first code cell in MLBStats.ipynb sets ruta = r"D:\Python\Modulo 6\Databases\MLBDATA.csv". This path does not exist outside the original development machine, so you must replace it with a path that points to Data/MLBDATA.csv inside the cloned repository.
Open the notebook and update the ruta variable in the first code cell. The simplest option is the relative path "../Data/MLBDATA.csv", which works correctly when Jupyter is launched from the repository root and the notebook is opened from the Notebook/ subdirectory.
Run the Notebook
Launch Jupyter from the repository root (for example, with the jupyter notebook command). Once the notebook opens in your browser, run all cells in order with Kernel → Restart & Run All. The full pipeline (data loading, cleaning, feature engineering, model training, evaluation, and visualization) executes end-to-end without any additional input.
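Before running all cells, a quick pre-flight check can catch the two most likely setup problems: a missing dependency and a data path that does not resolve. The sketch below uses only the standard library. The helper names missing_modules and resolve_data_path are hypothetical, and it assumes the code runs with Notebook/ as the working directory, as the notebook kernel normally does.

```python
from importlib.util import find_spec
from pathlib import Path

# Import names for the packages this guide installs
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn"]


def missing_modules(names):
    """Return the subset of top-level modules that cannot be found."""
    return [name for name in names if find_spec(name) is None]


def resolve_data_path(notebook_dir, relative="../Data/MLBDATA.csv"):
    """Resolve the relative dataset path against the notebook's directory."""
    return (Path(notebook_dir) / relative).resolve()


# Assumption: the current working directory is Notebook/.
data_path = resolve_data_path(Path.cwd())
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
print(f"Dataset path: {data_path} (exists: {data_path.exists()})")
```

An empty list of missing modules and exists: True suggest the notebook should be able to load the CSV with the relative path.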
What Runs in the Notebook
The notebook executes a complete machine learning pipeline from raw data to evaluated models:
1. Data Loading & Inspection
The CSV is loaded into a pandas DataFrame (df.shape confirms 911 rows × 76 columns). Basic info() and head() calls verify column types and check for nulls across all 76 Statcast features.
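The loading and inspection step can be sketched as follows. To keep the example runnable without the dataset, it reads a three-row stand-in CSV from memory. The rows are invented, and the four columns are only a small subset of names mentioned in this guide. The isna().sum() call is an extra, explicit null count that may not appear in the notebook itself.

```python
import io

import pandas as pd

# Stand-in for Data/MLBDATA.csv: three invented rows with a few of the
# column names mentioned in this guide. The real file has 911 rows x 76 columns.
csv_text = """woba,bb_percent,k_percent,pa
0.310,8.5,22.1,450
0.365,11.2,18.4,602
,6.9,27.3,380
"""

# In the notebook this would be pd.read_csv(ruta).
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (rows, columns)
df.info()               # dtypes and non-null counts per column
print(df.head())        # first rows for a visual sanity check
print(df.isna().sum())  # explicit null count per column
```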
2. Exploratory Data Analysis
Seaborn and Matplotlib visualizations explore distributions of key metrics (wOBA, launch speed, bat speed, walk rate, strikeout rate) and their relationships — including a scatterplot comparing plate discipline scores against overall offensive performance.
3. Target Variable Engineering
Three classification targets are constructed directly from dataset columns:
- Rendimiento_labels: pd.qcut on woba into Low / Medium / High thirds
- elite_hitter: binary flag for batters at or above the 80th percentile of woba
- clase_disciplina_home: pd.cut on a composite disciplina_en_home score (walk rate × 0.30, takes × 0.20, plate appearances × 0.10, minus strikeout rate × 0.20 and whiff rate × 0.10) into Baja / Media / Alta (low / medium / high)
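The three targets can be sketched on synthetic data as shown below. Several details are assumptions rather than the notebook's confirmed logic: the components are min-max scaled before weighting (the guide gives only the weights), whiff rate is mapped to swing_miss_percent, and pd.cut uses three equal-width bins. The scaled helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names follow this guide where it gives
# them, and the values are random.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "woba": rng.normal(0.320, 0.030, n),
    "bb_percent": rng.uniform(3, 18, n),
    "k_percent": rng.uniform(10, 35, n),
    "swing_miss_percent": rng.uniform(15, 40, n),
    "takes": rng.integers(400, 1500, n),
    "pa": rng.integers(200, 700, n),
})

# Overall performance: equal-sized thirds of wOBA.
df["Rendimiento_labels"] = pd.qcut(df["woba"], q=3, labels=["Low", "Medium", "High"])

# Elite flag: at or above the 80th percentile of wOBA.
df["elite_hitter"] = (df["woba"] >= df["woba"].quantile(0.80)).astype(int)


def scaled(col):
    """Min-max scale a column to [0, 1] (an assumption, see the lead-in)."""
    return (col - col.min()) / (col.max() - col.min())


# Weighted composite: positive weights reward patience, negative ones penalize misses.
df["disciplina_en_home"] = (
    0.30 * scaled(df["bb_percent"])
    + 0.20 * scaled(df["takes"])
    + 0.10 * scaled(df["pa"])
    - 0.20 * scaled(df["k_percent"])
    - 0.10 * scaled(df["swing_miss_percent"])
)

# Assumption: three equal-width bins over the score's observed range.
df["clase_disciplina_home"] = pd.cut(
    df["disciplina_en_home"], bins=3, labels=["Baja", "Media", "Alta"]
)

for col in ["Rendimiento_labels", "elite_hitter", "clase_disciplina_home"]:
    print(df[col].value_counts().sort_index())
```

Note the difference between the two binning functions: pd.qcut yields classes of equal size, while pd.cut yields bins of equal width, so the discipline classes are usually unbalanced.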
4. Model Training
Each target gets its own RandomForestClassifier, trained on an 80/20 train/test split (stratified for the elite and discipline models):
- Overall Performance model: RandomForestClassifier(n_estimators=300, max_depth=8, random_state=42)
- Elite Status model: RandomForestClassifier(n_estimators=300, max_depth=8, class_weight='balanced', random_state=42)
- Plate Discipline model: RandomForestClassifier(n_estimators=300, max_depth=8, class_weight='balanced', random_state=42), trained on the discipline-specific feature subset (bb_percent, k_percent, swing_miss_percent, swings, whiffs, takes, pa)
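As an illustration, the Plate Discipline configuration can be reproduced on synthetic data as follows. The features and labels are random stand-ins, so the resulting accuracy is meaningless. The random_state passed to train_test_split is an assumption, since this guide only documents the model's seed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

DISCIPLINE_FEATURES = ["bb_percent", "k_percent", "swing_miss_percent",
                       "swings", "whiffs", "takes", "pa"]

# Synthetic stand-in data: random features and a random, imbalanced 3-class target.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame(rng.normal(size=(n, len(DISCIPLINE_FEATURES))),
                 columns=DISCIPLINE_FEATURES)
y = pd.Series(rng.choice(["Baja", "Media", "Alta"], size=n, p=[0.2, 0.6, 0.2]),
              name="clase_disciplina_home")

# 80/20 split; stratify keeps class proportions similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Hyperparameters as listed for the Plate Discipline model.
rf = RandomForestClassifier(
    n_estimators=300, max_depth=8, class_weight="balanced", random_state=42
)
rf.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print("Test accuracy:", rf.score(X_test, y_test))
```

Stratifying the split and setting class_weight='balanced' both address class imbalance: the first keeps rare classes represented in the test set, and the second weights them more heavily during training.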
5. Model Evaluation
Each model is evaluated with classification_report. Results are printed per class so that performance on minority classes (e.g., elite batters) is clearly visible.
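The sketch below shows what a per-class report looks like for an imbalanced binary target. It uses make_classification to generate a synthetic problem with roughly 20% positives as a stand-in for elite_hitter. The class names non-elite and elite are illustrative labels, not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (about 20% positives) standing in
# for the elite_hitter target.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=300, max_depth=8, class_weight="balanced", random_state=42
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Per-class precision, recall, and F1, plus support (test rows per class).
report = classification_report(y_test, y_pred, target_names=["non-elite", "elite"])
print(report)
```

On an imbalanced target, the recall on the minority row is usually more informative than overall accuracy, which a model can inflate by predicting the majority class.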
6. Feature Importance Visualization
rf.feature_importances_ is extracted from each trained model and plotted as a horizontal bar chart, revealing which Statcast metrics most strongly drive each classification — for example, woba and xwoba for overall performance, and bb_percent and whiffs for plate discipline.
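A horizontal importance chart of this kind can be sketched as follows. The data is synthetic and the feature names are borrowed from this guide purely as labels, so the ranking shown says nothing about real batters.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Feature names borrowed from this guide as labels; the data is synthetic.
feature_names = ["bb_percent", "k_percent", "swing_miss_percent",
                 "swings", "whiffs", "takes", "pa"]
X, y = make_classification(
    n_samples=300, n_features=len(feature_names), n_informative=4,
    n_redundant=0, random_state=42,
)

rf = RandomForestClassifier(n_estimators=300, max_depth=8, random_state=42)
rf.fit(X, y)

# Impurity-based importances sum to 1; sort ascending so the largest
# bar ends up at the top of the horizontal chart.
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values()

fig, ax = plt.subplots(figsize=(6, 4))
importances.plot.barh(ax=ax)
ax.set_xlabel("Importance")
ax.set_title("Feature importance")
fig.tight_layout()
```

These values are impurity-based importances. They describe how much the trained forest relied on each feature, which is not the same as a causal effect.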