What is MLE-Bench?

MLE-Bench is a comprehensive benchmark created by OpenAI to evaluate machine learning engineering capabilities of AI agents. It uses real Kaggle competition tasks to assess:
  • End-to-end ML workflows: Data loading, preprocessing, model training, and submission generation
  • Engineering skills: Code quality, experiment management, and reproducibility
  • Problem-solving: Feature engineering, model selection, and hyperparameter tuning
  • Submission validity: Generating correct output formats for competition platforms
MLE-Bench represents real-world ML engineering challenges by using actual Kaggle competitions, making it one of the most realistic benchmarks for evaluating AI agents on data science tasks.

RepoMaster Results

RepoMaster demonstrates strong performance on MLE-Bench, showing its capability to handle complex ML engineering workflows:

Valid Submissions

95.45%

Generated valid competition submissions

Medal Rate

27.27%

Achieved medal-winning performance

Gold Medals

22.73%

Reached gold medal tier

What These Results Mean

Valid Submissions (95.45%)

This metric measures whether RepoMaster can:
  • Generate correct output format: Submission files must match competition specifications exactly
  • Complete full pipeline: Successfully execute all steps from data loading to prediction
  • Handle errors gracefully: Recover from issues and produce valid results
  • Follow competition rules: Adhere to constraints like file formats, column names, etc.
A 95.45% valid submission rate demonstrates RepoMaster’s robustness and reliability in completing ML workflows autonomously.
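The format-compliance check above can be sketched as a small function that compares a submission against the competition's sample submission. The column names and data here are illustrative, not from any specific competition:

```python
def check_submission(submission_rows, sample_rows):
    """Check a submission against a sample submission: same header and
    same number of rows -- two common validity requirements."""
    if not submission_rows or submission_rows[0] != sample_rows[0]:
        return False, "header mismatch"
    if len(submission_rows) != len(sample_rows):
        return False, "row count mismatch"
    return True, "ok"

sample = [["id", "target"], ["1", "0"], ["2", "0"]]
good   = [["id", "target"], ["1", "0.7"], ["2", "0.3"]]
bad    = [["id", "pred"],   ["1", "0.7"], ["2", "0.3"]]

print(check_submission(good, sample))  # (True, 'ok')
print(check_submission(bad, sample))   # (False, 'header mismatch')
```

Real competitions add further constraints (value ranges, ID coverage), but header and row-count checks catch most format errors.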

Medal Rate (27.27%)

Kaggle competitions award medals based on leaderboard ranking. Exact thresholds depend on competition size, but the approximate bands are:
  • Gold: roughly top 10% of submissions
  • Silver: roughly top 10-20%
  • Bronze: roughly top 20-40%
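Mapping a leaderboard percentile to a medal tier, using the approximate bands above (real Kaggle thresholds also vary with the number of competing teams), looks like this:

```python
def medal_tier(percentile):
    """Map a leaderboard percentile (0 = best) to a medal tier using
    the approximate bands above. Actual Kaggle thresholds also depend
    on the number of competing teams."""
    if percentile <= 10:
        return "gold"
    if percentile <= 20:
        return "silver"
    if percentile <= 40:
        return "bronze"
    return None

print(medal_tier(5))   # gold
print(medal_tier(15))  # silver
print(medal_tier(35))  # bronze
print(medal_tier(60))  # None
```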
Achieving a 27.27% medal rate means more than 1 in 4 tasks reached competitive performance levels, demonstrating RepoMaster’s ability to:
  • Apply effective ML techniques
  • Engineer useful features
  • Select appropriate models
  • Tune hyperparameters reasonably

Gold Medal Rate (22.73%)

The 22.73% gold medal rate is particularly impressive, indicating that RepoMaster reached top 10% performance in nearly a quarter of competitions. This shows:
  • Sophisticated approaches: Not just basic models, but competitive ML strategies
  • Feature engineering: Creating meaningful features from raw data
  • Model optimization: Effective hyperparameter tuning and ensemble methods
  • Domain understanding: Adapting techniques to specific problem types
Context: Achieving gold medal performance on Kaggle typically requires significant ML expertise and iterative experimentation. RepoMaster achieving this autonomously in ~23% of tasks demonstrates advanced ML engineering capabilities.

MLE-Bench Task Categories

MLE-Bench covers diverse ML problem types:
Tabular Data

Tasks: Regression and classification on structured data

Challenges:
  • Missing value handling
  • Categorical encoding
  • Feature engineering
  • Model selection (trees, linear models, neural nets)
Example: Predicting house prices from property features
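The tabular workflow above (imputing missing values, encoding categoricals, fitting a tree-based model) can be sketched with scikit-learn. The column names and synthetic data are illustrative only:

```python
# Minimal tabular pipeline: impute missing numerics, one-hot encode
# categoricals, fit a gradient-boosted regressor. Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.normal(1500, 300, 200),
    "zone": rng.choice(["A", "B", "C"], 200),
    "price": rng.normal(300_000, 50_000, 200),
})
df.loc[::10, "sqft"] = np.nan  # inject missing values

pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["sqft"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])
model = Pipeline([("pre", pre),
                  ("gbm", GradientBoostingRegressor(random_state=0))])
model.fit(df[["sqft", "zone"]], df["price"])
preds = model.predict(df[["sqft", "zone"]])
print(preds.shape)  # (200,)
```

Bundling preprocessing and model in one `Pipeline` keeps train and test transforms consistent, which matters for submission validity.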
Computer Vision

Tasks: Image classification, object detection, segmentation

Challenges:
  • Data augmentation strategies
  • Transfer learning from pre-trained models
  • Architecture selection
  • Handling imbalanced classes
Example: Classifying plant diseases from leaf images
Natural Language Processing

Tasks: Text classification, sentiment analysis, NER

Challenges:
  • Text preprocessing and tokenization
  • Embedding selection
  • Model architecture (RNN, Transformer)
  • Handling long sequences
Example: Sentiment classification of product reviews
Time Series

Tasks: Forecasting, anomaly detection

Challenges:
  • Temporal feature engineering
  • Handling seasonality and trends
  • Model selection (ARIMA, LSTM, Prophet)
  • Cross-validation strategies
Example: Forecasting store sales
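The cross-validation point above is worth illustrating: time-series folds must respect temporal order, training on the past and validating on the future. scikit-learn's `TimeSeriesSplit` implements this expanding-window scheme:

```python
# Time-series cross-validation: each fold trains on earlier
# observations and validates on strictly later ones.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 chronologically ordered observations
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, val_idx in folds:
    # Every validation index comes strictly after every training index.
    print(train_idx.tolist(), "->", val_idx.tolist())
```

A standard shuffled k-fold would leak future information into training, inflating validation scores.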

How RepoMaster Approaches MLE-Bench

RepoMaster’s success on MLE-Bench comes from its systematic approach:

1. Repository Discovery

For each task, RepoMaster:
  1. Analyzes requirements: Understands problem type, data format, and objectives
  2. Searches GitHub: Finds relevant ML repositories and competition solutions
  3. Evaluates quality: Assesses repository relevance and code quality
  4. Selects tools: Chooses appropriate libraries and frameworks
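The "evaluates quality" step can be sketched as ranking candidate repositories by a simple relevance-and-quality score. The weights and repository records below are illustrative, not RepoMaster's actual heuristics:

```python
# Rank candidate repositories by keyword relevance plus a popularity
# signal. Scoring weights here are illustrative assumptions.
def score_repo(repo, keywords):
    keyword_hits = sum(kw in repo["description"].lower() for kw in keywords)
    return keyword_hits * 10 + repo["stars"] / 1000

candidates = [
    {"name": "tabular-baseline", "stars": 1200,
     "description": "Gradient boosting baseline for tabular Kaggle competitions"},
    {"name": "misc-scripts", "stars": 5000,
     "description": "Assorted utility scripts"},
]
keywords = ["tabular", "kaggle", "gradient boosting"]
best = max(candidates, key=lambda r: score_repo(r, keywords))
print(best["name"])  # tabular-baseline
```

Note that raw popularity alone would pick the wrong repository here; relevance to the task dominates.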

2. Code Understanding

RepoMaster’s hierarchical analysis:
  • README comprehension: Understands repository purpose and usage
  • Code structure: Identifies key modules and functions
  • Dependency mapping: Understands relationships between components
  • Adaptation planning: Determines how to modify code for current task
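A minimal building block for the "code structure" step is static analysis of a source file. Python's `ast` module can list the top-level classes and functions a file defines; RepoMaster's actual analysis is hierarchical and repo-wide, but this shows the per-file idea:

```python
# Extract top-level class and function definitions from Python source
# without executing it, using the standard-library ast module.
import ast

source = """
class Trainer:
    def fit(self, X, y): ...

def load_data(path): ...
"""

tree = ast.parse(source)
symbols = [(type(node).__name__, node.name)
           for node in tree.body
           if isinstance(node, (ast.ClassDef, ast.FunctionDef))]
print(symbols)  # [('ClassDef', 'Trainer'), ('FunctionDef', 'load_data')]
```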

3. Pipeline Execution

Systematic execution process:
# 1. Environment Setup
setup_virtual_environment()
install_dependencies()

# 2. Data Loading
load_and_validate_data(train_path, test_path)

# 3. Preprocessing
preprocess_features()
handle_missing_values()
encode_categorical_features()

# 4. Model Training
train_model(model_config)
validate_performance(validation_split)

# 5. Prediction Generation
generate_predictions(test_data)
format_submission(output_path)

4. Error Handling

Robust error recovery:
  • Dependency issues: Automatic version resolution and alternative package selection
  • Data format errors: Flexible parsing and validation
  • Memory constraints: Batch processing and optimization
  • Runtime errors: Graceful fallbacks and alternative approaches
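The fallback pattern above can be sketched as trying approaches in order and returning the first that succeeds. The approach names are illustrative:

```python
# Graceful fallback: attempt approaches in priority order, collecting
# errors, and return the first success.
def run_with_fallbacks(approaches):
    errors = []
    for name, fn in approaches:
        try:
            return name, fn()
        except Exception as exc:  # deliberate catch-all for recovery
            errors.append((name, exc))
    raise RuntimeError(f"all approaches failed: {errors}")

def gpu_training():
    raise MemoryError("out of GPU memory")

def batched_cpu_training():
    return "model trained in batches"

name, result = run_with_fallbacks([
    ("gpu", gpu_training),
    ("cpu-batched", batched_cpu_training),
])
print(name, "->", result)  # cpu-batched -> model trained in batches
```

Recording why each earlier approach failed also helps the agent diagnose whether a fallback was a memory issue, a dependency issue, or a data issue.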

Performance Breakdown

Submission Validity

Total Tasks:        44
Valid Submissions:  42 (95.45%)
Invalid:            2 (4.55%)

Invalid Reasons:
- Format errors:    1
- Runtime errors:   1

Medal Distribution

Total Medals:       12 (27.27%)
├─ Gold:           10 (22.73%)
├─ Silver:          1 (2.27%)
└─ Bronze:          1 (2.27%)

No Medal:          30 (68.18%)
Invalid:            2 (4.55%)

High Gold-to-Medal Ratio: RepoMaster’s 83.3% gold medal rate among medaling submissions (10 out of 12) indicates that when it achieves competitive performance, it typically reaches top-tier results.

Comparison with Baselines

While specific baseline comparisons are limited due to MLE-Bench’s recent release, RepoMaster’s results are notable:
| Metric | RepoMaster | Typical Human (estimate) |
| --- | --- | --- |
| Valid Submissions | 95.45% | ~98% |
| Medal Rate | 27.27% | 15-25% (novice) |
| Gold Medal Rate | 22.73% | 5-10% (novice) |
Human performance varies widely based on ML experience. RepoMaster’s gold medal rate of 22.73% is comparable to intermediate ML practitioners and significantly exceeds typical novice performance.

Running MLE-Bench with RepoMaster

You can evaluate RepoMaster on MLE-Bench tasks:

Step 1: Install MLE-Bench

git clone https://github.com/openai/mle-bench.git
cd mle-bench
pip install -e .

Step 2: Configure RepoMaster

Create a configuration file for MLE-Bench tasks:
repo:
  type: local
  path: /path/to/mle-bench/competitions/competition_name

task_description: |
  Complete this Kaggle competition task:
  - Load training and test data
  - Build and train ML model
  - Generate submission file

input_data:
  - path: /path/to/mle-bench/data/competition_name/train.csv
    description: Training dataset
  - path: /path/to/mle-bench/data/competition_name/test.csv
    description: Test dataset for predictions

parameters:
  max_turns: 30
  use_venv: true
  output_format: submission.csv

Step 3: Run Evaluation

python -m src.core.git_task \
  --config configs/mle_bench_task.yaml \
  --retry 2

Step 4: Validate Results

from mle_bench.validation import validate_submission

# Validate submission format and score
result = validate_submission(
    submission_path='output/submission.csv',
    competition_id='competition_name'
)

print(f"Valid: {result.valid}")
print(f"Score: {result.score}")
print(f"Medal: {result.medal_tier}")

Key Insights

1. Repository Leverage is Powerful

RepoMaster’s ability to discover and adapt existing ML code significantly accelerates development:
  • No reinventing: Leverages proven techniques from GitHub
  • Best practices: Uses well-tested implementations
  • Faster iteration: Adapts existing code rather than writing from scratch

2. Systematic Approach Matters

The strong valid submission rate (95.45%) demonstrates the value of systematic execution:
  • Consistent quality: Reliable end-to-end pipeline execution
  • Error handling: Graceful recovery from common issues
  • Format compliance: Careful attention to submission requirements

3. Competitive ML Requires Sophistication

Gold medal performance shows advanced capabilities:
  • Feature engineering: Creating meaningful predictive features
  • Model selection: Choosing appropriate algorithms
  • Hyperparameter tuning: Optimizing model performance
  • Ensemble methods: Combining multiple models effectively
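The simplest form of the ensembling mentioned above is averaging the predictions of several models; weighted averaging and stacking build on the same idea. The prediction values here are made up for illustration:

```python
# Average the per-sample predictions of three models into one ensemble.
import numpy as np

preds_a = np.array([0.2, 0.8, 0.6])
preds_b = np.array([0.4, 0.6, 0.8])
preds_c = np.array([0.3, 0.7, 0.7])

ensemble = np.mean([preds_a, preds_b, preds_c], axis=0)
print(ensemble)  # [0.3 0.7 0.7]
```

Averaging reduces variance when the component models make uncorrelated errors, which is why even a plain mean often beats any single model on a leaderboard.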

4. Room for Improvement

The 27.27% medal rate, while strong, indicates opportunities:
  • Iterative refinement: Multiple experiment rounds could improve results
  • Domain knowledge: Task-specific expertise could enhance performance
  • Ensemble sophistication: More advanced combination strategies
  • Longer exploration: Extended reasoning could find better solutions

Limitations

Current limitations on MLE-Bench:
  • Computational constraints: Time and resource limits affect model training depth
  • Single-shot attempts: No iterative refinement like human competitors
  • Limited domain knowledge: Generic approach vs. competition-specific insights
  • Ensemble complexity: Basic ensembles vs. sophisticated stacking/blending

Future Directions

Planned improvements for MLE-Bench performance:
  1. Multi-round experimentation: Iterative model refinement based on validation results
  2. Advanced ensembles: More sophisticated model combination strategies
  3. AutoML integration: Leveraging AutoML frameworks for hyperparameter optimization
  4. Domain-specific strategies: Competition-type-specific approach selection
  5. Feature engineering automation: Enhanced automated feature creation

Acknowledgments

We thank OpenAI for creating and open-sourcing MLE-Bench, which provides an excellent benchmark for evaluating ML engineering capabilities of AI agents.

Learn More

MLE-Bench Repository

Explore OpenAI’s ML engineering benchmark

GitTaskBench Results

View repository-level task performance

Performance Analysis

Detailed performance metrics and analysis

Research Paper

Read the full NeurIPS 2025 paper
