What is MLE-Bench?

MLE-Bench is a comprehensive benchmark created by OpenAI to evaluate machine learning engineering capabilities of AI agents. It uses real Kaggle competition tasks to assess:
  • End-to-end ML workflows: Data loading, preprocessing, model training, and submission generation
  • Engineering skills: Code quality, experiment management, and reproducibility
  • Problem-solving: Feature engineering, model selection, and hyperparameter tuning
  • Submission validity: Generating correct output formats for competition platforms
MLE-Bench represents real-world ML engineering challenges by using actual Kaggle competitions, making it one of the most realistic benchmarks for evaluating AI agents on data science tasks.

RepoMaster Results

RepoMaster demonstrates strong performance on MLE-Bench, showing its capability to handle complex ML engineering workflows:

Valid Submissions

95.45%

Generated valid competition submissions

Medal Rate

27.27%

Achieved medal-winning performance

Gold Medals

22.73%

Reached gold medal tier

What These Results Mean

Valid Submissions (95.45%)

This metric measures whether RepoMaster can:
  • Generate correct output format: Submission files must match competition specifications exactly
  • Complete full pipeline: Successfully execute all steps from data loading to prediction
  • Handle errors gracefully: Recover from issues and produce valid results
  • Follow competition rules: Adhere to constraints like file formats, column names, etc.
A 95.45% valid submission rate demonstrates RepoMaster’s robustness and reliability in completing ML workflows autonomously.
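The format-compliance check above can be sketched as a small function that compares a submission against the competition's sample submission. The column names and data here are illustrative, not from any specific competition:

```python
def check_submission(submission_rows, sample_rows):
    """Check a submission against a sample submission: same header and
    same number of rows -- two common validity requirements."""
    if not submission_rows or submission_rows[0] != sample_rows[0]:
        return False, "header mismatch"
    if len(submission_rows) != len(sample_rows):
        return False, "row count mismatch"
    return True, "ok"

sample = [["id", "target"], ["1", "0"], ["2", "0"]]
good   = [["id", "target"], ["1", "0.7"], ["2", "0.3"]]
bad    = [["id", "pred"],   ["1", "0.7"], ["2", "0.3"]]

print(check_submission(good, sample))  # (True, 'ok')
print(check_submission(bad, sample))   # (False, 'header mismatch')
```

Real competitions add further constraints (value ranges, ID coverage), but header and row-count checks catch most format errors.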

Medal Rate (27.27%)

Kaggle competitions award medals based on leaderboard ranking. Exact thresholds depend on competition size, but the approximate bands are:
  • Gold: roughly top 10% of submissions
  • Silver: roughly top 10-20%
  • Bronze: roughly top 20-40%
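Mapping a leaderboard percentile to a medal tier, using the approximate bands above (real Kaggle thresholds also vary with the number of competing teams), looks like this:

```python
def medal_tier(percentile):
    """Map a leaderboard percentile (0 = best) to a medal tier using
    the approximate bands above. Actual Kaggle thresholds also depend
    on the number of competing teams."""
    if percentile <= 10:
        return "gold"
    if percentile <= 20:
        return "silver"
    if percentile <= 40:
        return "bronze"
    return None

print(medal_tier(5))   # gold
print(medal_tier(15))  # silver
print(medal_tier(35))  # bronze
print(medal_tier(60))  # None
```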
Achieving a 27.27% medal rate means more than 1 in 4 tasks reached competitive performance levels, demonstrating RepoMaster’s ability to:
  • Apply effective ML techniques
  • Engineer useful features
  • Select appropriate models
  • Tune hyperparameters reasonably

Gold Medal Rate (22.73%)

The 22.73% gold medal rate is particularly impressive, indicating that RepoMaster reached top 10% performance in nearly a quarter of competitions. This shows:
  • Sophisticated approaches: Not just basic models, but competitive ML strategies
  • Feature engineering: Creating meaningful features from raw data
  • Model optimization: Effective hyperparameter tuning and ensemble methods
  • Domain understanding: Adapting techniques to specific problem types
Context: Achieving gold medal performance on Kaggle typically requires significant ML expertise and iterative experimentation. RepoMaster achieving this autonomously in ~23% of tasks demonstrates advanced ML engineering capabilities.

MLE-Bench Task Categories

MLE-Bench covers diverse ML problem types:
Tabular Data

Tasks: Regression and classification on structured data

Challenges:
  • Missing value handling
  • Categorical encoding
  • Feature engineering
  • Model selection (trees, linear models, neural nets)
Example: Predicting house prices from property features
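The tabular workflow above (imputing missing values, encoding categoricals, fitting a tree-based model) can be sketched with scikit-learn. The column names and synthetic data are illustrative only:

```python
# Minimal tabular pipeline: impute missing numerics, one-hot encode
# categoricals, fit a gradient-boosted regressor. Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.normal(1500, 300, 200),
    "zone": rng.choice(["A", "B", "C"], 200),
    "price": rng.normal(300_000, 50_000, 200),
})
df.loc[::10, "sqft"] = np.nan  # inject missing values

pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["sqft"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])
model = Pipeline([("pre", pre),
                  ("gbm", GradientBoostingRegressor(random_state=0))])
model.fit(df[["sqft", "zone"]], df["price"])
preds = model.predict(df[["sqft", "zone"]])
print(preds.shape)  # (200,)
```

Bundling preprocessing and model in one `Pipeline` keeps train and test transforms consistent, which matters for submission validity.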
Computer Vision

Tasks: Image classification, object detection, segmentation

Challenges:
  • Data augmentation strategies
  • Transfer learning from pre-trained models
  • Architecture selection
  • Handling imbalanced classes
Example: Classifying plant diseases from leaf images
Natural Language Processing

Tasks: Text classification, sentiment analysis, NER

Challenges:
  • Text preprocessing and tokenization
  • Embedding selection
  • Model architecture (RNN, Transformer)
  • Handling long sequences
Example: Sentiment classification of product reviews
Time Series

Tasks: Forecasting, anomaly detection

Challenges:
  • Temporal feature engineering
  • Handling seasonality and trends
  • Model selection (ARIMA, LSTM, Prophet)
  • Cross-validation strategies
Example: Forecasting store sales
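The cross-validation point above is worth illustrating: time-series folds must respect temporal order, training on the past and validating on the future. scikit-learn's `TimeSeriesSplit` implements this expanding-window scheme:

```python
# Time-series cross-validation: each fold trains on earlier
# observations and validates on strictly later ones.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 chronologically ordered observations
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, val_idx in folds:
    # Every validation index comes strictly after every training index.
    print(train_idx.tolist(), "->", val_idx.tolist())
```

A standard shuffled k-fold would leak future information into training, inflating validation scores.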

How RepoMaster Approaches MLE-Bench

RepoMaster’s success on MLE-Bench comes from its systematic approach:

1. Repository Discovery

For each task, RepoMaster:
  1. Analyzes requirements: Understands problem type, data format, and objectives
  2. Searches GitHub: Finds relevant ML repositories and competition solutions
  3. Evaluates quality: Assesses repository relevance and code quality
  4. Selects tools: Chooses appropriate libraries and frameworks
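The "evaluates quality" step can be sketched as ranking candidate repositories by a simple relevance-and-quality score. The weights and repository records below are illustrative, not RepoMaster's actual heuristics:

```python
# Rank candidate repositories by keyword relevance plus a popularity
# signal. Scoring weights here are illustrative assumptions.
def score_repo(repo, keywords):
    keyword_hits = sum(kw in repo["description"].lower() for kw in keywords)
    return keyword_hits * 10 + repo["stars"] / 1000

candidates = [
    {"name": "tabular-baseline", "stars": 1200,
     "description": "Gradient boosting baseline for tabular Kaggle competitions"},
    {"name": "misc-scripts", "stars": 5000,
     "description": "Assorted utility scripts"},
]
keywords = ["tabular", "kaggle", "gradient boosting"]
best = max(candidates, key=lambda r: score_repo(r, keywords))
print(best["name"])  # tabular-baseline
```

Note that raw popularity alone would pick the wrong repository here; relevance to the task dominates.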

2. Code Understanding

RepoMaster’s hierarchical analysis:
  • README comprehension: Understands repository purpose and usage
  • Code structure: Identifies key modules and functions
  • Dependency mapping: Understands relationships between components
  • Adaptation planning: Determines how to modify code for current task
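A minimal building block for the "code structure" step is static analysis of a source file. Python's `ast` module can list the top-level classes and functions a file defines; RepoMaster's actual analysis is hierarchical and repo-wide, but this shows the per-file idea:

```python
# Extract top-level class and function definitions from Python source
# without executing it, using the standard-library ast module.
import ast

source = """
class Trainer:
    def fit(self, X, y): ...

def load_data(path): ...
"""

tree = ast.parse(source)
symbols = [(type(node).__name__, node.name)
           for node in tree.body
           if isinstance(node, (ast.ClassDef, ast.FunctionDef))]
print(symbols)  # [('ClassDef', 'Trainer'), ('FunctionDef', 'load_data')]
```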

3. Pipeline Execution

Systematic execution process:
# 1. Environment Setup
setup_virtual_environment()
install_dependencies()

# 2. Data Loading
load_and_validate_data(train_path, test_path)

# 3. Preprocessing
preprocess_features()
handle_missing_values()
encode_categorical_features()

# 4. Model Training
train_model(model_config)
validate_performance(validation_split)

# 5. Prediction Generation
generate_predictions(test_data)
format_submission(output_path)

4. Error Handling

Robust error recovery:
  • Dependency issues: Automatic version resolution and alternative package selection
  • Data format errors: Flexible parsing and validation
  • Memory constraints: Batch processing and optimization
  • Runtime errors: Graceful fallbacks and alternative approaches
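The fallback pattern above can be sketched as trying approaches in order and returning the first that succeeds. The approach names are illustrative:

```python
# Graceful fallback: attempt approaches in priority order, collecting
# errors, and return the first success.
def run_with_fallbacks(approaches):
    errors = []
    for name, fn in approaches:
        try:
            return name, fn()
        except Exception as exc:  # deliberate catch-all for recovery
            errors.append((name, exc))
    raise RuntimeError(f"all approaches failed: {errors}")

def gpu_training():
    raise MemoryError("out of GPU memory")

def batched_cpu_training():
    return "model trained in batches"

name, result = run_with_fallbacks([
    ("gpu", gpu_training),
    ("cpu-batched", batched_cpu_training),
])
print(name, "->", result)  # cpu-batched -> model trained in batches
```

Recording why each earlier approach failed also helps the agent diagnose whether a fallback was a memory issue, a dependency issue, or a data issue.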

Performance Breakdown

Submission Validity

Total Tasks:        44
Valid Submissions:  42 (95.45%)
Invalid:            2 (4.55%)

Invalid Reasons:
- Format errors:    1
- Runtime errors:   1

Medal Distribution

Total Medals:       12 (27.27%)
├─ Gold:           10 (22.73%)
├─ Silver:          1 (2.27%)
└─ Bronze:          1 (2.27%)

No Medal:          30 (68.18%)
Invalid:            2 (4.55%)

High Gold-to-Medal Ratio: RepoMaster’s 83.3% gold medal rate among medaling submissions (10 out of 12) indicates that when it achieves competitive performance, it typically reaches top-tier results.

Comparison with Baselines

While specific baseline comparisons are limited due to MLE-Bench’s recent release, RepoMaster’s results are notable:
| Metric | RepoMaster | Typical Human (estimate) |
| --- | --- | --- |
| Valid Submissions | 95.45% | ~98% |
| Medal Rate | 27.27% | 15-25% (novice) |
| Gold Medal Rate | 22.73% | 5-10% (novice) |
Human performance varies widely based on ML experience. RepoMaster’s gold medal rate of 22.73% is comparable to intermediate ML practitioners and significantly exceeds typical novice performance.

Running MLE-Bench with RepoMaster

You can evaluate RepoMaster on MLE-Bench tasks:

Step 1: Install MLE-Bench

git clone https://github.com/openai/mle-bench.git
cd mle-bench
pip install -e .

Step 2: Configure RepoMaster

Create a configuration file for MLE-Bench tasks:
repo:
  type: local
  path: /path/to/mle-bench/competitions/competition_name

task_description: |
  Complete this Kaggle competition task:
  - Load training and test data
  - Build and train ML model
  - Generate submission file

input_data:
  - path: /path/to/mle-bench/data/competition_name/train.csv
    description: Training dataset
  - path: /path/to/mle-bench/data/competition_name/test.csv
    description: Test dataset for predictions

parameters:
  max_turns: 30
  use_venv: true
  output_format: submission.csv

Step 3: Run Evaluation

python -m src.core.git_task \
  --config configs/mle_bench_task.yaml \
  --retry 2

Step 4: Validate Results

from mle_bench.validation import validate_submission

# Validate submission format and score
result = validate_submission(
    submission_path='output/submission.csv',
    competition_id='competition_name'
)

print(f"Valid: {result.valid}")
print(f"Score: {result.score}")
print(f"Medal: {result.medal_tier}")

Key Insights

1. Repository Leverage is Powerful

RepoMaster’s ability to discover and adapt existing ML code significantly accelerates development:
  • No reinventing: Leverages proven techniques from GitHub
  • Best practices: Uses well-tested implementations
  • Faster iteration: Adapts existing code rather than writing from scratch

2. Systematic Approach Matters

The strong valid submission rate (95.45%) demonstrates the value of systematic execution:
  • Consistent quality: Reliable end-to-end pipeline execution
  • Error handling: Graceful recovery from common issues
  • Format compliance: Careful attention to submission requirements

3. Competitive ML Requires Sophistication

Gold medal performance shows advanced capabilities:
  • Feature engineering: Creating meaningful predictive features
  • Model selection: Choosing appropriate algorithms
  • Hyperparameter tuning: Optimizing model performance
  • Ensemble methods: Combining multiple models effectively
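The simplest form of the ensembling mentioned above is averaging the predictions of several models; weighted averaging and stacking build on the same idea. The prediction values here are made up for illustration:

```python
# Average the per-sample predictions of three models into one ensemble.
import numpy as np

preds_a = np.array([0.2, 0.8, 0.6])
preds_b = np.array([0.4, 0.6, 0.8])
preds_c = np.array([0.3, 0.7, 0.7])

ensemble = np.mean([preds_a, preds_b, preds_c], axis=0)
print(ensemble)  # [0.3 0.7 0.7]
```

Averaging reduces variance when the component models make uncorrelated errors, which is why even a plain mean often beats any single model on a leaderboard.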

4. Room for Improvement

The 27.27% medal rate, while strong, indicates opportunities:
  • Iterative refinement: Multiple experiment rounds could improve results
  • Domain knowledge: Task-specific expertise could enhance performance
  • Ensemble sophistication: More advanced combination strategies
  • Longer exploration: Extended reasoning could find better solutions

Limitations

Current limitations on MLE-Bench:
  • Computational constraints: Time and resource limits affect model training depth
  • Single-shot attempts: No iterative refinement like human competitors
  • Limited domain knowledge: Generic approach vs. competition-specific insights
  • Ensemble complexity: Basic ensembles vs. sophisticated stacking/blending

Future Directions

Planned improvements for MLE-Bench performance:
  1. Multi-round experimentation: Iterative model refinement based on validation results
  2. Advanced ensembles: More sophisticated model combination strategies
  3. AutoML integration: Leveraging AutoML frameworks for hyperparameter optimization
  4. Domain-specific strategies: Competition-type-specific approach selection
  5. Feature engineering automation: Enhanced automated feature creation

Acknowledgments

We thank OpenAI for creating and open-sourcing MLE-Bench, which provides an excellent benchmark for evaluating ML engineering capabilities of AI agents.

Learn More

MLE-Bench Repository

Explore OpenAI’s ML engineering benchmark

GitTaskBench Results

View repository-level task performance

Performance Analysis

Detailed performance metrics and analysis

Research Paper

Read the full NeurIPS 2025 paper
