What is MLE-Bench?
MLE-Bench is a comprehensive benchmark created by OpenAI to evaluate the machine learning engineering capabilities of AI agents. It uses real Kaggle competition tasks to assess:
- End-to-end ML workflows: Data loading, preprocessing, model training, and submission generation
- Engineering skills: Code quality, experiment management, and reproducibility
- Problem-solving: Feature engineering, model selection, and hyperparameter tuning
- Submission validity: Generating correct output formats for competition platforms
MLE-Bench represents real-world ML engineering challenges by using actual Kaggle competitions, making it one of the most realistic benchmarks for evaluating AI agents on data science tasks.
RepoMaster Results
RepoMaster demonstrates strong performance on MLE-Bench, showing its capability to handle complex ML engineering workflows:

| Metric | Result | Meaning |
|---|---|---|
| Valid Submissions | 95.45% | Generated valid competition submissions |
| Medal Rate | 27.27% | Achieved medal-winning performance |
| Gold Medals | 22.73% | Reached gold medal tier |

What These Results Mean
Valid Submissions (95.45%)
This metric measures whether RepoMaster can:
- Generate correct output format: Submission files must match competition specifications exactly
- Complete full pipeline: Successfully execute all steps from data loading to prediction
- Handle errors gracefully: Recover from issues and produce valid results
- Follow competition rules: Adhere to constraints like file formats, column names, etc.
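As a concrete illustration of these format checks, a minimal validator might look like the sketch below. The column names are hypothetical; each competition defines its own via its sample submission file.

```python
import csv
import io

def validate_submission(csv_text: str, required_columns: list) -> tuple:
    """Check a submission's header and rows against the expected format."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != required_columns:
        return False, f"expected columns {required_columns}, got {header}"
    rows = list(reader)
    if not rows:
        return False, "no data rows"
    if any(len(row) != len(required_columns) or "" in row for row in rows):
        return False, "malformed row or empty field"
    return True, "ok"

# Hypothetical two-column submission format.
ok, message = validate_submission("id,prediction\n1,0.5\n2,0.7\n", ["id", "prediction"])
print(ok)  # True
```

In practice the required columns would be read from the competition's sample submission rather than hard-coded.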
Medal Rate (27.27%)
Kaggle competitions award medals based on leaderboard ranking:
- Gold: Top 10% of submissions
- Silver: Top 10-20%
- Bronze: Top 20-40%
A 27.27% medal rate means that in over a quarter of competitions, RepoMaster was able to:
- Apply effective ML techniques
- Engineer useful features
- Select appropriate models
- Tune hyperparameters reasonably
Gold Medal Rate (22.73%)
The 22.73% gold medal rate is particularly impressive, indicating that RepoMaster reached top-10% performance in nearly a quarter of competitions. This shows:
- Sophisticated approaches: Not just basic models, but competitive ML strategies
- Feature engineering: Creating meaningful features from raw data
- Model optimization: Effective hyperparameter tuning and ensemble methods
- Domain understanding: Adapting techniques to specific problem types
Context: Achieving gold medal performance on Kaggle typically requires significant ML expertise and iterative experimentation. RepoMaster achieving this autonomously in ~23% of tasks demonstrates advanced ML engineering capabilities.
MLE-Bench Task Categories
MLE-Bench covers diverse ML problem types:
Tabular Data
Tasks: Regression and classification on structured data
Challenges:
- Missing value handling
- Categorical encoding
- Feature engineering
- Model selection (trees, linear models, neural nets)
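Two of these steps, median imputation and one-hot encoding, can be sketched with the standard library alone (real pipelines would typically use pandas or scikit-learn):

```python
import statistics

def impute_numeric(values):
    """Replace missing values (None) with the median of observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def one_hot(values):
    """Encode a categorical column as one binary column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

print(impute_numeric([34, None, 28, 40]))  # [34, 34, 28, 40]
print(one_hot(["a", "b", "a"]))            # {'a': [1, 0, 1], 'b': [0, 1, 0]}
```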
Computer Vision
Tasks: Image classification, object detection, segmentation
Challenges:
- Data augmentation strategies
- Transfer learning from pre-trained models
- Architecture selection
- Handling imbalanced classes
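For the class-imbalance challenge, one common baseline is inverse-frequency class weights, which upweight the loss contribution of rare classes. A minimal sketch:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

# 8 "cat" vs 2 "dog" examples: the rare class gets a 4x larger weight.
weights = class_weights(["cat"] * 8 + ["dog"] * 2)
print(weights)  # {'cat': 0.625, 'dog': 2.5}
```

These weights would typically be passed to the loss function (e.g., a weighted cross-entropy) during training.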
Natural Language Processing
Tasks: Text classification, sentiment analysis, NER
Challenges:
- Text preprocessing and tokenization
- Embedding selection
- Model architecture (RNN, Transformer)
- Handling long sequences
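The preprocessing and tokenization step can be illustrated with a minimal regex tokenizer (real pipelines would use a library tokenizer matched to the model, e.g. a subword tokenizer for Transformers):

```python
import re

def tokenize(text: str, lowercase: bool = True) -> list:
    """Split text into lowercase word tokens, keeping simple contractions."""
    if lowercase:
        text = text.lower()
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)

print(tokenize("Don't stop - it's 2024!"))  # ["don't", 'stop', "it's", '2024']
```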
Time Series
Tasks: Forecasting, anomaly detection
Challenges:
- Temporal feature engineering
- Handling seasonality and trends
- Model selection (ARIMA, LSTM, Prophet)
- Cross-validation strategies
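Temporal feature engineering often starts with lag features, where each observation is described by the values that preceded it. A minimal sketch:

```python
def lag_features(series, lags):
    """Build one feature row per time step from lagged values.

    The first max(lags) steps lack enough history and are dropped.
    """
    start = max(lags)
    return [[series[t - lag] for lag in lags] for t in range(start, len(series))]

# Each row holds the values 1 and 2 steps before the current point.
print(lag_features([1, 2, 3, 4, 5], lags=[1, 2]))  # [[2, 1], [3, 2], [4, 3]]
```

The same windowing idea underlies the cross-validation challenge: splits must respect time order so the model never trains on the future.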
How RepoMaster Approaches MLE-Bench
RepoMaster’s success on MLE-Bench comes from its systematic approach:
1. Repository Discovery
For each task, RepoMaster:
- Analyzes requirements: Understands problem type, data format, and objectives
- Searches GitHub: Finds relevant ML repositories and competition solutions
- Evaluates quality: Assesses repository relevance and code quality
- Selects tools: Chooses appropriate libraries and frameworks
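The quality evaluation could be any ranking heuristic; the scoring function below is purely illustrative (popularity, freshness, and keyword relevance), not RepoMaster's actual criteria:

```python
import math

def score_repo(stars: int, days_since_commit: int, keyword_hits: int) -> float:
    """Hypothetical relevance score: log-scaled popularity, decayed by staleness,
    plus a bonus per task-keyword match in the README."""
    freshness = 1.0 / (1.0 + days_since_commit / 365)
    return math.log1p(stars) * freshness + 2.0 * keyword_hits

# A popular, recently updated, on-topic repo outranks a stale, unrelated one.
print(score_repo(1000, 30, 3) > score_repo(10, 900, 0))  # True
```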
2. Code Understanding
RepoMaster’s hierarchical analysis:
- README comprehension: Understands repository purpose and usage
- Code structure: Identifies key modules and functions
- Dependency mapping: Understands relationships between components
- Adaptation planning: Determines how to modify code for current task
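Identifying key modules and functions can be sketched with Python's `ast` module; this is an illustration of static code-structure analysis, not RepoMaster's actual implementation:

```python
import ast

def summarize_module(source: str) -> dict:
    """List the top-level functions and classes defined in a Python source file."""
    tree = ast.parse(source)
    return {
        "functions": [node.name for node in tree.body if isinstance(node, ast.FunctionDef)],
        "classes": [node.name for node in tree.body if isinstance(node, ast.ClassDef)],
    }

source = "def train():\n    pass\n\nclass Model:\n    pass\n"
print(summarize_module(source))  # {'functions': ['train'], 'classes': ['Model']}
```

Running this over every file in a repository yields a coarse module map that an agent can use to plan its adaptation.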
3. Pipeline Execution
RepoMaster executes the adapted pipeline systematically, from environment setup through data loading, preprocessing, model training, and submission generation.
4. Error Handling
Robust error recovery:
- Dependency issues: Automatic version resolution and alternative package selection
- Data format errors: Flexible parsing and validation
- Memory constraints: Batch processing and optimization
- Runtime errors: Graceful fallbacks and alternative approaches
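The graceful-fallback pattern behind these recovery steps can be sketched as trying strategies in order until one succeeds (illustrative only, not RepoMaster's code):

```python
def run_with_fallbacks(strategies):
    """Try each (name, callable) strategy in order; return the first success.

    Raises only if every strategy fails, collecting the errors for diagnosis.
    """
    errors = []
    for name, fn in strategies:
        try:
            return name, fn()
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all strategies failed: {errors}")

# The first strategy raises; execution falls back to the second.
result = run_with_fallbacks([("gpu_training", lambda: 1 / 0), ("cpu_training", lambda: 42)])
print(result)  # ('cpu_training', 42)
```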
Performance Breakdown
Submission Validity
Medal Distribution
High Gold-to-Medal Ratio: RepoMaster’s 83.3% gold medal rate among medaling submissions (10 out of 12) indicates that when it achieves competitive performance, it typically reaches top-tier results.
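The stated figures are mutually consistent: 12 medals at a 27.27% medal rate implies roughly 44 competitions (a derived number, not stated in the source), and the gold counts line up:

```python
# Derived: 12 / 0.2727 is approximately 44 competitions.
total_competitions = 44
medals, golds = 12, 10

medal_rate = round(100 * medals / total_competitions, 2)   # 27.27
gold_rate = round(100 * golds / total_competitions, 2)     # 22.73
gold_share_of_medals = round(100 * golds / medals, 1)      # 83.3
print(medal_rate, gold_rate, gold_share_of_medals)
```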
Comparison with Baselines
While specific baseline comparisons are limited due to MLE-Bench’s recent release, RepoMaster’s results are notable:

| Metric | RepoMaster | Typical Human (estimate) |
|---|---|---|
| Valid Submissions | 95.45% | ~98% |
| Medal Rate | 27.27% | 15-25% (novice) |
| Gold Medal Rate | 22.73% | 5-10% (novice) |
Human performance varies widely based on ML experience. RepoMaster’s gold medal rate of 22.73% is comparable to intermediate ML practitioners and significantly exceeds typical novice performance.
Running MLE-Bench with RepoMaster
You can evaluate RepoMaster on MLE-Bench tasks:
Step 1: Install MLE-Bench
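The commands below are a sketch based on the public MLE-Bench repository; consult its README for the authoritative steps (competition data is distributed via Git LFS):

```shell
# Sketch only -- follow the MLE-Bench README for authoritative instructions.
git clone https://github.com/openai/mle-bench.git
cd mle-bench
git lfs fetch --all   # competition data is stored with Git LFS
git lfs pull
pip install -e .
```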
Step 2: Configure RepoMaster
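A hypothetical sketch of such a configuration is shown below; the file name, keys, and values are all illustrative assumptions, not RepoMaster's documented schema:

```yaml
# mle_bench.yaml -- hypothetical configuration sketch
benchmark: mle-bench
tasks:
  - spaceship-titanic        # example Kaggle competition slug
model: gpt-4o                # LLM backing the agent (assumption)
limits:
  max_runtime_hours: 24
  max_memory_gb: 32
output_dir: ./results
```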
Create a configuration file for MLE-Bench tasks:
Step 3: Run Evaluation
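A hypothetical launch command; the script name and flags are assumptions about RepoMaster's CLI, not documented options:

```shell
# Hypothetical invocation -- adjust to RepoMaster's actual entry point.
python run_repomaster.py --config mle_bench.yaml --output ./results
```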
Step 4: Validate Results
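One basic validation, checking that the generated submission's ids line up with the competition's sample submission, can be sketched as follows (column names are hypothetical):

```python
import csv
import io

def matches_sample(sample_csv: str, submission_csv: str, id_column: str = "id") -> bool:
    """A submission is only gradable if its ids match the sample submission exactly."""
    def ids(text):
        return [row[id_column] for row in csv.DictReader(io.StringIO(text))]
    return ids(sample_csv) == ids(submission_csv)

sample = "id,target\n1,0\n2,0\n"
submission = "id,target\n1,0.9\n2,0.1\n"
print(matches_sample(sample, submission))  # True
```

For official scores, use the grading utilities shipped with MLE-Bench rather than ad-hoc checks like this.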
Key Insights
1. Repository Leverage is Powerful
RepoMaster’s ability to discover and adapt existing ML code significantly accelerates development:
- No reinventing: Leverages proven techniques from GitHub
- Best practices: Uses well-tested implementations
- Faster iteration: Adapts existing code rather than writing from scratch
2. Systematic Approach Matters
The strong valid submission rate (95.45%) demonstrates the value of systematic execution:
- Consistent quality: Reliable end-to-end pipeline execution
- Error handling: Graceful recovery from common issues
- Format compliance: Careful attention to submission requirements
3. Competitive ML Requires Sophistication
Gold medal performance shows advanced capabilities:
- Feature engineering: Creating meaningful predictive features
- Model selection: Choosing appropriate algorithms
- Hyperparameter tuning: Optimizing model performance
- Ensemble methods: Combining multiple models effectively
4. Room for Improvement
The 27.27% medal rate, while strong, indicates opportunities:
- Iterative refinement: Multiple experiment rounds could improve results
- Domain knowledge: Task-specific expertise could enhance performance
- Ensemble sophistication: More advanced combination strategies
- Longer exploration: Extended reasoning could find better solutions
Limitations
Current limitations on MLE-Bench:
- Computational constraints: Time and resource limits affect model training depth
- Single-shot attempts: No iterative refinement like human competitors
- Limited domain knowledge: Generic approach vs. competition-specific insights
- Ensemble complexity: Basic ensembles vs. sophisticated stacking/blending
Future Directions
Planned improvements for MLE-Bench performance:
- Multi-round experimentation: Iterative model refinement based on validation results
- Advanced ensembles: More sophisticated model combination strategies
- AutoML integration: Leveraging AutoML frameworks for hyperparameter optimization
- Domain-specific strategies: Competition-type-specific approach selection
- Feature engineering automation: Enhanced automated feature creation
Acknowledgments
We thank OpenAI for creating and open-sourcing MLE-Bench, which provides an excellent benchmark for evaluating the ML engineering capabilities of AI agents.
Learn More
MLE-Bench Repository
Explore OpenAI’s ML engineering benchmark
GitTaskBench Results
View repository-level task performance
Performance Analysis
Detailed performance metrics and analysis
Research Paper
Read the full NeurIPS 2025 paper