Performance Overview
RepoMaster achieves strong results across multiple benchmarks while keeping token usage low. This page analyzes the performance metrics, compares them with baselines, and explains what makes RepoMaster effective.
Benchmark Summary
GitTaskBench
Execution & Task Completion
- Execution Rate: 75.92%
- Pass Rate: 62.96%
- Token Usage: 154k avg
- Efficiency: ~95% fewer tokens than baselines
MLE-Bench
ML Engineering
- Valid Submissions: 95.45%
- Medal Rate: 27.27%
- Gold Medals: 22.73%
- Top 10% Performance: ~1 in 4 tasks
Detailed Performance Metrics
GitTaskBench Results
RepoMaster significantly outperforms existing frameworks on repository-level tasks.
[Figure: Execution Rate Comparison]
[Figure: Task Pass Rate Comparison]
Why the Difference? RepoMaster’s hierarchical code understanding and selective context loading enable more accurate task execution compared to approaches that process entire repositories or use simpler navigation strategies.
MLE-Bench Results
RepoMaster demonstrates strong ML engineering capabilities:
Submission Quality
| Metric | Count | Percentage | Description |
|---|---|---|---|
| Total Tasks | 44 | 100% | MLE-Bench competition tasks |
| Valid Submissions | 42 | 95.45% | Correctly formatted outputs |
| Invalid Submissions | 2 | 4.55% | Format/runtime errors |
| Medal-Winning | 12 | 27.27% | Top 40% performance |
| Gold Medals | 10 | 22.73% | Top 10% performance |
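The percentages in the table follow directly from the counts. As a quick sanity check, the short sketch below recomputes them; the counts are copied from the table, while the variable and function names are illustrative only.

```python
# Counts copied from the MLE-Bench table above.
TOTAL_TASKS = 44
counts = {
    "valid_submissions": 42,
    "invalid_submissions": 2,
    "medal_winning": 12,
    "gold_medals": 10,
}


def percentage(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


rates = {name: percentage(n, TOTAL_TASKS) for name, n in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```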
[Figure: Medal Distribution]
Token Efficiency Analysis
One of RepoMaster’s most significant advantages is its token efficiency.
Token Usage Breakdown
RepoMaster
154,000 tokens (average per GitTaskBench task)
Breakdown:
- Repository exploration: ~60k
- Task execution: ~70k
- Validation: ~24k
Baseline Frameworks
~3,080,000 tokens (estimated average, traditional approaches)
Breakdown:
- Full repo processing: ~2.5M
- Iterative attempts: ~500k
- Overhead: ~80k
Efficiency Improvement
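The 95% figure can be recovered from the two breakdowns above. The sketch below sums each breakdown and computes the relative reduction; the numbers come from this page, and the variable names are illustrative.

```python
# Token figures (in thousands) copied from the breakdowns above.
repomaster_k = {"repository_exploration": 60, "task_execution": 70, "validation": 24}
baseline_k = {"full_repo_processing": 2500, "iterative_attempts": 500, "overhead": 80}

repomaster_total = sum(repomaster_k.values())
baseline_total = sum(baseline_k.values())

# Relative reduction in tokens, and how many times smaller the RepoMaster budget is.
reduction_pct = 100 * (1 - repomaster_total / baseline_total)
ratio = baseline_total / repomaster_total

print(f"RepoMaster: {repomaster_total}k tokens, baseline: {baseline_total}k tokens")
print(f"Reduction: {reduction_pct:.0f}% ({ratio:.0f}x fewer tokens)")
```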
How We Achieve This: RepoMaster’s hierarchical structural modeling (HCT, FCG, MDG) enables intelligent navigation that loads only relevant code sections instead of processing entire repositories.
Performance Factors
What Makes RepoMaster Effective?
1. Intelligent Repository Search
Multi-Round Query Optimization:
- Initial broad search to discover candidate repositories
- Refined searches based on task-specific requirements
- Quality assessment of repository relevance
2. Hierarchical Code Understanding
Three-Layer Structural Modeling:
- HCT (Hierarchical Code Tree)
- FCG (Function Call Graph)
- MDG (Module Dependency Graph)
HCT (Hierarchical Code Tree)
Purpose: High-level code organization
Captures:
- File structure and organization
- Module relationships
- Package hierarchy
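To make the idea of a hierarchical code tree concrete, here is a minimal sketch that walks a Python repository and records, per file, its top-level classes (with their methods) and functions. It uses only the standard library `ast` and `os` modules. This is an illustration of the general technique, not RepoMaster’s actual HCT implementation; the function name `build_code_tree` and the output layout are assumptions.

```python
import ast
import os
import tempfile


def build_code_tree(root: str) -> dict:
    """Map each .py file under root to its top-level classes and functions."""
    tree = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                try:
                    module = ast.parse(handle.read(), filename=path)
                except SyntaxError:
                    continue  # skip files that do not parse
            entry = {"classes": {}, "functions": []}
            for node in module.body:
                if isinstance(node, ast.ClassDef):
                    # Record the methods defined directly on the class.
                    entry["classes"][node.name] = [
                        item.name
                        for item in node.body
                        if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
                    ]
                elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    entry["functions"].append(node.name)
            tree[os.path.relpath(path, root)] = entry
    return tree


if __name__ == "__main__":
    # Tiny throwaway repository used only to demonstrate the output shape.
    with tempfile.TemporaryDirectory() as repo:
        os.makedirs(os.path.join(repo, "pkg"))
        with open(os.path.join(repo, "pkg", "model.py"), "w", encoding="utf-8") as f:
            f.write("class Net:\n    def forward(self):\n        pass\n\ndef train():\n    pass\n")
        print(build_code_tree(repo))
```

A tree like this is cheap to build and lets a navigator decide which files to open before loading any source text into the model context.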
3. Granular Code Navigation
Multi-Level Access:
4. Context-Aware Exploration
Adaptive Strategy:
- Simple tasks: Minimal exploration, direct execution
- Complex tasks: Deep exploration, multi-file analysis
- Uncertain cases: Iterative refinement with validation
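The three cases above can be sketched as a small dispatch function. Everything in this block is hypothetical: the `ExplorationPlan` type, the `choose_plan` function, the file-count threshold, and the round limits are illustrative assumptions, not RepoMaster’s API or its actual decision rule.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: tasks touching at most this many files count as simple.
SIMPLE_TASK_MAX_FILES = 2


@dataclass(frozen=True)
class ExplorationPlan:
    strategy: str
    max_rounds: int            # hypothetical cap on exploration rounds
    validate_each_round: bool  # whether to validate after every round


def choose_plan(estimated_files: Optional[int]) -> ExplorationPlan:
    """Pick a plan from a rough estimate of how many files the task touches.

    None means no reliable estimate is available (the uncertain case).
    """
    if estimated_files is None:
        return ExplorationPlan("iterative_refinement", max_rounds=5, validate_each_round=True)
    if estimated_files <= SIMPLE_TASK_MAX_FILES:
        return ExplorationPlan("direct_execution", max_rounds=1, validate_each_round=False)
    return ExplorationPlan("deep_exploration", max_rounds=3, validate_each_round=False)


for estimate in (1, 12, None):
    print(estimate, "->", choose_plan(estimate).strategy)
```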
Performance Characteristics
Task Complexity vs Success Rate
| Task Complexity | Execution Rate | Pass Rate | Avg Tokens |
|---|---|---|---|
| Simple (1-2 files) | 92% | 85% | 80k |
| Medium (3-10 files) | 78% | 65% | 145k |
| Complex (10+ files) | 61% | 48% | 220k |
Insight: Performance degrades gracefully with complexity. Even on complex tasks (10+ files), RepoMaster keeps a 61% execution rate and a 48% pass rate, while baseline systems often fail completely.
Domain Performance
| Domain | Tasks | Pass Rate | Notable Strength |
|---|---|---|---|
| Data Processing | 15 | 73% | Strong library discovery |
| Machine Learning | 12 | 67% | Good framework understanding |
| Web Development | 8 | 58% | API comprehension |
| Computer Vision | 6 | 63% | Model integration |
| NLP | 5 | 60% | Pipeline orchestration |
[Figure: Language Distribution]
Cost-Benefit Analysis
API Cost Savings
Based on GPT-4 pricing (approximate):
| Framework | Tokens/Task | Cost/Task | Cost/100 Tasks |
|---|---|---|---|
| RepoMaster | 154k | $1.00 | $100 |
| Traditional | 3,080k | $20.00 | $2,000 |
| Savings | 95% | $19.00 | $1,900 |
Time Efficiency
| Framework | Avg Time/Task | Throughput/Hour |
|---|---|---|
| RepoMaster | 3.5 minutes | ~17 tasks |
| Traditional | 12 minutes | ~5 tasks |
| Improvement | 3.4x faster | 3.4x more |
ROI: For teams running 100+ tasks/month, the figures above imply API savings of about $1,900 per 100 tasks and roughly 3.4x higher throughput.
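To estimate savings for a given workload, the per-task figures from the two tables above can be combined into a small calculator. The sketch below assumes tasks run sequentially and that the approximate per-task costs and times hold on average; the names `FrameworkProfile` and `monthly_savings` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameworkProfile:
    cost_per_task_usd: float
    minutes_per_task: float


# Per-task figures copied from the cost and time tables above.
REPOMASTER = FrameworkProfile(cost_per_task_usd=1.00, minutes_per_task=3.5)
TRADITIONAL = FrameworkProfile(cost_per_task_usd=20.00, minutes_per_task=12.0)


def monthly_savings(tasks_per_month: int) -> dict:
    """Estimate dollars and hours saved per month, assuming sequential runs."""
    usd_saved = tasks_per_month * (
        TRADITIONAL.cost_per_task_usd - REPOMASTER.cost_per_task_usd
    )
    hours_saved = tasks_per_month * (
        TRADITIONAL.minutes_per_task - REPOMASTER.minutes_per_task
    ) / 60
    return {"usd_saved": usd_saved, "hours_saved": hours_saved}


result = monthly_savings(100)
print(f"USD saved: {result['usd_saved']:.2f}, hours saved: {result['hours_saved']:.1f}")
```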
Performance Optimization Tips
1. Task Description Quality
- Good Example
- Poor Example
2. Repository Hints
Providing repository suggestions can improve success:
3. Resource Configuration
Adjust parameters based on task complexity:
Limitations and Edge Cases
Current Limitations
- Language Support: Primarily optimized for Python
- Documentation Dependence: Poor documentation reduces success rate
- Complex Dependencies: Intricate dependency chains may fail
- Novel Tasks: Tasks requiring completely new approaches are challenging
Known Edge Cases
Undocumented Repositories
Issue: Repositories without README or documentation are harder to understand.
Impact: ~30% reduction in success rate
Mitigation: Code structure analysis can partially compensate
Large Codebases
Issue: Repositories with >100k lines may exceed context limits.
Impact: Slower performance, potential failures
Mitigation: Hierarchical analysis helps, but very large repos remain challenging
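Before submitting a task, it can help to check whether a repository falls into this category. The sketch below counts lines in Python source files and compares the total with the 100k-line figure mentioned above. The helper names are illustrative, and counting only `.py` files is an assumption made for simplicity.

```python
import os
import tempfile

LINE_LIMIT = 100_000  # threshold taken from the edge case described above


def count_source_lines(root: str, extensions: tuple = (".py",)) -> int:
    """Count newline-delimited lines in files under root with the given extensions."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(extensions):
                # Read as bytes so undecodable files do not raise.
                with open(os.path.join(dirpath, filename), "rb") as handle:
                    total += sum(1 for _ in handle)
    return total


def is_large_codebase(root: str, limit: int = LINE_LIMIT) -> bool:
    """Return True when the repository exceeds the line limit."""
    return count_source_lines(root) > limit


if __name__ == "__main__":
    # Demonstration on a throwaway directory with a single three-line file.
    with tempfile.TemporaryDirectory() as repo:
        with open(os.path.join(repo, "app.py"), "w", encoding="utf-8") as f:
            f.write("a = 1\nb = 2\nprint(a + b)\n")
        print(count_source_lines(repo), is_large_codebase(repo))
```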
Specialized Domains
Issue: Highly specialized fields (e.g., quantum computing) lack training data.
Impact: Reduced understanding of domain-specific concepts
Mitigation: Provide detailed task descriptions with domain context
Environment Complexity
Issue: Tasks requiring specific OS, hardware, or system dependencies.
Impact: Setup failures even with correct code understanding
Mitigation: Use Docker containers or provide pre-configured environments
Future Performance Improvements
Planned Enhancements
- Multi-Language Support
  - Enhanced JavaScript/TypeScript understanding
  - Java and C++ repository navigation
  - Language-agnostic structural modeling
- Iterative Refinement
  - Multi-round experimentation on MLE-Bench
  - Validation-based improvement
  - Learning from execution failures
- Advanced Repository Analysis
  - Test case understanding
  - Example-based learning
  - Cross-repository pattern recognition
- Efficiency Optimizations
  - Caching of repository structures
  - Incremental code loading
  - Parallel exploration strategies
Reproducibility
All benchmarks are fully reproducible:
Results may vary slightly due to:
- LLM non-determinism (even with temperature=0)
- GitHub repository updates
- Dependency version changes
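One way to make dependency drift visible is to store an environment snapshot next to each set of results. The sketch below uses only the standard library (`platform` and `importlib.metadata`) to record the Python version, the platform, and the installed package versions. It is a suggestion, not part of RepoMaster’s tooling, and the function name `environment_snapshot` is illustrative.

```python
import json
import platform
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
print(f"Python {snapshot['python']} with {len(snapshot['packages'])} packages recorded")

# The snapshot is plain data, so it can be saved alongside benchmark results.
serialized = json.dumps(snapshot, indent=2)
```

Comparing two snapshots then shows which package versions changed between runs whose results differ.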
Performance Visualization
Comprehensive performance comparison showing RepoMaster’s advantages in execution rate, pass rate, and efficiency.
Citation
If you use RepoMaster’s benchmarks or results in your research:
Learn More
GitTaskBench Details
Repository-level benchmark analysis
MLE-Bench Details
ML engineering evaluation results
Get Started
Try RepoMaster yourself
Research Paper
Read the full NeurIPS 2025 paper