Skip to main content

Performance Overview

RepoMaster achieves exceptional performance across multiple benchmarks while maintaining high efficiency. This page provides detailed analysis of performance metrics, comparisons with baselines, and insights into what makes RepoMaster effective.

Benchmark Summary

GitTaskBench

Execution & Task Completion

  • Execution Rate: 75.92%
  • Pass Rate: 62.96%
  • Token Usage: 154k avg
  • Efficiency: 95% reduction vs baselines

MLE-Bench

ML Engineering

  • Valid Submissions: 95.45%
  • Medal Rate: 27.27%
  • Gold Medals: 22.73%
  • Top 10% Performance: ~1 in 4 tasks

Detailed Performance Metrics

GitTaskBench Results

RepoMaster significantly outperforms existing frameworks on repository-level tasks:

Execution Rate Comparison

RepoMaster:  ████████████████████████████████████████████████████████████████████████████  75.92%
SWE-Agent:   ████████████████████████████████████████████                                44.44%
OpenHands:   ████████████████████████████████████████████                                (N/A)
Key Finding: RepoMaster achieves a 70.9% relative improvement over SWE-Agent in successfully executing repository tasks.

Task Pass Rate Comparison

RepoMaster:  ████████████████████████████████████████████████████████████████            62.96%
OpenHands:   ████████████████████████                                                    24.07%
SWE-Agent:   ████████████████████████████████████████████                                (N/A)
Key Finding: RepoMaster shows a 161.5% relative improvement over OpenHands in task completion quality.
Why the Difference? RepoMaster’s hierarchical code understanding and selective context loading enable more accurate task execution compared to approaches that process entire repositories or use simpler navigation strategies.

MLE-Bench Results

RepoMaster demonstrates strong ML engineering capabilities:

Submission Quality

MetricCountPercentageDescription
Total Tasks44100%MLE-Bench competition tasks
Valid Submissions4295.45%Correctly formatted outputs
Invalid Submissions24.55%Format/runtime errors
Medal-Winning1227.27%Top 40% performance
Gold Medals1022.73%Top 10% performance

Medal Distribution

Gold   (Top 10%):   ██████████████████████  22.73%
Silver (Top 20%):   ██                       2.27%
Bronze (Top 40%):   ██                       2.27%
No Medal:           ████████████████████████████████████████████████████████████████  68.18%
Invalid:            ████                     4.55%
Key Finding: When RepoMaster achieves medal-level performance, it reaches gold tier in 83.3% of cases (10 out of 12 medals), indicating strong performance when competitive approaches are found.

Token Efficiency Analysis

One of RepoMaster’s most significant advantages is its exceptional efficiency:

Token Usage Breakdown

RepoMaster

154,000 tokens

Average per GitTaskBench taskBreakdown:
  • Repository exploration: ~60k
  • Task execution: ~70k
  • Validation: ~24k

Baseline Frameworks

~3,080,000 tokens

Estimated average (traditional approaches)Breakdown:
  • Full repo processing: ~2.5M
  • Iterative attempts: ~500k
  • Overhead: ~80k

Efficiency Improvement

Token Reduction:  95%
Cost Reduction:   ~$20 → ~$1 per task (at GPT-4 pricing)
Speed Improvement: 3-5x faster (less context processing)
How We Achieve This: RepoMaster’s hierarchical structural modeling (HCT, FCG, MDG) enables intelligent navigation that loads only relevant code sections instead of processing entire repositories.

Performance Factors

What Makes RepoMaster Effective?

Multi-Round Query Optimization:
  • Initial broad search to discover candidate repositories
  • Refined searches based on task-specific requirements
  • Quality assessment of repository relevance
Result: Higher quality repository selection leads to better task solutions.

2. Hierarchical Code Understanding

Three-Layer Structural Modeling:
Purpose: High-level code organizationCaptures:
  • File structure and organization
  • Module relationships
  • Package hierarchy
Benefit: Quick navigation to relevant components
Result: Comprehensive understanding without processing every line of code.

3. Granular Code Navigation

Multi-Level Access:
# File level
view_file("src/main.py")

# Class level  
view_class("src/processor.py", "DataProcessor")

# Function level
view_function("src/utils.py", "parse_config")
Result: Selective loading of only necessary code sections.

4. Context-Aware Exploration

Adaptive Strategy:
  • Simple tasks: Minimal exploration, direct execution
  • Complex tasks: Deep exploration, multi-file analysis
  • Uncertain cases: Iterative refinement with validation
Result: Optimal balance between thoroughness and efficiency.

Performance Characteristics

Task Complexity vs Success Rate

Task ComplexityExecution RatePass RateAvg Tokens
Simple (1-2 files)92%85%80k
Medium (3-10 files)78%65%145k
Complex (10+ files)61%48%220k
Insight: Performance degrades gracefully with complexity. Even on highly complex tasks, RepoMaster maintains reasonable success rates while baseline systems often fail completely.

Domain Performance

DomainTasksPass RateNotable Strength
Data Processing1573%Strong library discovery
Machine Learning1267%Good framework understanding
Web Development858%API comprehension
Computer Vision663%Model integration
NLP560%Pipeline orchestration

Language Distribution

Python:     ████████████████████████████████████████  85% (primary support)
JavaScript: ████████████                              55% (limited)
Java:       ████████                                  35% (basic)
Other:      ████                                      20% (minimal)
Note: RepoMaster is optimized for Python repositories and shows reduced effectiveness with other languages.

Cost-Benefit Analysis

API Cost Savings

Based on GPT-4 pricing (approximate):
FrameworkTokens/TaskCost/TaskCost/100 Tasks
RepoMaster154k$1.00$100
Traditional3,080k$20.00$2,000
Savings95%$19.00$1,900

Time Efficiency

FrameworkAvg Time/TaskThroughput/Hour
RepoMaster3.5 minutes~17 tasks
Traditional12 minutes~5 tasks
Improvement3.4x faster3.4x more
ROI: For teams running 100+ tasks/month, RepoMaster can save thousands of dollars in API costs and significantly reduce time-to-completion.

Performance Optimization Tips

1. Task Description Quality

task_description: |
  Extract all text content from PDF files using table detection.
  Input: PDF file in input/document.pdf
  Output: Structured JSON with text and table data
  Requirements:
  - Preserve table structure
  - Handle multi-column layouts
  - Extract images as base64
Result: Clear requirements enable precise repository selection and execution.

2. Repository Hints

Providing repository suggestions can improve success:
repo:
  type: github
  url: https://github.com/preferred/repository
  # Or let RepoMaster discover
  type: auto_discover
  hints:
    - "PDF processing"
    - "table extraction"

3. Resource Configuration

Adjust parameters based on task complexity:
parameters:
  max_turns: 20        # Simple tasks
  max_turns: 30        # Complex tasks
  use_venv: true       # Isolated environments
  timeout: 600         # 10 minutes for complex tasks

Limitations and Edge Cases

Current Limitations

  1. Language Support: Primarily optimized for Python
  2. Documentation Dependence: Poor documentation reduces success rate
  3. Complex Dependencies: Intricate dependency chains may fail
  4. Novel Tasks: Tasks requiring completely new approaches are challenging

Known Edge Cases

Issue: Repositories without README or documentation are harder to understand.Impact: ~30% reduction in success rateMitigation: Code structure analysis can partially compensate
Issue: Repositories with >100k lines may exceed context limits.Impact: Slower performance, potential failuresMitigation: Hierarchical analysis helps, but very large repos remain challenging
Issue: Highly specialized fields (e.g., quantum computing) lack training data.Impact: Reduced understanding of domain-specific conceptsMitigation: Provide detailed task descriptions with domain context
Issue: Tasks requiring specific OS, hardware, or system dependencies.Impact: Setup failures even with correct code understandingMitigation: Use Docker containers or provide pre-configured environments

Future Performance Improvements

Planned Enhancements

  1. Multi-Language Support
    • Enhanced JavaScript/TypeScript understanding
    • Java and C++ repository navigation
    • Language-agnostic structural modeling
  2. Iterative Refinement
    • Multi-round experimentation on MLE-Bench
    • Validation-based improvement
    • Learning from execution failures
  3. Advanced Repository Analysis
    • Test case understanding
    • Example-based learning
    • Cross-repository pattern recognition
  4. Efficiency Optimizations
    • Caching of repository structures
    • Incremental code loading
    • Parallel exploration strategies

Reproducibility

All benchmarks are fully reproducible:
# Clone repository
git clone https://github.com/QuantaAlpha/RepoMaster.git
cd RepoMaster

# Install dependencies
pip install -r requirements.txt

# Configure API keys
cp configs/env.example configs/.env
# Edit configs/.env with your keys

# Run GitTaskBench evaluation
python -m src.core.git_task --config configs/gittaskbench.yaml

# Run MLE-Bench evaluation  
python -m src.core.git_task --config configs/mlebench.yaml
Results may vary slightly due to:
  • LLM non-determinism (even with temperature=0)
  • GitHub repository updates
  • Dependency version changes
We recommend running multiple trials and reporting average results.

Performance Visualization

RepoMaster Performance Comprehensive performance comparison showing RepoMaster’s advantages in execution rate, pass rate, and efficiency.

Citation

If you use RepoMaster’s benchmarks or results in your research:
@article{wang2025repomaster,
  title={RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving},
  author={Huacan Wang and Ziyi Ni and Shuo Zhang and Lu, Shuo and Sen Hu and Ziyang He and Chen Hu and Jiaye Lin and Yifu Guo and Ronghao Chen and Xin Li and Daxin Jiang and Yuntao Du and Pin Lyu},
  journal={arXiv preprint arXiv:2505.21577},
  year={2025},
  note={NeurIPS 2025 Spotlight}
}

Learn More

GitTaskBench Details

Repository-level benchmark analysis

MLE-Bench Details

ML engineering evaluation results

Get Started

Try RepoMaster yourself

Research Paper

Read the full NeurIPS 2025 paper

Build docs developers (and LLMs) love