
Overview

RepoMaster’s repository exploration system provides intelligent code analysis through hierarchical understanding and tool-assisted exploration. The system builds a comprehensive model of code structure, dependencies, and semantics.

Architecture

CodeExplorer Initialization

Setup Process

Constructor (agent_code_explore.py:25-75):
class CodeExplorer(BaseCodeExplorer):
    def __init__(
        self,
        local_repo_path: str,          # Repository path
        work_dir: str,                 # Working directory
        remote_repo_path: Optional[str] = None,  # Docker path
        llm_config: Optional[dict] = None,
        code_execution_config: Optional[dict] = None,
        task_type: Optional[str] = None,
        use_venv: bool = False,
        task_id: Optional[str] = None,
        is_cleanup_venv: bool = True,
        args: dict = {}
    ):
        # Store paths and task configuration for later setup steps
        self.local_repo_path = local_repo_path
        self.remote_repo_path = remote_repo_path
        self.task_type = task_type
        self.args = args
        
        # Call base class for venv management
        super().__init__(work_dir, use_venv, task_id, is_cleanup_venv)
        
        # Configuration
        self.llm_config = get_llm_config('code_explore') if llm_config is None else llm_config
        self.code_execution_config = {"work_dir": work_dir, "use_docker": False} \
                                     if code_execution_config is None else code_execution_config
        
        # Timeout: 2 hours
        self.code_execution_config['timeout'] = 2 * 60 * 60
        
        # Setup tool library and agents
        self._setup_tool_library()
        self._setup_agents()
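The default-substitution pattern in the constructor (fall back to defaults only when the caller passes `None`, then force the timeout) can be sketched in isolation. `resolve_exec_config` is an illustrative helper, not part of the RepoMaster API:

```python
from typing import Optional

def resolve_exec_config(work_dir: str, overrides: Optional[dict] = None) -> dict:
    """Apply caller overrides on top of defaults, then force the timeout.

    Mirrors the constructor: defaults apply only when the caller passes
    None, but the two-hour timeout is always enforced afterwards.
    """
    config = {"work_dir": work_dir, "use_docker": False} if overrides is None else dict(overrides)
    config["timeout"] = 2 * 60 * 60  # overrides any caller-supplied timeout
    return config

default_cfg = resolve_exec_config("/tmp/work")
custom_cfg = resolve_exec_config("/tmp/work", {"use_docker": True, "timeout": 60})
```

Note that the timeout is written after the merge, so even an explicit caller-supplied timeout is replaced, matching the constructor's behavior.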

Tool Library Initialization

Method (agent_code_explore.py:76-95):
def _setup_tool_library(self):
    """Setup tool library"""
    if self.local_repo_path:
        self.code_library = CodeExplorerTools(
            self.local_repo_path,
            work_dir=self.work_dir,
            docker_work_dir=self.docker_path_prefix
        )
    else:
        self.code_library = None
    
    # Generate important modules summary
    if self.local_repo_path and self.args.get("repo_init", True):
        self.code_importance = self.code_library.builder.generate_llm_important_modules(
            max_tokens=8000
        )
    else:
        self.code_importance = ""
Initialization Time: The tool library initialization includes tree-sitter parsing and important module detection, which can take 10-30 seconds for large repositories.

Repository Summary Generation

The generate_repository_summary function creates intelligent summaries of repositories (repo_summary.py:8-112).

Importance-Based Filtering

File Importance Analysis (repo_summary.py:26-67):
def judge_file_is_important(
    code_list: list[dict]
) -> list[dict]:
    """Judge whether files are important for understanding the repository"""
    
    judge_prompt = f"""
    You are an assistant that helps developers understand code repositories.
    Judge whether the current file is important.
    
    Rules:
    1. README.md with repository description - very important
    2. Configuration files, test files, example files - very important
    3. Files with information important for understanding - very important
    4. Duplicate file contents - keep only one
    
    Return JSON list (sorted by importance):
    [
        {{
            "file_path": "path",
            "is_important": "yes" or "no"
        }}
    ]
    """
    
    messages = [
        {"role": "system", "content": judge_prompt},
        {"role": "user", "content": json.dumps(code_list, ensure_ascii=False)}
    ]
    
    response_dict = AzureGPT4Chat().chat_with_message(
        messages,
        json_format=True
    )
    
    # Filter to only important files
    out_list = [
        file for judge_result in response_dict
        if judge_result['is_important'].lower() == 'yes'
        for file in code_list
        if judge_result['file_path'] == file['file_path']
    ]
    return out_list
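The join between the model's verdicts and the original file list can be reproduced with canned data. `filter_important_files` and the stubbed `verdicts` below are illustrative, standing in for the real `AzureGPT4Chat` response:

```python
def filter_important_files(code_list: list, verdicts: list) -> list:
    """Keep files the model marked important, preserving verdict order."""
    by_path = {f["file_path"]: f for f in code_list}
    return [
        by_path[v["file_path"]]
        for v in verdicts
        if v["is_important"].lower() == "yes" and v["file_path"] in by_path
    ]

files = [
    {"file_path": "README.md", "file_content": "# Demo"},
    {"file_path": "notes.txt", "file_content": "scratch"},
]
# Canned verdicts standing in for the model's JSON response
verdicts = [
    {"file_path": "README.md", "is_important": "yes"},
    {"file_path": "notes.txt", "is_important": "no"},
]
kept = filter_important_files(files, verdicts)
```

Indexing by path makes the join O(n) rather than the nested-loop O(n²) of the quoted comprehension, with the same result when file paths are unique.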

README Summary Extraction

Summary Generation (repo_summary.py:114-158):
def get_readme_summary(code_content: str, history_summary: dict):
    """Extract summary from README and important documentation"""
    
    system_prompt = """
    You are an assistant that helps developers understand code repositories.
    Generate summary based on README and documentation files.
    
    Rules:
    1. Focus on main functions, architecture, and usage
    2. Use <cite>referenced content</cite> for important code blocks
    3. Keep summary concise, comprehensive, and informative
    4. Include installation methods, dependencies, example usage
    5. Ignore disclaimers and unimportant content
    6. Avoid duplicating content from history_summary
    """
    
    prompt = f"""
    README and important documents:
    <code_content>
    {code_content}
    </code_content>
    
    Previous summaries:
    <history_summary>
    {history_summary}
    </history_summary>
    """
    
    response = AzureGPT4Chat().chat_with_message(
        [{"role": "system", "content": system_prompt},
         {"role": "user", "content": prompt}],
        json_format=True
    )
    return response

Token-Aware Processing

Smart Splitting (repo_summary.py:69-83, 85-112):
def split_code_lists(code_list: list[dict]):
    """Split code lists by token count (max 50000 tokens)"""
    max_token = 50000
    out_code_list = []
    split_code_list = []
    
    for file in code_list:
        if get_code_abs_token(str(file)) > max_token:
            continue  # Skip files too large
        
        split_code_list.append(file)
        if get_code_abs_token(json.dumps(split_code_list)) > max_token:
            out_code_list.append(split_code_list)
            split_code_list = []
    
    if split_code_list:
        out_code_list.append(split_code_list)
    
    return out_code_list
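The batching logic above can be exercised standalone. `approx_tokens` below is a rough stand-in for `get_code_abs_token` (about four characters per token), and the small budget is chosen just to force a split:

```python
import json

def approx_tokens(text: str) -> int:
    """Crude stand-in for get_code_abs_token: ~4 characters per token."""
    return len(text) // 4

def split_by_token_budget(code_list: list, max_tokens: int = 50) -> list:
    """Batch files so each batch stays near the budget; skip oversized files."""
    batches, current = [], []
    for f in code_list:
        if approx_tokens(str(f)) > max_tokens:
            continue  # a single file already exceeds the budget
        current.append(f)
        if approx_tokens(json.dumps(current)) > max_tokens:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

files = [{"file_path": f"f{i}.py", "file_content": "x" * 40} for i in range(4)]
batches = split_by_token_budget(files)
```

As in the original, a batch is emitted only after it crosses the budget, so each batch may slightly exceed the limit by one file.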

# Main summary logic
all_file_content = json.dumps(code_list, ensure_ascii=False)
if get_code_abs_token(all_file_content) < max_important_files_token:
    return code_list  # Small enough, return as-is

# Process in batches
important_files = []
for s_code_list in split_code_lists(code_list):
    important_files.extend(judge_file_is_important(s_code_list))

# Generate summaries for important files
repository_summary = {}
for file in important_files:
    summary = get_readme_summary(file['file_content'], repository_summary)
    if '<none>' not in str(summary).lower():
        if get_code_abs_token(json.dumps(repository_summary) + str(summary)) < max_important_files_token:
            repository_summary[file['file_path']] = summary
        else:
            break  # Token limit reached
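The accumulation loop above can be sketched with a stubbed summarizer. `accumulate_summaries`, the `summarize` callable, and the character-based budget are illustrative stand-ins for `get_readme_summary` and `get_code_abs_token`:

```python
import json

def accumulate_summaries(important_files: list, budget_tokens: int, summarize) -> dict:
    """Add per-file summaries until the running total would exceed the budget."""
    repository_summary = {}
    for f in important_files:
        summary = summarize(f["file_content"], repository_summary)
        if "<none>" in str(summary).lower():
            continue  # summarizer found nothing new for this file
        projected = len(json.dumps(repository_summary) + str(summary)) // 4
        if projected >= budget_tokens:
            break  # budget exhausted, stop summarizing
        repository_summary[f["file_path"]] = summary
    return repository_summary

files = [{"file_path": "README.md", "file_content": "A demo repo"}]
result = accumulate_summaries(files, budget_tokens=100,
                              summarize=lambda content, history: f"Summary: {content}")
```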

Analysis Tools

The CodeExplorerTools class provides comprehensive code analysis capabilities.

Tool Registration

Registered Tools (agent_code_explore.py:235-255):
def _register_tools(self):
    """Register tool functions to the executor agent"""
    register_toolkits(
        [
            self.code_library.list_repository_structure,
            self.code_library.search_keyword_include_code,
            self.code_library.view_class_details,
            self.code_library.view_function_details,
            self.code_library.find_references,
            self.code_library.find_dependencies,
            self.code_library.view_file_content,
            self.issue_solution_search,
            FileEditTool.edit,
        ],
        self.explore,     # Assistant agent
        self.executor,    # User proxy agent
    )
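A minimal sketch of what tool registration provides, name-based dispatch to Python callables, assuming a simple dict registry rather than the actual `register_toolkits` machinery:

```python
from typing import Callable, Dict

class ToolRegistry:
    """Illustrative stand-in for register_toolkits: name -> callable dispatch."""
    def __init__(self):
        self._tools: Dict[str, Callable] = {}

    def register(self, *tools: Callable) -> None:
        for tool in tools:
            self._tools[tool.__name__] = tool

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](*args, **kwargs)

# Hypothetical stand-ins for two of the registered tools
def list_repository_structure():
    return ["src/", "README.md"]

def view_file_content(file_path: str):
    return f"contents of {file_path}"

registry = ToolRegistry()
registry.register(list_repository_structure, view_file_content)
```

In the real system the executor agent resolves the model's tool-call requests to these callables in much the same way.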

Core Analysis Tools

Structure Listing

list_repository_structure() - Get directory tree and file organization

Keyword Search

search_keyword_include_code(keyword) - Find keyword occurrences in code

Class Details

view_class_details(class_name) - Inspect class structure and methods

Function Details

view_function_details(function_name) - View function signatures and implementations

Reference Finding

find_references(symbol) - Find where symbols are used

Dependency Analysis

find_dependencies(module) - Analyze import relationships

File Viewing

view_file_content(file_path) - Read file contents

Issue Solutions

issue_solution_search(issue) - Find solutions for programming issues

Web-Based Problem Solving (agent_code_explore.py:209-233):
async def issue_solution_search(
    self, 
    issue_description: Annotated[str, "Description of programming issue"]
) -> str:
    """
    Search for solutions to specific programming issues or errors.
    
    Sources: GitHub, Stack Overflow, official docs, forums
    
    Returns:
        String containing solution information with:
        - Concise summary
        - Source URL
        - Source name (e.g., "Stack Overflow", "GitHub Issue")
    """
    query = f"""
    Please search for solutions to the following programming issue:
    <issue_description>
    {issue_description}
    </issue_description>
    
    Steps:
    1. Search the web for solutions, code snippets, or discussions
    2. Prioritize well-explained, reputable, highly-rated solutions
    3. Select up to 3 most relevant solutions
    4. For each solution, provide summary, source URL, source name
    5. Present findings as clear, readable text (use markdown)
    
    If no relevant solutions found, indicate that.
    """
    return await self.issue_searcher.a_web_agent_answer(query)
Smart Search: The issue solution search automatically queries multiple sources and ranks results by relevance and reputation.

Conversation Flow

Analysis Execution

Main Analysis Method (agent_code_explore.py:257-317):
async def analyze_code(self, task: str, max_turns: int = 40) -> str:
    """
    Analyze code repository and complete specific tasks
    
    Args:
        task: User's programming task description
        max_turns: Maximum conversation turns
        
    Returns:
        Analysis results and implementation plan
    """
    # Reset tool call count
    self.task = task
    self.current_tool_call_count = 0
    
    # Set initial message based on task type
    if self.task_type == "general":
        initial_message = CODE_ASSISTANT_PROMPT.format(
            task=task, work_dir=self.work_dir
        )
    else:
        initial_message = USER_EXPLORER_PROMPT.format(
            task=task,
            work_dir=self.work_dir,
            remote_repo_path=self.remote_repo_path,
            code_importance=self.code_importance
        )
    
    # Handle restart with history summary
    history_message_list = []
    if self.is_restart and self.restart_count < 2:
        history_message_list = self.executor.chat_messages.get(self.explore, [])
        initial_message = self.summary_chat_history(task, history_message_list)
        self.restart_count += 1
        self.is_restart = False
        history_message_list = json.loads(initial_message)
    
    # Start conversation
    chat_result = await self.executor.a_initiate_chat(
        self.explore,
        message=initial_message,
        max_turns=max_turns,
        history_message_load=history_message_list
    )
    
    # Handle restart if needed
    if self.is_restart and self.restart_count < 2:
        return await self.analyze_code(task, max_turns)
    
    # Extract final result
    messages = chat_result.chat_history
    final_answer = chat_result.summary.strip()
    
    return final_answer
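The restart handling in `analyze_code` (summarize the history, then retry, at most twice) reduces to a bounded-retry loop. The sketch below uses stub callables in place of the real agent conversation:

```python
def run_with_restarts(run_conversation, summarize_history, max_restarts: int = 2):
    """Re-run a conversation with a summarized history when it overflows.

    run_conversation(history) -> (result, overflowed)
    summarize_history(history) -> compressed history for the next attempt
    """
    history = []
    result = None
    for _ in range(max_restarts + 1):
        result, overflowed = run_conversation(history)
        if not overflowed:
            return result
        history = summarize_history(history + [result])
    return result  # give up after max_restarts, return the last attempt

# Stub conversation: overflow on the first attempt only
attempts = []
def fake_run(history):
    attempts.append(len(history))
    return ("answer", len(attempts) == 1)

final = run_with_restarts(fake_run, summarize_history=lambda h: ["summary"])
```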

Token Limit Management

Termination Logic (agent_code_explore.py:97-142):
def token_limit_termination(self, msg):
    """Check if token limit is reached to decide termination"""
    
    def check_tool_call(msg):
        if msg.get("tool_calls", []):
            return True
        if msg.get("tool_response", []):
            return True
        return False
    
    if msg is None:
        return False
    
    content = msg.get("content", "")
    if isinstance(content, str):
        content = content.strip()
    
    # Check original termination conditions
    original_termination = (
        content and 
        (len(content.split("TERMINATE")[-1]) < 3 or 
         len(content.split("<TERMINATE>")[-1]) < 2)
    )
    
    if (not check_tool_call(msg)) and (not content):
        return True
    
    # Terminate if original conditions met
    if (original_termination and 
        check_code_block(content) is None and
        not check_tool_call(msg)):
        self.is_restart = False
        return True
    
    # Calculate total token count
    messages = self.executor.chat_messages.get(self.explore, [])
    total_tokens = sum(
        get_code_abs_token(str(m.get("content", "")))
        for m in messages
    )
    
    # If over limit (80000 tokens), trigger restart
    if total_tokens > self.limit_restart_tokens:
        self.is_restart = True
        self.chat_turns += len(messages) - 1
        return True
    
    return False
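The token-budget check at the end of the method can be isolated into a small predicate. `approx_tokens` is a rough character-count stand-in for `get_code_abs_token`:

```python
def approx_tokens(text) -> int:
    """Rough stand-in for get_code_abs_token: ~4 characters per token."""
    return len(str(text)) // 4

def should_restart(messages: list, limit_tokens: int) -> bool:
    """True when the accumulated message content exceeds the token budget."""
    total = sum(approx_tokens(m.get("content", "")) for m in messages)
    return total > limit_tokens

short_msgs = [{"content": "ok"}] * 3
long_msgs = [{"content": "x" * 400}] * 100  # roughly 10000 tokens
```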
Token Limits:
  • Token limit for summary: 2000 tokens (agent_code_explore.py:54)
  • Restart token limit: 80000 tokens (agent_code_explore.py:55)
  • Maximum important modules: 8000 tokens (agent_code_explore.py:91)

Task-Specific Prompts

General Code Assistant

Enhanced Task Description (agent_scheduler.py:311-329):
enhanced_task = f"""
You are a general programming assistant. Please help with the following task:

{task_description}

As a programming assistant, you can:
- Write and execute code to solve problems
- Provide programming guidance and explanations  
- Create practical examples and demonstrations
- Debug and troubleshoot issues
- Implement algorithms and data structures
- Explain programming concepts
- Create utility scripts and tools

Working directory: {work_dir}

Please provide comprehensive help including code examples, 
explanations, and practical solutions.
"""

Repository Exploration

Task Prompt Components (git_task.py:171-191):
task_prompt = f"""
### Task Description
{task_description}

#### Repository Path (Absolute Path): 
Understanding Guide: ['Read README.md to understand basic functionality']

#### File Paths
- Input file paths and descriptions:
{input_data}

- Output file directory: 
Results must be saved in the {output_dir_path} directory.

#### Additional Notes
**Core Objective**: Quickly understand and analyze the code repository, 
generate and execute necessary code to efficiently complete user tasks.
"""

Virtual Environment Management

Inherited from BaseCodeExplorer for safe code execution.

Environment Loading

Load or Create venv (base_code_explorer.py:30-95):
def _load_venv_context(
    self,
    venv_dir=None,
    is_clear_venv=None,
    base_venv_path=None
):
    """Load virtual environment, create if doesn't exist"""
    
    # Determine venv path
    if venv_dir is None:
        default_venvs_dir = './.venvs'
        self.venv_path = os.path.join(default_venvs_dir, "persistent_venv")
    else:
        self.venv_path = os.path.join(venv_dir, "persistent_venv")
    
    # Copy from base environment if available
    if base_venv_path and os.path.exists(base_venv_path):
        if not os.path.exists(self.venv_path):
            print("Copying environment from base...")
            os.system(f"cp -a {base_venv_path} {self.venv_path}")
        
        env_builder = venv.EnvBuilder(with_pip=True)
        self.venv_context = env_builder.ensure_directories(self.venv_path)
        return self.venv_context
    
    # Load existing or create new
    if not self.is_cleanup_venv:
        # Persistent environment
        activate_script = os.path.join(self.venv_path, "bin", "activate")
        if os.path.exists(self.venv_path) and os.path.exists(activate_script):
            self._print_venv_status("loading")
            env_builder = venv.EnvBuilder(with_pip=True)
            self.venv_context = env_builder.ensure_directories(self.venv_path)
        else:
            self._print_venv_setup_notice("persistent")
            self.venv_context = self._create_virtual_env(self.venv_path)
    else:
        # Temporary environment
        self._print_venv_setup_notice("temporary")
        self.venv_context = self._create_virtual_env(self.venv_path)
    
    return self.venv_context
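The load-or-create behavior for a persistent venv can be demonstrated with the standard library alone. `with_pip=False` keeps the sketch fast, and the `bin/activate` path assumes a POSIX platform:

```python
import os
import tempfile
import venv

def load_or_create_venv(venv_dir: str) -> str:
    """Return the path to a persistent venv, creating it on first use."""
    venv_path = os.path.join(venv_dir, "persistent_venv")
    activate_script = os.path.join(venv_path, "bin", "activate")
    if not (os.path.exists(venv_path) and os.path.exists(activate_script)):
        venv.EnvBuilder(with_pip=False).create(venv_path)
    return venv_path

with tempfile.TemporaryDirectory() as d:
    first = load_or_create_venv(d)   # first call creates the environment
    activate_exists = os.path.exists(os.path.join(first, "bin", "activate"))
    second = load_or_create_venv(d)  # second call reuses the existing one
```

The real implementation additionally supports copying from a pre-built base environment (`cp -a`) to avoid reinstalling dependencies from scratch.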

Dependency Installation

Environment Creation (base_code_explorer.py:131-188):
def _create_virtual_env(self, venv_path):
    """Create virtual environment and install basic dependencies"""
    
    # Create venv
    print("Creating virtual environment...")
    self.venv_context = create_virtual_env(venv_path)
    
    # Update pip
    activate_script = os.path.join(venv_path, "bin", "activate")
    activate_cmd = f"bash -c '. {activate_script} && "
    
    print("Updating pip to latest version...")
    subprocess.run(f"{activate_cmd} pip install -U pip'", shell=True)
    
    # Install LLM dependencies
    requirements_path = "configs/docker_src/llm_requirements.txt"
    
    if os.path.exists(requirements_path):
        print("Installing LLM dependencies...")
        subprocess.run(
            f"{activate_cmd} pip install -r {requirements_path}'",
            shell=True
        )
    else:
        print("Installing essential packages (backup method)...")
        subprocess.run(
            f"{activate_cmd} pip install numpy pandas'",
            shell=True
        )
    
    print("Setup completed successfully!")
    return self.venv_context
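The activation-plus-command pattern and the requirements-file fallback can be sketched as plain string builders. `in_venv` and `choose_install_command` are illustrative helpers, assuming a POSIX shell:

```python
import os

def choose_install_command(requirements_path: str) -> str:
    """Prefer the pinned requirements file; fall back to essential packages."""
    if os.path.exists(requirements_path):
        return f"pip install -r {requirements_path}"
    return "pip install numpy pandas"

def in_venv(venv_path: str, command: str) -> str:
    """Compose a shell command that runs inside an activated venv."""
    activate = os.path.join(venv_path, "bin", "activate")
    return f"bash -c '. {activate} && {command}'"

cmd = in_venv("/tmp/venv", choose_install_command("/definitely/missing.txt"))
```

The composed string would then be handed to `subprocess.run(cmd, shell=True)`, as in the excerpt above.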
Isolation:
  • Separate dependencies per task
  • No conflicts with system packages
  • Clean execution environment
Persistence:
  • Reuse environments across tasks
  • Fast startup after first creation
  • Copy from base environment for efficiency
Safety:
  • Controlled package installation
  • Automatic cleanup (optional)
  • Dependency version management

Code Execution Modes

Local Executor with venv

Configuration (agent_code_explore.py:158-164):
elif self.use_venv:
    local_executor = LocalCommandLineCodeExecutor(
        work_dir=self.work_dir,
        timeout=self.timeout,  # 2 hours
        virtual_env_context=self.venv_context
    )
    self.code_execution_config = {"executor": local_executor}

Docker Executor

Configuration (agent_code_explore.py:147-156):
if self.remote_repo_path and not self.use_venv:
    executor = EnhancedDockerCommandLineCodeExecutor(
        image="whc_docker",  # PyTorch + CUDA support
        timeout=self.timeout,
        work_dir=self.work_dir,
        keep_same_path=True,
        network_mode="host"
    )
    self.code_execution_config = {"executor": executor}

Best Practices

For Repository Analysis

Start with Structure

Use list_repository_structure() first to understand organization

Search Before Reading

Use search_keyword_include_code() to locate relevant files

Understand Dependencies

Use find_dependencies() to map import relationships

Trace References

Use find_references() to understand symbol usage

For Code Exploration

  1. Hierarchical Approach: Start broad (structure), then narrow (specific files/functions)
  2. Context Building: Use repository summaries and important modules
  3. Tool Combination: Combine multiple tools for comprehensive understanding
  4. Error Handling: Use issue_solution_search when encountering problems

For Performance

  • Enable virtual environments for isolation (use_venv=True)
  • Use persistent venvs for faster startup (is_cleanup_venv=False)
  • Set appropriate token limits for large repositories
  • Monitor conversation length and enable summarization

Next Steps

Architecture Overview

Return to architecture overview

Quickstart Guide

Start using RepoMaster
