
Overview

RepoMaster’s repository exploration system provides intelligent code analysis through hierarchical understanding and tool-assisted exploration. The system builds a comprehensive model of code structure, dependencies, and semantics.

Architecture

CodeExplorer Initialization

Setup Process

Constructor (agent_code_explore.py:25-75):
class CodeExplorer(BaseCodeExplorer):
    def __init__(
        self,
        local_repo_path: str,          # Repository path
        work_dir: str,                 # Working directory
        remote_repo_path: Optional[str] = None,  # Docker path
        llm_config: Optional[dict] = None,
        code_execution_config: Optional[dict] = None,
        task_type: Optional[str] = None,
        use_venv: bool = False,
        task_id: Optional[str] = None,
        is_cleanup_venv: bool = True,
        args: dict = {}
    ):
        # Store paths and task configuration for later setup steps
        self.local_repo_path = local_repo_path
        self.remote_repo_path = remote_repo_path
        self.task_type = task_type
        self.args = args
        
        # Call base class for venv management
        super().__init__(work_dir, use_venv, task_id, is_cleanup_venv)
        
        # Configuration
        self.llm_config = get_llm_config('code_explore') if llm_config is None else llm_config
        self.code_execution_config = {"work_dir": work_dir, "use_docker": False} \
                                     if code_execution_config is None else code_execution_config
        
        # Timeout: 2 hours
        self.code_execution_config['timeout'] = 2 * 60 * 60
        
        # Setup tool library and agents
        self._setup_tool_library()
        self._setup_agents()
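The default-substitution pattern in the constructor (fall back to defaults only when the caller passes `None`, then force the timeout) can be sketched in isolation. `resolve_exec_config` is an illustrative helper, not part of the RepoMaster API:

```python
from typing import Optional

def resolve_exec_config(work_dir: str, overrides: Optional[dict] = None) -> dict:
    """Apply caller overrides on top of defaults, then force the timeout.

    Mirrors the constructor: defaults apply only when the caller passes
    None, but the two-hour timeout is always enforced afterwards.
    """
    config = {"work_dir": work_dir, "use_docker": False} if overrides is None else dict(overrides)
    config["timeout"] = 2 * 60 * 60  # overrides any caller-supplied timeout
    return config

default_cfg = resolve_exec_config("/tmp/work")
custom_cfg = resolve_exec_config("/tmp/work", {"use_docker": True, "timeout": 60})
```

Note that the timeout is written after the merge, so even an explicit caller-supplied timeout is replaced, matching the constructor's behavior.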

Tool Library Initialization

Method (agent_code_explore.py:76-95):
def _setup_tool_library(self):
    """Setup tool library"""
    if self.local_repo_path:
        self.code_library = CodeExplorerTools(
            self.local_repo_path,
            work_dir=self.work_dir,
            docker_work_dir=self.docker_path_prefix
        )
    else:
        self.code_library = None
    
    # Generate important modules summary
    if self.local_repo_path and self.args.get("repo_init", True):
        self.code_importance = self.code_library.builder.generate_llm_important_modules(
            max_tokens=8000
        )
    else:
        self.code_importance = ""
Initialization Time: The tool library initialization includes tree-sitter parsing and important module detection, which can take 10-30 seconds for large repositories.

Repository Summary Generation

The generate_repository_summary function creates intelligent summaries of repositories (repo_summary.py:8-112).

Importance-Based Filtering

File Importance Analysis (repo_summary.py:26-67):
def judge_file_is_important(
    code_list: list[dict]
) -> list[dict]:
    """Judge whether files are important for understanding the repository"""
    
    judge_prompt = f"""
    You are an assistant that helps developers understand code repositories.
    Judge whether the current file is important.
    
    Rules:
    1. README.md with repository description - very important
    2. Configuration files, test files, example files - very important
    3. Files with information important for understanding - very important
    4. Duplicate file contents - keep only one
    
    Return JSON list (sorted by importance):
    [
        {{
            "file_path": "path",
            "is_important": "yes" or "no"
        }}
    ]
    """
    
    messages = [
        {"role": "system", "content": judge_prompt},
        {"role": "user", "content": json.dumps(code_list, ensure_ascii=False)}
    ]
    
    response_dict = AzureGPT4Chat().chat_with_message(
        messages,
        json_format=True
    )
    
    # Filter to only important files
    out_list = [
        file for judge_result in response_dict
        if judge_result['is_important'].lower() == 'yes'
        for file in code_list
        if judge_result['file_path'] == file['file_path']
    ]
    return out_list
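The join between the model's verdicts and the original file list can be reproduced with canned data. `filter_important_files` and the stubbed `verdicts` below are illustrative, standing in for the real `AzureGPT4Chat` response:

```python
def filter_important_files(code_list: list, verdicts: list) -> list:
    """Keep files the model marked important, preserving verdict order."""
    by_path = {f["file_path"]: f for f in code_list}
    return [
        by_path[v["file_path"]]
        for v in verdicts
        if v["is_important"].lower() == "yes" and v["file_path"] in by_path
    ]

files = [
    {"file_path": "README.md", "file_content": "# Demo"},
    {"file_path": "notes.txt", "file_content": "scratch"},
]
# Canned verdicts standing in for the model's JSON response
verdicts = [
    {"file_path": "README.md", "is_important": "yes"},
    {"file_path": "notes.txt", "is_important": "no"},
]
kept = filter_important_files(files, verdicts)
```

Indexing by path makes the join O(n) rather than the nested-loop O(n²) of the quoted comprehension, with the same result when file paths are unique.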

README Summary Extraction

Summary Generation (repo_summary.py:114-158):
def get_readme_summary(code_content: str, history_summary: dict):
    """Extract summary from README and important documentation"""
    
    system_prompt = """
    You are an assistant that helps developers understand code repositories.
    Generate summary based on README and documentation files.
    
    Rules:
    1. Focus on main functions, architecture, and usage
    2. Use <cite>referenced content</cite> for important code blocks
    3. Keep summary concise, comprehensive, and informative
    4. Include installation methods, dependencies, example usage
    5. Ignore disclaimers and unimportant content
    6. Avoid duplicating content from history_summary
    """
    
    prompt = f"""
    README and important documents:
    <code_content>
    {code_content}
    </code_content>
    
    Previous summaries:
    <history_summary>
    {history_summary}
    </history_summary>
    """
    
    response = AzureGPT4Chat().chat_with_message(
        [{"role": "system", "content": system_prompt},
         {"role": "user", "content": prompt}],
        json_format=True
    )
    return response

Token-Aware Processing

Smart Splitting (repo_summary.py:69-83, 85-112):
def split_code_lists(code_list: list[dict]):
    """Split code lists by token count (max 50000 tokens)"""
    max_token = 50000
    out_code_list = []
    split_code_list = []
    
    for file in code_list:
        if get_code_abs_token(str(file)) > max_token:
            continue  # Skip files too large
        
        split_code_list.append(file)
        if get_code_abs_token(json.dumps(split_code_list)) > max_token:
            out_code_list.append(split_code_list)
            split_code_list = []
    
    if split_code_list:
        out_code_list.append(split_code_list)
    
    return out_code_list
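The batching logic above can be exercised standalone. `approx_tokens` below is a rough stand-in for `get_code_abs_token` (about four characters per token), and the small budget is chosen just to force a split:

```python
import json

def approx_tokens(text: str) -> int:
    """Crude stand-in for get_code_abs_token: ~4 characters per token."""
    return len(text) // 4

def split_by_token_budget(code_list: list, max_tokens: int = 50) -> list:
    """Batch files so each batch stays near the budget; skip oversized files."""
    batches, current = [], []
    for f in code_list:
        if approx_tokens(str(f)) > max_tokens:
            continue  # a single file already exceeds the budget
        current.append(f)
        if approx_tokens(json.dumps(current)) > max_tokens:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

files = [{"file_path": f"f{i}.py", "file_content": "x" * 40} for i in range(4)]
batches = split_by_token_budget(files)
```

As in the original, a batch is emitted only after it crosses the budget, so each batch may slightly exceed the limit by one file.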

# Main summary logic
all_file_content = json.dumps(code_list, ensure_ascii=False)
if get_code_abs_token(all_file_content) < max_important_files_token:
    return code_list  # Small enough, return as-is

# Process in batches
important_files = []
for s_code_list in split_code_lists(code_list):
    important_files.extend(judge_file_is_important(s_code_list))

# Generate summaries for important files
repository_summary = {}
for file in important_files:
    summary = get_readme_summary(file['file_content'], repository_summary)
    if '<none>' not in str(summary).lower():
        if get_code_abs_token(json.dumps(repository_summary) + str(summary)) < max_important_files_token:
            repository_summary[file['file_path']] = summary
        else:
            break  # Token limit reached
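The accumulation loop above can be sketched with a stubbed summarizer. `accumulate_summaries`, the `summarize` callable, and the character-based budget are illustrative stand-ins for `get_readme_summary` and `get_code_abs_token`:

```python
import json

def accumulate_summaries(important_files: list, budget_tokens: int, summarize) -> dict:
    """Add per-file summaries until the running total would exceed the budget."""
    repository_summary = {}
    for f in important_files:
        summary = summarize(f["file_content"], repository_summary)
        if "<none>" in str(summary).lower():
            continue  # summarizer found nothing new for this file
        projected = len(json.dumps(repository_summary) + str(summary)) // 4
        if projected >= budget_tokens:
            break  # budget exhausted, stop summarizing
        repository_summary[f["file_path"]] = summary
    return repository_summary

files = [{"file_path": "README.md", "file_content": "A demo repo"}]
result = accumulate_summaries(files, budget_tokens=100,
                              summarize=lambda content, history: f"Summary: {content}")
```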

Analysis Tools

The CodeExplorerTools class provides comprehensive code analysis capabilities.

Tool Registration

Registered Tools (agent_code_explore.py:235-255):
def _register_tools(self):
    """Register tool functions to the executor agent"""
    register_toolkits(
        [
            self.code_library.list_repository_structure,
            self.code_library.search_keyword_include_code,
            self.code_library.view_class_details,
            self.code_library.view_function_details,
            self.code_library.find_references,
            self.code_library.find_dependencies,
            self.code_library.view_file_content,
            self.issue_solution_search,
            FileEditTool.edit,
        ],
        self.explore,     # Assistant agent
        self.executor,    # User proxy agent
    )
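A minimal sketch of what tool registration provides, name-based dispatch to Python callables, assuming a simple dict registry rather than the actual `register_toolkits` machinery:

```python
from typing import Callable, Dict

class ToolRegistry:
    """Illustrative stand-in for register_toolkits: name -> callable dispatch."""
    def __init__(self):
        self._tools: Dict[str, Callable] = {}

    def register(self, *tools: Callable) -> None:
        for tool in tools:
            self._tools[tool.__name__] = tool

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](*args, **kwargs)

# Hypothetical stand-ins for two of the registered tools
def list_repository_structure():
    return ["src/", "README.md"]

def view_file_content(file_path: str):
    return f"contents of {file_path}"

registry = ToolRegistry()
registry.register(list_repository_structure, view_file_content)
```

In the real system the executor agent resolves the model's tool-call requests to these callables in much the same way.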

Core Analysis Tools

Structure Listing

list_repository_structure() - Get directory tree and file organization

Keyword Search

search_keyword_include_code(keyword) - Find keyword occurrences in code

Class Details

view_class_details(class_name) - Inspect class structure and methods

Function Details

view_function_details(function_name) - View function signatures and implementations

Reference Finding

find_references(symbol) - Find where symbols are used

Dependency Analysis

find_dependencies(module) - Analyze import relationships

File Viewing

view_file_content(file_path) - Read file contents

Issue Solutions

issue_solution_search(issue) - Find solutions for programming issues

Web-Based Problem Solving (agent_code_explore.py:209-233):
async def issue_solution_search(
    self, 
    issue_description: Annotated[str, "Description of programming issue"]
) -> str:
    """
    Search for solutions to specific programming issues or errors.
    
    Sources: GitHub, Stack Overflow, official docs, forums
    
    Returns:
        String containing solution information with:
        - Concise summary
        - Source URL
        - Source name (e.g., "Stack Overflow", "GitHub Issue")
    """
    query = f"""
    Please search for solutions to the following programming issue:
    <issue_description>
    {issue_description}
    </issue_description>
    
    Steps:
    1. Search the web for solutions, code snippets, or discussions
    2. Prioritize well-explained, reputable, highly-rated solutions
    3. Select up to 3 most relevant solutions
    4. For each solution, provide summary, source URL, source name
    5. Present findings as clear, readable text (use markdown)
    
    If no relevant solutions found, indicate that.
    """
    return await self.issue_searcher.a_web_agent_answer(query)
Smart Search: The issue solution search automatically queries multiple sources and ranks results by relevance and reputation.

Conversation Flow

Analysis Execution

Main Analysis Method (agent_code_explore.py:257-317):
async def analyze_code(self, task: str, max_turns: int = 40) -> str:
    """
    Analyze code repository and complete specific tasks
    
    Args:
        task: User's programming task description
        max_turns: Maximum conversation turns
        
    Returns:
        Analysis results and implementation plan
    """
    # Reset tool call count
    self.task = task
    self.current_tool_call_count = 0
    
    # Set initial message based on task type
    if self.task_type == "general":
        initial_message = CODE_ASSISTANT_PROMPT.format(
            task=task, work_dir=self.work_dir
        )
    else:
        initial_message = USER_EXPLORER_PROMPT.format(
            task=task,
            work_dir=self.work_dir,
            remote_repo_path=self.remote_repo_path,
            code_importance=self.code_importance
        )
    
    # Handle restart with history summary
    history_message_list = []
    if self.is_restart and self.restart_count < 2:
        history_message_list = self.executor.chat_messages.get(self.explore, [])
        initial_message = self.summary_chat_history(task, history_message_list)
        self.restart_count += 1
        self.is_restart = False
        history_message_list = json.loads(initial_message)
    
    # Start conversation
    chat_result = await self.executor.a_initiate_chat(
        self.explore,
        message=initial_message,
        max_turns=max_turns,
        history_message_load=history_message_list
    )
    
    # Handle restart if needed
    if self.is_restart and self.restart_count < 2:
        return await self.analyze_code(task, max_turns)
    
    # Extract final result
    messages = chat_result.chat_history
    final_answer = chat_result.summary.strip()
    
    return final_answer
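The restart handling in `analyze_code` (summarize the history, then retry, at most twice) reduces to a bounded-retry loop. The sketch below uses stub callables in place of the real agent conversation:

```python
def run_with_restarts(run_conversation, summarize_history, max_restarts: int = 2):
    """Re-run a conversation with a summarized history when it overflows.

    run_conversation(history) -> (result, overflowed)
    summarize_history(history) -> compressed history for the next attempt
    """
    history = []
    result = None
    for _ in range(max_restarts + 1):
        result, overflowed = run_conversation(history)
        if not overflowed:
            return result
        history = summarize_history(history + [result])
    return result  # give up after max_restarts, return the last attempt

# Stub conversation: overflow on the first attempt only
attempts = []
def fake_run(history):
    attempts.append(len(history))
    return ("answer", len(attempts) == 1)

final = run_with_restarts(fake_run, summarize_history=lambda h: ["summary"])
```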

Token Limit Management

Termination Logic (agent_code_explore.py:97-142):
def token_limit_termination(self, msg):
    """Check if token limit is reached to decide termination"""
    
    def check_tool_call(msg):
        if msg.get("tool_calls", []):
            return True
        if msg.get("tool_response", []):
            return True
        return False
    
    if msg is None:
        return False
    
    content = msg.get("content", "")
    if isinstance(content, str):
        content = content.strip()
    
    # Check original termination conditions
    original_termination = (
        content and 
        (len(content.split("TERMINATE")[-1]) < 3 or 
         len(content.split("<TERMINATE>")[-1]) < 2)
    )
    
    if (not check_tool_call(msg)) and (not content):
        return True
    
    # Terminate if original conditions met
    if (original_termination and 
        check_code_block(content) is None and
        not check_tool_call(msg)):
        self.is_restart = False
        return True
    
    # Calculate total token count
    messages = self.executor.chat_messages.get(self.explore, [])
    total_tokens = sum(
        get_code_abs_token(str(m.get("content", "")))
        for m in messages
    )
    
    # If over limit (80000 tokens), trigger restart
    if total_tokens > self.limit_restart_tokens:
        self.is_restart = True
        self.chat_turns += len(messages) - 1
        return True
    
    return False
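The token-budget check at the end of the method can be isolated into a small predicate. `approx_tokens` is a rough character-count stand-in for `get_code_abs_token`:

```python
def approx_tokens(text) -> int:
    """Rough stand-in for get_code_abs_token: ~4 characters per token."""
    return len(str(text)) // 4

def should_restart(messages: list, limit_tokens: int) -> bool:
    """True when the accumulated message content exceeds the token budget."""
    total = sum(approx_tokens(m.get("content", "")) for m in messages)
    return total > limit_tokens

short_msgs = [{"content": "ok"}] * 3
long_msgs = [{"content": "x" * 400}] * 100  # roughly 10000 tokens
```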
Token Limits:
  • Token limit for summary: 2000 tokens (agent_code_explore.py:54)
  • Restart token limit: 80000 tokens (agent_code_explore.py:55)
  • Maximum important modules: 8000 tokens (agent_code_explore.py:91)

Task-Specific Prompts

General Code Assistant

Enhanced Task Description (agent_scheduler.py:311-329):
enhanced_task = f"""
You are a general programming assistant. Please help with the following task:

{task_description}

As a programming assistant, you can:
- Write and execute code to solve problems
- Provide programming guidance and explanations  
- Create practical examples and demonstrations
- Debug and troubleshoot issues
- Implement algorithms and data structures
- Explain programming concepts
- Create utility scripts and tools

Working directory: {work_dir}

Please provide comprehensive help including code examples, 
explanations, and practical solutions.
"""

Repository Exploration

Task Prompt Components (git_task.py:171-191):
task_prompt = f"""
### Task Description
{task_description}

#### Repository Path (Absolute Path): 
Understanding Guide: ['Read README.md to understand basic functionality']

#### File Paths
- Input file paths and descriptions:
{input_data}

- Output file directory: 
Results must be saved in the {output_dir_path} directory.

#### Additional Notes
**Core Objective**: Quickly understand and analyze the code repository, 
generate and execute necessary code to efficiently complete user tasks.
"""

Virtual Environment Management

Inherited from BaseCodeExplorer for safe code execution.

Environment Loading

Load or Create venv (base_code_explorer.py:30-95):
def _load_venv_context(
    self,
    venv_dir=None,
    is_clear_venv=None,
    base_venv_path=None
):
    """Load virtual environment, create if doesn't exist"""
    
    # Determine venv path
    if venv_dir is None:
        default_venvs_dir = './.venvs'
        self.venv_path = os.path.join(default_venvs_dir, "persistent_venv")
    else:
        self.venv_path = os.path.join(venv_dir, "persistent_venv")
    
    # Copy from base environment if available
    if base_venv_path and os.path.exists(base_venv_path):
        if not os.path.exists(self.venv_path):
            print("Copying environment from base...")
            os.system(f"cp -a {base_venv_path} {self.venv_path}")
        
        env_builder = venv.EnvBuilder(with_pip=True)
        self.venv_context = env_builder.ensure_directories(self.venv_path)
        return self.venv_context
    
    # Load existing or create new
    if not self.is_cleanup_venv:
        # Persistent environment
        activate_script = os.path.join(self.venv_path, "bin", "activate")
        if os.path.exists(self.venv_path) and os.path.exists(activate_script):
            self._print_venv_status("loading")
            env_builder = venv.EnvBuilder(with_pip=True)
            self.venv_context = env_builder.ensure_directories(self.venv_path)
        else:
            self._print_venv_setup_notice("persistent")
            self.venv_context = self._create_virtual_env(self.venv_path)
    else:
        # Temporary environment
        self._print_venv_setup_notice("temporary")
        self.venv_context = self._create_virtual_env(self.venv_path)
    
    return self.venv_context
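The load-or-create behavior for a persistent venv can be demonstrated with the standard library alone. `with_pip=False` keeps the sketch fast, and the `bin/activate` path assumes a POSIX platform:

```python
import os
import tempfile
import venv

def load_or_create_venv(venv_dir: str) -> str:
    """Return the path to a persistent venv, creating it on first use."""
    venv_path = os.path.join(venv_dir, "persistent_venv")
    activate_script = os.path.join(venv_path, "bin", "activate")
    if not (os.path.exists(venv_path) and os.path.exists(activate_script)):
        venv.EnvBuilder(with_pip=False).create(venv_path)
    return venv_path

with tempfile.TemporaryDirectory() as d:
    first = load_or_create_venv(d)   # first call creates the environment
    activate_exists = os.path.exists(os.path.join(first, "bin", "activate"))
    second = load_or_create_venv(d)  # second call reuses the existing one
```

The real implementation additionally supports copying from a pre-built base environment (`cp -a`) to avoid reinstalling dependencies from scratch.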

Dependency Installation

Environment Creation (base_code_explorer.py:131-188):
def _create_virtual_env(self, venv_path):
    """Create virtual environment and install basic dependencies"""
    
    # Create venv
    print("Creating virtual environment...")
    self.venv_context = create_virtual_env(venv_path)
    
    # Update pip
    activate_script = os.path.join(venv_path, "bin", "activate")
    activate_cmd = f"bash -c '. {activate_script} && "
    
    print("Updating pip to latest version...")
    subprocess.run(f"{activate_cmd} pip install -U pip'", shell=True)
    
    # Install LLM dependencies
    requirements_path = "configs/docker_src/llm_requirements.txt"
    
    if os.path.exists(requirements_path):
        print("Installing LLM dependencies...")
        subprocess.run(
            f"{activate_cmd} pip install -r {requirements_path}'",
            shell=True
        )
    else:
        print("Installing essential packages (backup method)...")
        subprocess.run(
            f"{activate_cmd} pip install numpy pandas'",
            shell=True
        )
    
    print("Setup completed successfully!")
    return self.venv_context
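The activation-plus-command pattern and the requirements-file fallback can be sketched as plain string builders. `in_venv` and `choose_install_command` are illustrative helpers, assuming a POSIX shell:

```python
import os

def choose_install_command(requirements_path: str) -> str:
    """Prefer the pinned requirements file; fall back to essential packages."""
    if os.path.exists(requirements_path):
        return f"pip install -r {requirements_path}"
    return "pip install numpy pandas"

def in_venv(venv_path: str, command: str) -> str:
    """Compose a shell command that runs inside an activated venv."""
    activate = os.path.join(venv_path, "bin", "activate")
    return f"bash -c '. {activate} && {command}'"

cmd = in_venv("/tmp/venv", choose_install_command("/definitely/missing.txt"))
```

The composed string would then be handed to `subprocess.run(cmd, shell=True)`, as in the excerpt above.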
Isolation:
  • Separate dependencies per task
  • No conflicts with system packages
  • Clean execution environment
Persistence:
  • Reuse environments across tasks
  • Fast startup after first creation
  • Copy from base environment for efficiency
Safety:
  • Controlled package installation
  • Automatic cleanup (optional)
  • Dependency version management

Code Execution Modes

Local Executor with venv

Configuration (agent_code_explore.py:158-164):
elif self.use_venv:
    local_executor = LocalCommandLineCodeExecutor(
        work_dir=self.work_dir,
        timeout=self.timeout,  # 2 hours
        virtual_env_context=self.venv_context
    )
    self.code_execution_config = {"executor": local_executor}

Docker Executor

Configuration (agent_code_explore.py:147-156):
if self.remote_repo_path and not self.use_venv:
    executor = EnhancedDockerCommandLineCodeExecutor(
        image="whc_docker",  # PyTorch + CUDA support
        timeout=self.timeout,
        work_dir=self.work_dir,
        keep_same_path=True,
        network_mode="host"
    )
    self.code_execution_config = {"executor": executor}

Best Practices

For Repository Analysis

Start with Structure

Use list_repository_structure() first to understand organization

Search Before Reading

Use search_keyword_include_code() to locate relevant files

Understand Dependencies

Use find_dependencies() to map import relationships

Trace References

Use find_references() to understand symbol usage

For Code Exploration

  1. Hierarchical Approach: Start broad (structure), then narrow (specific files/functions)
  2. Context Building: Use repository summaries and important modules
  3. Tool Combination: Combine multiple tools for comprehensive understanding
  4. Error Handling: Use issue_solution_search when encountering problems

For Performance

  • Enable virtual environments for isolation (use_venv=True)
  • Use persistent venvs for faster startup (is_cleanup_venv=False)
  • Set appropriate token limits for large repositories
  • Monitor conversation length and enable summarization

Next Steps

Architecture Overview

Return to architecture overview

Quickstart Guide

Start using RepoMaster
