Overview

Data processing involves transforming, cleaning, analyzing, and converting data between formats. RepoMaster can automatically find and utilize the right tools from GitHub to handle various data processing tasks without manual coding.

Common Data Processing Tasks

RepoMaster can help with:

Format Conversion

Convert between CSV, JSON, Excel, XML, Parquet, and more

Data Cleaning

Remove duplicates, handle missing values, standardize formats

Data Extraction

Extract specific fields, filter rows, merge datasets

Data Analysis

Generate statistics, summaries, and visualizations

How It Works

Start the unified assistant, then simply describe your data processing need:
python launcher.py --mode backend --backend-mode unified
Example User Input:
Extract tables from PDF reports and convert to structured CSV format
RepoMaster will:
  1. Search for PDF table extraction tools
  2. Find CSV conversion libraries
  3. Chain the tools together
  4. Process your files
  5. Deliver clean, structured output

Use Case Examples

PDF Table Extraction

Task (from USAGE.md):
Extract tables from PDF reports and convert to structured CSV format
1. Repository Discovery

RepoMaster searches for:
  • PDF parsing libraries (pdfplumber, tabula-py, camelot)
  • Table extraction tools
  • CSV conversion utilities
2. Tool Selection

Evaluates repositories based on:
  • Support for complex table structures
  • Accuracy of extraction
  • Output format flexibility
  • Community adoption and maintenance
3. Pipeline Implementation

  • Loads PDF file
  • Detects table boundaries
  • Extracts table data with cell alignment
  • Handles merged cells and complex layouts
  • Converts to clean CSV format
4. Error Handling

Automatically handles:
  • Multi-page tables
  • Rotated or skewed tables
  • Mixed text and table content
  • Various PDF encodings
Expected Output:
Quarter,Revenue,Profit,Growth
Q1 2024,$1.2M,$340K,15%
Q2 2024,$1.5M,$425K,25%
Q3 2024,$1.8M,$520K,20%
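The final conversion step amounts to writing extracted table rows out as CSV with standardized headers. A minimal stdlib sketch (the `table` rows here are placeholder data; a real run would get them from an extraction library such as camelot-py, whose API differs):

```python
import csv
import io

def rows_to_csv(rows):
    """Write a list of table rows to CSV text, normalizing the header row."""
    header, *body = rows
    # Standardize header names: lowercase, underscores instead of spaces
    header = [h.strip().lower().replace(" ", "_") for h in header]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(body)
    return buf.getvalue()

# Hypothetical extracted table (stand-in for an extraction library's output)
table = [
    ["Quarter", "Revenue", "Profit", "Growth"],
    ["Q1 2024", "$1.2M", "$340K", "15%"],
    ["Q2 2024", "$1.5M", "$425K", "25%"],
]
print(rows_to_csv(table))
```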

Excel to JSON Conversion

Task:
Convert this Excel file with multiple sheets to JSON format, preserving sheet structure
What RepoMaster Does:
  • Finds Excel processing libraries (openpyxl, pandas)
  • Reads all sheets from workbook
  • Preserves data types and relationships
  • Outputs nested JSON structure
Output:
{
  "customers": [
    {"id": 1, "name": "Acme Corp", "revenue": 125000},
    {"id": 2, "name": "TechStart", "revenue": 89000}
  ],
  "products": [
    {"sku": "PRD-001", "name": "Widget A", "price": 49.99},
    {"sku": "PRD-002", "name": "Widget B", "price": 79.99}
  ]
}
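The sheet-preserving conversion above can be sketched with the stdlib alone, assuming the sheets have already been read into memory (in a real run, via a library such as openpyxl or pandas):

```python
import json

def sheets_to_json(sheets):
    """Convert {sheet_name: [header_row, *data_rows]} into nested JSON."""
    result = {}
    for name, rows in sheets.items():
        header, *body = rows
        # One JSON object per data row, keyed by the sheet's header
        result[name] = [dict(zip(header, row)) for row in body]
    return json.dumps(result, indent=2)

# Hypothetical workbook contents (stand-in for rows read via openpyxl)
sheets = {
    "customers": [
        ["id", "name", "revenue"],
        [1, "Acme Corp", 125000],
        [2, "TechStart", 89000],
    ],
}
print(sheets_to_json(sheets))
```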

CSV Data Cleaning

Task:
Clean this CSV file: remove duplicates, fill missing values,
standardize date formats, and fix inconsistent column names
Processing Steps:
  • Remove duplicate rows based on key columns
  • Handle missing values (forward fill, mean/median, or drop)
  • Convert dates to ISO format (YYYY-MM-DD)
  • Standardize column names (lowercase, underscores)
  • Remove special characters from text fields
  • Validate and clean email addresses, phone numbers
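A few of the steps above (deduplication, column-name standardization, ISO date conversion) can be sketched in plain Python. The `signup_date` column and the MM/DD/YYYY input format are assumptions for illustration; the generated pipeline would adapt to your actual data:

```python
import csv
import io
from datetime import datetime

def clean_csv(text):
    """Dedupe rows, standardize column names, and convert dates to ISO format."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
    seen, cleaned = set(), []
    for row in reader:
        key = tuple(row)
        if key in seen:  # drop exact duplicate rows
            continue
        seen.add(key)
        cleaned.append(row)
    date_col = header.index("signup_date")  # hypothetical date column
    for row in cleaned:
        # Normalize US-style dates (MM/DD/YYYY) to ISO (YYYY-MM-DD)
        row[date_col] = datetime.strptime(row[date_col], "%m/%d/%Y").date().isoformat()
    return header, cleaned

raw = "ID,Signup Date\n1,03/15/2024\n1,03/15/2024\n2,11/02/2023\n"
header, rows = clean_csv(raw)
```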

Log File Analysis

Task:
Parse application log files and extract error messages,
timestamps, and user IDs into a structured format
What RepoMaster Does:
  • Finds log parsing tools (logparser, python-log-parser)
  • Defines extraction patterns for your log format
  • Extracts structured fields
  • Aggregates statistics (error counts, user activity)
  • Outputs to CSV or database
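Structured extraction from logs usually comes down to a regular expression with named groups. A minimal sketch — the log line format and `user=` field here are invented for illustration, and the pattern would be adapted to your application's actual format:

```python
import re

# Hypothetical log line format: "YYYY-MM-DD HH:MM:SS LEVEL user=ID message"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) user=(?P<user_id>\w+) (?P<message>.*)"
)

def parse_log(lines):
    """Extract structured records for error-level lines only."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("level") == "ERROR":
            records.append(m.groupdict())
    return records

logs = [
    "2024-06-01 12:00:03 INFO user=u42 login ok",
    "2024-06-01 12:01:17 ERROR user=u42 payment declined",
]
errors = parse_log(logs)
```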

Format Conversion Examples

CSV to Excel:
Convert sales_data.csv to Excel with formatting and charts
Excel to CSV:
Extract the 'Orders' sheet from workbook.xlsx to CSV

Advanced Data Processing

Data Merging & Joining

Task:
Merge customers.csv and orders.csv on customer_id,
perform left join and save as combined_data.xlsx
Features:
  • Inner, outer, left, right joins
  • Multiple key columns
  • Handle missing keys gracefully
  • Aggregate data during merge
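A generated pipeline would normally delegate this to pandas (`DataFrame.merge`), but the left-join semantics can be sketched in plain Python — unmatched left rows are kept and right-side fields filled with None:

```python
def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on a shared key column."""
    index, right_fields = {}, set()
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
        right_fields.update(k for k in row if k != key)
    joined = []
    for row in left_rows:
        matches = index.get(row[key])
        if matches:
            for m in matches:
                joined.append({**row, **{k: v for k, v in m.items() if k != key}})
        else:
            # Keep the left row, filling right-side fields with None
            joined.append({**row, **{k: None for k in right_fields}})
    return joined

# Hypothetical customers/orders data matching the task above
customers = [{"customer_id": 1, "name": "Acme"}, {"customer_id": 2, "name": "TechStart"}]
orders = [{"customer_id": 1, "total": 250}]
combined = left_join(customers, orders, "customer_id")
```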

Data Filtering & Transformation

Task:
Filter sales data for Q4 2024, calculate monthly totals,
and create a summary with growth percentages
Capabilities:
  • Complex filtering conditions
  • Aggregation functions (sum, avg, count, min, max)
  • Calculated columns and derived metrics
  • Grouping and pivoting
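The filter-aggregate-derive pattern for the task above can be sketched as follows (the sales records and month encoding are invented for illustration):

```python
from collections import defaultdict

def monthly_summary(sales, months):
    """Sum revenue per month for the given months, plus month-over-month growth %."""
    totals = defaultdict(float)
    for row in sales:
        if row["month"] in months:                  # filtering condition
            totals[row["month"]] += row["revenue"]  # aggregation
    summary, prev = [], None
    for month in months:
        total = totals[month]
        # Derived metric: growth relative to the previous month
        growth = None if prev in (None, 0) else round((total - prev) / prev * 100, 1)
        summary.append({"month": month, "total": total, "growth_pct": growth})
        prev = total
    return summary

# Hypothetical sales records
sales = [
    {"month": "2024-10", "revenue": 100.0},
    {"month": "2024-11", "revenue": 125.0},
    {"month": "2024-09", "revenue": 999.0},  # outside Q4, filtered out
]
q4 = monthly_summary(sales, ["2024-10", "2024-11", "2024-12"])
```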

Batch Processing

Task:
Process all CSV files in the data/ folder:
clean, standardize, and merge into master_dataset.csv
Benefits:
  • Process multiple files in one go
  • Consistent transformations across files
  • Error handling per file
  • Progress tracking and logging
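The merge step of such a batch job can be sketched in the stdlib. For clarity the function takes CSV contents directly; in a real run the texts would come from something like `Path("data/").glob("*.csv")`:

```python
import csv
import io

def merge_csv_texts(csv_texts):
    """Standardize headers across multiple CSV files and merge rows into one table."""
    merged_rows, merged_header = [], None
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        # Consistent transformation applied to every file's header
        header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
        if merged_header is None:
            merged_header = header
        elif header != merged_header:
            raise ValueError(f"inconsistent columns: {header!r}")  # per-file error handling
        merged_rows.extend(reader)
    return merged_header, merged_rows

# Two hypothetical files whose headers differ only in casing
files = ["ID,Amount\n1,10\n", "id,amount\n2,20\n"]
header, rows = merge_csv_texts(files)
```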

Data Validation & Quality

1. Schema Validation

Validate that this CSV has required columns: id, name, email, date
and check data types match expected schema
2. Constraint Checking

Verify all email addresses are valid format,
dates are within range 2020-2024,
and prices are positive numbers
3. Duplicate Detection

Find and report duplicate records based on email field,
show counts and examples
4. Completeness Analysis

Generate report showing percentage of missing values per column
and identify rows with multiple missing fields
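Schema and constraint checks like these reduce to per-row rules. A minimal sketch — the `REQUIRED` schema and the (deliberately simple) email pattern are assumptions for illustration, not what a generated pipeline would necessarily use:

```python
import re

REQUIRED = {"id": int, "name": str, "email": str}  # hypothetical schema
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplistic format check

def validate(records):
    """Return a list of (row_index, problem) pairs for schema/constraint violations."""
    problems = []
    for i, rec in enumerate(records):
        for col, typ in REQUIRED.items():
            if col not in rec:
                problems.append((i, f"missing column {col}"))
            elif not isinstance(rec[col], typ):
                problems.append((i, f"{col} has wrong type"))
        email = rec.get("email")
        if isinstance(email, str) and not EMAIL_RE.match(email):
            problems.append((i, "invalid email"))
    return problems

records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": "2", "name": "Bob", "email": "not-an-email"},
]
issues = validate(records)
```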

Real Execution Example

🌟 Unified Assistant started!
============================================================
📋 Task: Extract tables from PDF and convert to CSV
🔧 Analyzing task...

🔍 Identifying task type: Data Processing + PDF Extraction
✅ Routing to Repository Agent

📦 Searching for PDF table extraction tools...
   Found: camelot-py, tabula-py, pdfplumber, pdfminer

📊 Evaluating repositories...
✅ Selected: camelot-py
   Reason: Best table detection accuracy + lattice/stream modes

📦 Cloning repository...
✅ Repository cloned successfully

🔍 Analyzing PDF structure...
   Pages: 15
   Tables detected: 8 tables across 6 pages

⚙️  Extracting tables...
   Page 3: Table 1 - 5 columns × 12 rows
   Page 5: Table 2 - 3 columns × 8 rows
   Page 7: Table 3 - 7 columns × 24 rows
   ...
✅ Extracted 8 tables successfully

🔄 Converting to CSV format...
   Handling merged cells...
   Standardizing column names...
   Cleaning cell values...
✅ Conversion complete

💾 Saving results...
✅ Saved to:
   - coding/table_1.csv
   - coding/table_2.csv
   - coding/table_3.csv
   - coding/tables_combined.csv

📊 Summary:
   Total tables: 8
   Total rows: 247
   Total columns: 38 unique fields
   
✨ Task completed successfully!

Output Options

Specify your preferred output format in your request, for example:
Save results as CSV with UTF-8 encoding

Data Analysis Features

Statistical Summary

Task:
Generate statistical summary of sales_data.csv:
mean, median, std dev, min, max for all numeric columns
Output:
Sales Data Summary
==================
Total Records: 1,247

Revenue:
  Mean: $45,234
  Median: $38,500
  Std Dev: $12,450
  Min: $5,200
  Max: $125,000

Units Sold:
  Mean: 342
  Median: 298
  Std Dev: 156
  Min: 12
  Max: 2,450
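The per-column statistics above map directly onto Python's `statistics` module. A sketch with hypothetical sample values (a real run would read the column from the CSV):

```python
import statistics

def summarize(values):
    """Compute mean, median, std dev, min, and max for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

revenue = [5200, 38500, 45234, 125000]  # hypothetical sample
stats = summarize(revenue)
```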

Data Visualization

Task:
Create bar chart showing monthly revenue from sales.csv
and save as PNG
RepoMaster finds visualization libraries (matplotlib, plotly) and generates charts automatically.

Best Practices

Always work on copies:
Process customer_data.csv and save cleaned version as customer_data_clean.csv

Validate after processing:
Clean the data and then validate: check for missing values,
verify row count, and show sample of results

Keep an audit trail:
Process sales data: remove duplicates, fill missing values with mean,
convert dates to ISO format, and generate log of all changes

Chunk large files:
Process this 5GB CSV file in chunks of 100,000 rows
to avoid memory issues
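Chunked processing can be sketched with the stdlib alone: stream the CSV reader in fixed-size slices so the whole file never sits in memory. The demo below uses a tiny in-memory file and chunks of 2 rows; a real run would open the 5 GB file and use a chunk size like 100,000:

```python
import csv
import io
from itertools import islice

def process_in_chunks(fileobj, chunk_size, handle):
    """Stream a CSV in fixed-size row chunks, calling handle(header, chunk) per chunk."""
    reader = csv.reader(fileobj)
    header = next(reader)
    while True:
        chunk = list(islice(reader, chunk_size))  # pulls at most chunk_size rows
        if not chunk:
            break
        handle(header, chunk)

# Demo: 5 rows in chunks of 2
chunks_seen = []
data = io.StringIO("id,val\n1,a\n2,b\n3,c\n4,d\n5,e\n")
process_in_chunks(data, 2, lambda h, c: chunks_seen.append(len(c)))
```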

Common Challenges Solved

Encoding Issues

Automatically detects and handles various file encodings (UTF-8, Latin-1, etc.)

Inconsistent Formats

Standardizes date formats, number representations, and text casing

Missing Data

Smart imputation strategies based on data type and distribution

Large Files

Chunk-based processing for files too large for memory

Integration with Other Use Cases

Data processing often combines with other RepoMaster capabilities:
  • Web Scraping → Data Processing: Scrape data, then clean and structure it
  • Data Processing → AI/ML: Prepare data for model training
  • PDF Extraction → Data Processing: Extract tables, then analyze and visualize

Next Steps

AI/ML Tasks

Use processed data for machine learning

Web Scraping

Collect data to process

Repository Agent

Learn how data tools are discovered

Programming Assistant

Custom data processing scripts
