Overview

Data processing involves transforming, cleaning, analyzing, and converting data between formats. RepoMaster can automatically find and utilize the right tools from GitHub to handle various data processing tasks without manual coding.

Common Data Processing Tasks

RepoMaster can help with:

Format Conversion

Convert between CSV, JSON, Excel, XML, Parquet, and more

Data Cleaning

Remove duplicates, handle missing values, standardize formats

Data Extraction

Extract specific fields, filter rows, merge datasets

Data Analysis

Generate statistics, summaries, and visualizations

How It Works

Start the unified assistant, then simply describe your data processing need:
python launcher.py --mode backend --backend-mode unified
Example User Input:
Extract tables from PDF reports and convert to structured CSV format
RepoMaster will:
  1. Search for PDF table extraction tools
  2. Find CSV conversion libraries
  3. Chain the tools together
  4. Process your files
  5. Deliver clean, structured output

Use Case Examples

PDF Table Extraction

Task (from USAGE.md):
Extract tables from PDF reports and convert to structured CSV format
1. Repository Discovery

RepoMaster searches for:
  • PDF parsing libraries (pdfplumber, tabula-py, camelot)
  • Table extraction tools
  • CSV conversion utilities
2. Tool Selection

Evaluates repositories based on:
  • Support for complex table structures
  • Accuracy of extraction
  • Output format flexibility
  • Community adoption and maintenance
3. Pipeline Implementation

  • Loads PDF file
  • Detects table boundaries
  • Extracts table data with cell alignment
  • Handles merged cells and complex layouts
  • Converts to clean CSV format
4. Error Handling

Automatically handles:
  • Multi-page tables
  • Rotated or skewed tables
  • Mixed text and table content
  • Various PDF encodings
Expected Output:
Quarter,Revenue,Profit,Growth
Q1 2024,$1.2M,$340K,15%
Q2 2024,$1.5M,$425K,25%
Q3 2024,$1.8M,$520K,20%
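The final conversion step amounts to writing extracted table rows out as CSV with standardized headers. A minimal stdlib sketch (the `table` rows here are placeholder data; a real run would get them from an extraction library such as camelot-py, whose API differs):

```python
import csv
import io

def rows_to_csv(rows):
    """Write a list of table rows to CSV text, normalizing the header row."""
    header, *body = rows
    # Standardize header names: lowercase, underscores instead of spaces
    header = [h.strip().lower().replace(" ", "_") for h in header]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(body)
    return buf.getvalue()

# Hypothetical extracted table (stand-in for an extraction library's output)
table = [
    ["Quarter", "Revenue", "Profit", "Growth"],
    ["Q1 2024", "$1.2M", "$340K", "15%"],
    ["Q2 2024", "$1.5M", "$425K", "25%"],
]
print(rows_to_csv(table))
```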

Excel to JSON Conversion

Task:
Convert this Excel file with multiple sheets to JSON format, preserving sheet structure
What RepoMaster Does:
  • Finds Excel processing libraries (openpyxl, pandas)
  • Reads all sheets from workbook
  • Preserves data types and relationships
  • Outputs nested JSON structure
Output:
{
  "customers": [
    {"id": 1, "name": "Acme Corp", "revenue": 125000},
    {"id": 2, "name": "TechStart", "revenue": 89000}
  ],
  "products": [
    {"sku": "PRD-001", "name": "Widget A", "price": 49.99},
    {"sku": "PRD-002", "name": "Widget B", "price": 79.99}
  ]
}
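The sheet-preserving conversion above can be sketched with the stdlib alone, assuming the sheets have already been read into memory (in a real run, via a library such as openpyxl or pandas):

```python
import json

def sheets_to_json(sheets):
    """Convert {sheet_name: [header_row, *data_rows]} into nested JSON."""
    result = {}
    for name, rows in sheets.items():
        header, *body = rows
        # One JSON object per data row, keyed by the sheet's header
        result[name] = [dict(zip(header, row)) for row in body]
    return json.dumps(result, indent=2)

# Hypothetical workbook contents (stand-in for rows read via openpyxl)
sheets = {
    "customers": [
        ["id", "name", "revenue"],
        [1, "Acme Corp", 125000],
        [2, "TechStart", 89000],
    ],
}
print(sheets_to_json(sheets))
```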

CSV Data Cleaning

Task:
Clean this CSV file: remove duplicates, fill missing values,
standardize date formats, and fix inconsistent column names
Processing Steps:
  • Remove duplicate rows based on key columns
  • Handle missing values (forward fill, mean/median, or drop)
  • Convert dates to ISO format (YYYY-MM-DD)
  • Standardize column names (lowercase, underscores)
  • Remove special characters from text fields
  • Validate and clean email addresses, phone numbers
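A few of the steps above (deduplication, column-name standardization, ISO date conversion) can be sketched in plain Python. The `signup_date` column and the MM/DD/YYYY input format are assumptions for illustration; the generated pipeline would adapt to your actual data:

```python
import csv
import io
from datetime import datetime

def clean_csv(text):
    """Dedupe rows, standardize column names, and convert dates to ISO format."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
    seen, cleaned = set(), []
    for row in reader:
        key = tuple(row)
        if key in seen:  # drop exact duplicate rows
            continue
        seen.add(key)
        cleaned.append(row)
    date_col = header.index("signup_date")  # hypothetical date column
    for row in cleaned:
        # Normalize US-style dates (MM/DD/YYYY) to ISO (YYYY-MM-DD)
        row[date_col] = datetime.strptime(row[date_col], "%m/%d/%Y").date().isoformat()
    return header, cleaned

raw = "ID,Signup Date\n1,03/15/2024\n1,03/15/2024\n2,11/02/2023\n"
header, rows = clean_csv(raw)
```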

Log File Analysis

Task:
Parse application log files and extract error messages,
timestamps, and user IDs into a structured format
What RepoMaster Does:
  • Finds log parsing tools (logparser, python-log-parser)
  • Defines extraction patterns for your log format
  • Extracts structured fields
  • Aggregates statistics (error counts, user activity)
  • Outputs to CSV or database
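Structured extraction from logs usually comes down to a regular expression with named groups. A minimal sketch — the log line format and `user=` field here are invented for illustration, and the pattern would be adapted to your application's actual format:

```python
import re

# Hypothetical log line format: "YYYY-MM-DD HH:MM:SS LEVEL user=ID message"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) user=(?P<user_id>\w+) (?P<message>.*)"
)

def parse_log(lines):
    """Extract structured records for error-level lines only."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("level") == "ERROR":
            records.append(m.groupdict())
    return records

logs = [
    "2024-06-01 12:00:03 INFO user=u42 login ok",
    "2024-06-01 12:01:17 ERROR user=u42 payment declined",
]
errors = parse_log(logs)
```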

Format Conversion Examples

CSV to Excel:
Convert sales_data.csv to Excel with formatting and charts
Excel to CSV:
Extract the 'Orders' sheet from workbook.xlsx to CSV

Advanced Data Processing

Data Merging & Joining

Task:
Merge customers.csv and orders.csv on customer_id,
perform left join and save as combined_data.xlsx
Features:
  • Inner, outer, left, right joins
  • Multiple key columns
  • Handle missing keys gracefully
  • Aggregate data during merge
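A generated pipeline would normally delegate this to pandas (`DataFrame.merge`), but the left-join semantics can be sketched in plain Python — unmatched left rows are kept and right-side fields filled with None:

```python
def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on a shared key column."""
    index, right_fields = {}, set()
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
        right_fields.update(k for k in row if k != key)
    joined = []
    for row in left_rows:
        matches = index.get(row[key])
        if matches:
            for m in matches:
                joined.append({**row, **{k: v for k, v in m.items() if k != key}})
        else:
            # Keep the left row, filling right-side fields with None
            joined.append({**row, **{k: None for k in right_fields}})
    return joined

# Hypothetical customers/orders data matching the task above
customers = [{"customer_id": 1, "name": "Acme"}, {"customer_id": 2, "name": "TechStart"}]
orders = [{"customer_id": 1, "total": 250}]
combined = left_join(customers, orders, "customer_id")
```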

Data Filtering & Transformation

Task:
Filter sales data for Q4 2024, calculate monthly totals,
and create a summary with growth percentages
Capabilities:
  • Complex filtering conditions
  • Aggregation functions (sum, avg, count, min, max)
  • Calculated columns and derived metrics
  • Grouping and pivoting
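The filter-aggregate-derive pattern for the task above can be sketched as follows (the sales records and month encoding are invented for illustration):

```python
from collections import defaultdict

def monthly_summary(sales, months):
    """Sum revenue per month for the given months, plus month-over-month growth %."""
    totals = defaultdict(float)
    for row in sales:
        if row["month"] in months:                  # filtering condition
            totals[row["month"]] += row["revenue"]  # aggregation
    summary, prev = [], None
    for month in months:
        total = totals[month]
        # Derived metric: growth relative to the previous month
        growth = None if prev in (None, 0) else round((total - prev) / prev * 100, 1)
        summary.append({"month": month, "total": total, "growth_pct": growth})
        prev = total
    return summary

# Hypothetical sales records
sales = [
    {"month": "2024-10", "revenue": 100.0},
    {"month": "2024-11", "revenue": 125.0},
    {"month": "2024-09", "revenue": 999.0},  # outside Q4, filtered out
]
q4 = monthly_summary(sales, ["2024-10", "2024-11", "2024-12"])
```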

Batch Processing

Task:
Process all CSV files in the data/ folder:
clean, standardize, and merge into master_dataset.csv
Benefits:
  • Process multiple files in one go
  • Consistent transformations across files
  • Error handling per file
  • Progress tracking and logging
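The merge step of such a batch job can be sketched in the stdlib. For clarity the function takes CSV contents directly; in a real run the texts would come from something like `Path("data/").glob("*.csv")`:

```python
import csv
import io

def merge_csv_texts(csv_texts):
    """Standardize headers across multiple CSV files and merge rows into one table."""
    merged_rows, merged_header = [], None
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        # Consistent transformation applied to every file's header
        header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
        if merged_header is None:
            merged_header = header
        elif header != merged_header:
            raise ValueError(f"inconsistent columns: {header!r}")  # per-file error handling
        merged_rows.extend(reader)
    return merged_header, merged_rows

# Two hypothetical files whose headers differ only in casing
files = ["ID,Amount\n1,10\n", "id,amount\n2,20\n"]
header, rows = merge_csv_texts(files)
```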

Data Validation & Quality

1. Schema Validation

Validate that this CSV has required columns: id, name, email, date
and check data types match expected schema
2. Constraint Checking

Verify all email addresses are valid format,
dates are within range 2020-2024,
and prices are positive numbers
3. Duplicate Detection

Find and report duplicate records based on email field,
show counts and examples
4. Completeness Analysis

Generate report showing percentage of missing values per column
and identify rows with multiple missing fields
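Schema and constraint checks like these reduce to per-row rules. A minimal sketch — the `REQUIRED` schema and the (deliberately simple) email pattern are assumptions for illustration, not what a generated pipeline would necessarily use:

```python
import re

REQUIRED = {"id": int, "name": str, "email": str}  # hypothetical schema
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplistic format check

def validate(records):
    """Return a list of (row_index, problem) pairs for schema/constraint violations."""
    problems = []
    for i, rec in enumerate(records):
        for col, typ in REQUIRED.items():
            if col not in rec:
                problems.append((i, f"missing column {col}"))
            elif not isinstance(rec[col], typ):
                problems.append((i, f"{col} has wrong type"))
        email = rec.get("email")
        if isinstance(email, str) and not EMAIL_RE.match(email):
            problems.append((i, "invalid email"))
    return problems

records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": "2", "name": "Bob", "email": "not-an-email"},
]
issues = validate(records)
```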

Real Execution Example

🌟 Unified Assistant started!
============================================================
📋 Task: Extract tables from PDF and convert to CSV
🔧 Analyzing task...

🔍 Identifying task type: Data Processing + PDF Extraction
✅ Routing to Repository Agent

📦 Searching for PDF table extraction tools...
   Found: camelot-py, tabula-py, pdfplumber, pdfminer

📊 Evaluating repositories...
✅ Selected: camelot-py
   Reason: Best table detection accuracy + lattice/stream modes

📦 Cloning repository...
✅ Repository cloned successfully

🔍 Analyzing PDF structure...
   Pages: 15
   Tables detected: 8 tables across 6 pages

⚙️  Extracting tables...
   Page 3: Table 1 - 5 columns × 12 rows
   Page 5: Table 2 - 3 columns × 8 rows
   Page 7: Table 3 - 7 columns × 24 rows
   ...
✅ Extracted 8 tables successfully

🔄 Converting to CSV format...
   Handling merged cells...
   Standardizing column names...
   Cleaning cell values...
✅ Conversion complete

💾 Saving results...
✅ Saved to:
   - coding/table_1.csv
   - coding/table_2.csv
   - coding/table_3.csv
   - coding/tables_combined.csv

📊 Summary:
   Total tables: 8
   Total rows: 247
   Total columns: 38 unique fields
   
✨ Task completed successfully!

Output Options

Specify your preferred output format in your request, for example:
Save results as CSV with UTF-8 encoding

Data Analysis Features

Statistical Summary

Task:
Generate statistical summary of sales_data.csv:
mean, median, std dev, min, max for all numeric columns
Output:
Sales Data Summary
==================
Total Records: 1,247

Revenue:
  Mean: $45,234
  Median: $38,500
  Std Dev: $12,450
  Min: $5,200
  Max: $125,000

Units Sold:
  Mean: 342
  Median: 298
  Std Dev: 156
  Min: 12
  Max: 2,450
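The per-column statistics above map directly onto Python's `statistics` module. A sketch with hypothetical sample values (a real run would read the column from the CSV):

```python
import statistics

def summarize(values):
    """Compute mean, median, std dev, min, and max for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

revenue = [5200, 38500, 45234, 125000]  # hypothetical sample
stats = summarize(revenue)
```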

Data Visualization

Task:
Create bar chart showing monthly revenue from sales.csv
and save as PNG
RepoMaster finds visualization libraries (matplotlib, plotly) and generates charts automatically.

Best Practices

Always work on copies:
Process customer_data.csv and save cleaned version as customer_data_clean.csv

Validate after processing:
Clean the data and then validate: check for missing values,
verify row count, and show sample of results

Keep an audit trail:
Process sales data: remove duplicates, fill missing values with mean,
convert dates to ISO format, and generate log of all changes

Chunk large files:
Process this 5GB CSV file in chunks of 100,000 rows
to avoid memory issues
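Chunked processing can be sketched with the stdlib alone: stream the CSV reader in fixed-size slices so the whole file never sits in memory. The demo below uses a tiny in-memory file and chunks of 2 rows; a real run would open the 5 GB file and use a chunk size like 100,000:

```python
import csv
import io
from itertools import islice

def process_in_chunks(fileobj, chunk_size, handle):
    """Stream a CSV in fixed-size row chunks, calling handle(header, chunk) per chunk."""
    reader = csv.reader(fileobj)
    header = next(reader)
    while True:
        chunk = list(islice(reader, chunk_size))  # pulls at most chunk_size rows
        if not chunk:
            break
        handle(header, chunk)

# Demo: 5 rows in chunks of 2
chunks_seen = []
data = io.StringIO("id,val\n1,a\n2,b\n3,c\n4,d\n5,e\n")
process_in_chunks(data, 2, lambda h, c: chunks_seen.append(len(c)))
```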

Common Challenges Solved

Encoding Issues

Automatically detects and handles various file encodings (UTF-8, Latin-1, etc.)

Inconsistent Formats

Standardizes date formats, number representations, and text casing

Missing Data

Smart imputation strategies based on data type and distribution

Large Files

Chunk-based processing for files too large for memory

Integration with Other Use Cases

Data processing often combines with other RepoMaster capabilities:
  • Web Scraping → Data Processing: Scrape data, then clean and structure it
  • Data Processing → AI/ML: Prepare data for model training
  • PDF Extraction → Data Processing: Extract tables, then analyze and visualize

Next Steps

AI/ML Tasks

Use processed data for machine learning

Web Scraping

Collect data to process

Repository Agent

Learn how data tools are discovered

Programming Assistant

Custom data processing scripts
