The ScrapAI CLI provides a comprehensive interface for building, managing, and running AI-powered web scrapers. All commands interact with a database-first architecture where spiders are stored as JSON configurations.

Architecture

ScrapAI uses a project-based organization model:
  • Projects: Logical groupings of spiders (e.g., news, ecommerce, research)
  • Spiders: JSON configurations stored in the database
  • Queue: Database-backed queue for batch processing
  • Data: Test mode saves to database, production mode exports to JSONL files
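Since spiders are stored as JSON configurations in the database, a configuration is just a structured document. The sketch below is purely illustrative — the field names (`name`, `project`, `start_urls`, `selectors`) are assumptions, not ScrapAI's actual schema:

```
{
  "name": "bbc_co_uk",
  "project": "news",
  "start_urls": ["https://www.bbc.co.uk/news"],
  "selectors": {
    "title": "h1",
    "body": "article p"
  }
}
```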

Entry Point

The scrapai script automatically activates the virtual environment and delegates to the CLI:
# Linux/macOS
./scrapai <command> [options]

# Windows
scrapai <command> [options]

Command Categories

Setup & Verification

Install dependencies, configure environment, verify setup

Spider Management

List, import, delete, and manage spider configurations

Crawling

Run spiders in test or production mode with checkpoint support

Queue Management

Add URLs, bulk import, process items in parallel batches

Data Operations

View scraped items, export to CSV/JSON/Parquet

Inspection

Analyze websites for scraper development

Database

Migrations, queries, statistics, data transfer

Projects

List and manage project configurations

Global Conventions

Project Names

Most commands require a --project flag to specify the project context:
./scrapai spiders list --project news
./scrapai crawl bbc_co_uk --project news
The project name defaults to default if not specified, but explicit project names are recommended for clarity.

Output Modes

Test Mode (with --limit):
  • Saves scraped items to the database
  • Stops after the specified number of items
  • View results with the show command
  • No HTML content stored
Production Mode (no limit):
  • Exports to timestamped JSONL files in data/<project>/<spider>/crawls/
  • Includes full HTML content
  • Enables checkpoint pause/resume
  • Database writes disabled for performance
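Production exports are plain newline-delimited JSON, so they can be post-processed with standard tooling. A minimal reader sketch (the record fields and file name are assumptions; only the JSONL format comes from the docs above):

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one parsed record per line from a JSONL export."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Example: count records in a crawl export (path layout from this page)
export = Path("data/news/bbc_co_uk/crawls/example.jsonl")
if export.exists():
    print(sum(1 for _ in read_jsonl(export)))
```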

File Paths

All data is stored under the DATA_DIR configured in .env (default: ./data):
data/
├── <project>/
│   ├── <spider>/
│   │   ├── crawls/         # Production JSONL exports
│   │   ├── exports/        # Manual exports (CSV/JSON/Parquet)
│   │   └── checkpoint/     # Pause/resume state
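The layout above can be reproduced programmatically when scripting around ScrapAI's output. A sketch assuming DATA_DIR resolves to ./data:

```python
from pathlib import Path

def spider_dirs(data_dir, project, spider):
    """Return the crawls/exports/checkpoint directories for one spider."""
    base = Path(data_dir) / project / spider
    return {name: base / name for name in ("crawls", "exports", "checkpoint")}

dirs = spider_dirs("./data", "news", "bbc_co_uk")
# dirs["crawls"] points at data/news/bbc_co_uk/crawls
```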

Common Workflows

Quick Test

Test a spider on 5-10 URLs to verify extraction:
./scrapai crawl myspider --project myproject --limit 5
./scrapai show myspider --project myproject

Production Crawl

Run a full crawl with checkpoint support:
./scrapai crawl myspider --project myproject
# Press Ctrl+C to pause
# Run same command to resume
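Checkpointing in general works by persisting progress so a resumed run can skip completed work. The generic sketch below illustrates the technique only — it is not ScrapAI's internal checkpoint format:

```python
import json
from pathlib import Path

class Checkpoint:
    """Persist the set of completed URLs so a resumed crawl skips them."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set()
        if self.path.exists():  # resume: load prior state
            self.done = set(json.loads(self.path.read_text()))

    def mark(self, url):
        """Record a URL as done and flush state to disk."""
        self.done.add(url)
        self.path.write_text(json.dumps(sorted(self.done)))

    def seen(self, url):
        return url in self.done
```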

Batch Processing

Add multiple websites to queue and process:
./scrapai queue bulk urls.csv --project myproject
./scrapai queue list --project myproject
./scrapai queue next --project myproject  # Claim next item
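queue next claims an item so parallel workers never process the same URL twice. The usual pattern is a single-transaction claim; here is a generic SQLite sketch of that pattern (the table and column names are assumptions, not ScrapAI's schema):

```python
import sqlite3

def claim_next(conn):
    """Claim the oldest pending queue item in one transaction, or None."""
    with conn:  # select and mark inside one transaction;
                # SQLite's write lock serializes concurrent claimers
        row = conn.execute(
            "SELECT id, url FROM queue WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE queue SET status = 'claimed' WHERE id = ?", (row[0],)
        )
        return row
```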

Export Data

Export scraped data in various formats:
./scrapai export myspider --project myproject --format csv
./scrapai export myspider --project myproject --format parquet
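The export formats map directly onto the JSONL crawl records. As an illustration of what a CSV export amounts to conceptually, here is a stdlib-only JSONL-to-CSV conversion (the field names are assumptions):

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path, fields):
    """Write the given fields of each JSONL record as CSV rows."""
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))
```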

Platform Support

  • Linux: Full support including headless Cloudflare bypass with xvfb
  • macOS: Full support
  • Windows: Full support (use scrapai.bat or scrapai directly)

Database Support

  • SQLite: Default, zero configuration
  • PostgreSQL: Production deployments, atomic queue operations
Switch by updating DATABASE_URL in .env and running migrations.
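Assuming SQLAlchemy-style connection URLs (the exact format and database file name depend on your ScrapAI version), the .env entries would look roughly like:

```
# SQLite (default, file-based)
DATABASE_URL=sqlite:///./scrapai.db

# PostgreSQL (production)
DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
```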

Next Steps

Setup Commands

Install ScrapAI and verify your environment

Spider Management

Learn how to import and manage spider configurations
