The ScrapAI CLI provides a comprehensive interface for building, managing, and running AI-powered web scrapers. All commands interact with a database-first architecture where spiders are stored as JSON configurations.

Architecture

ScrapAI uses a project-based organization model:
  • Projects: Logical groupings of spiders (e.g., news, ecommerce, research)
  • Spiders: JSON configurations stored in the database
  • Queue: Database-backed queue for batch processing
  • Data: Test mode saves to database, production mode exports to JSONL files
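Since spiders are stored as JSON configurations in the database, a configuration is just a structured document. The sketch below is purely illustrative — the field names (`name`, `project`, `start_urls`, `selectors`) are assumptions, not ScrapAI's actual schema:

```
{
  "name": "bbc_co_uk",
  "project": "news",
  "start_urls": ["https://www.bbc.co.uk/news"],
  "selectors": {
    "title": "h1",
    "body": "article p"
  }
}
```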

Entry Point

The scrapai script automatically activates the virtual environment and delegates to the CLI:
# Linux/macOS
./scrapai <command> [options]

# Windows
scrapai <command> [options]

Command Categories

Setup & Verification

Install dependencies, configure environment, verify setup

Spider Management

List, import, delete, and manage spider configurations

Crawling

Run spiders in test or production mode with checkpoint support

Queue Management

Add URLs, bulk import, process items in parallel batches

Data Operations

View scraped items, export to CSV/JSON/Parquet

Inspection

Analyze websites for scraper development

Database

Migrations, queries, statistics, data transfer

Projects

List and manage project configurations

Global Conventions

Project Names

Most commands require a --project flag to specify the project context:
./scrapai spiders list --project news
./scrapai crawl bbc_co_uk --project news
The project name defaults to default if not specified, but explicit project names are recommended for clarity.

Output Modes

Test Mode (with --limit):
  • Saves scraped items to the database
  • Stops after the specified number of items
  • View results with the show command
  • No HTML content stored
Production Mode (no limit):
  • Exports to timestamped JSONL files in data/<project>/<spider>/crawls/
  • Includes full HTML content
  • Enables checkpoint pause/resume
  • Database writes disabled for performance
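Production exports are plain newline-delimited JSON, so they can be post-processed with standard tooling. A minimal reader sketch (the record fields and file name are assumptions; only the JSONL format comes from the docs above):

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one parsed record per line from a JSONL export."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Example: count records in a crawl export (path layout from this page)
export = Path("data/news/bbc_co_uk/crawls/example.jsonl")
if export.exists():
    print(sum(1 for _ in read_jsonl(export)))
```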

File Paths

All data is stored under the DATA_DIR configured in .env (default: ./data):
data/
├── <project>/
│   ├── <spider>/
│   │   ├── crawls/         # Production JSONL exports
│   │   ├── exports/        # Manual exports (CSV/JSON/Parquet)
│   │   └── checkpoint/     # Pause/resume state
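The layout above can be reproduced programmatically when scripting around ScrapAI's output. A sketch assuming DATA_DIR resolves to ./data:

```python
from pathlib import Path

def spider_dirs(data_dir, project, spider):
    """Return the crawls/exports/checkpoint directories for one spider."""
    base = Path(data_dir) / project / spider
    return {name: base / name for name in ("crawls", "exports", "checkpoint")}

dirs = spider_dirs("./data", "news", "bbc_co_uk")
# dirs["crawls"] points at data/news/bbc_co_uk/crawls
```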

Common Workflows

Quick Test

Test a spider on 5-10 URLs to verify extraction:
./scrapai crawl myspider --project myproject --limit 5
./scrapai show myspider --project myproject

Production Crawl

Run a full crawl with checkpoint support:
./scrapai crawl myspider --project myproject
# Press Ctrl+C to pause
# Run same command to resume
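Checkpointing in general works by persisting progress so a resumed run can skip completed work. The generic sketch below illustrates the technique only — it is not ScrapAI's internal checkpoint format:

```python
import json
from pathlib import Path

class Checkpoint:
    """Persist the set of completed URLs so a resumed crawl skips them."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set()
        if self.path.exists():  # resume: load prior state
            self.done = set(json.loads(self.path.read_text()))

    def mark(self, url):
        """Record a URL as done and flush state to disk."""
        self.done.add(url)
        self.path.write_text(json.dumps(sorted(self.done)))

    def seen(self, url):
        return url in self.done
```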

Batch Processing

Add multiple websites to queue and process:
./scrapai queue bulk urls.csv --project myproject
./scrapai queue list --project myproject
./scrapai queue next --project myproject  # Claim next item
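queue next claims an item so parallel workers never process the same URL twice. The usual pattern is a single-transaction claim; here is a generic SQLite sketch of that pattern (the table and column names are assumptions, not ScrapAI's schema):

```python
import sqlite3

def claim_next(conn):
    """Claim the oldest pending queue item in one transaction, or None."""
    with conn:  # select and mark inside one transaction;
                # SQLite's write lock serializes concurrent claimers
        row = conn.execute(
            "SELECT id, url FROM queue WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE queue SET status = 'claimed' WHERE id = ?", (row[0],)
        )
        return row
```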

Export Data

Export scraped data in various formats:
./scrapai export myspider --project myproject --format csv
./scrapai export myspider --project myproject --format parquet
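The export formats map directly onto the JSONL crawl records. As an illustration of what a CSV export amounts to conceptually, here is a stdlib-only JSONL-to-CSV conversion (the field names are assumptions):

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path, fields):
    """Write the given fields of each JSONL record as CSV rows."""
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))
```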

Platform Support

  • Linux: Full support including headless Cloudflare bypass with xvfb
  • macOS: Full support
  • Windows: Full support (use scrapai.bat or scrapai directly)

Database Support

  • SQLite: Default, zero configuration
  • PostgreSQL: Production deployments, atomic queue operations
Switch by updating DATABASE_URL in .env and running migrations.
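Assuming SQLAlchemy-style connection URLs (the exact format and database file name depend on your ScrapAI version), the .env entries would look roughly like:

```
# SQLite (default, file-based)
DATABASE_URL=sqlite:///./scrapai.db

# PostgreSQL (production)
DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
```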

Next Steps

Setup Commands

Install ScrapAI and verify your environment

Spider Management

Learn how to import and manage spider configurations
