## Overview

The Airflow integration provides:

- Automatic DAG generation from your spider database
- Project-based organization with filtering and access control
- Scheduled crawls with configurable intervals
- Real-time monitoring with logs and execution history
- S3 upload with gzip compression (optional)
## Architecture
## Quick Start

### 1. Configure Environment

Add to your `.env` file:
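The exact keys depend on your setup; a hypothetical minimal example (every name below is an assumption — check `.env.example` if the project ships one):

```
# All names hypothetical -- substitute your project's real keys
DATABASE_URL=postgresql://scrapai:scrapai@host.docker.internal:5432/scrapai
AIRFLOW_ADMIN_USER=admin
AIRFLOW_ADMIN_PASSWORD=change-me
```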
### 2. Start Airflow

### 3. Access Web UI

Open http://localhost:8080 and log in with your credentials. You’ll see DAGs for each spider in your database, named `{project}_{spider_name}`.
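Assuming the `docker-compose.airflow.yml` file mentioned later in this guide, starting the stack might look like (the exact compose invocation is an assumption):

```bash
docker compose -f docker-compose.airflow.yml up -d
```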
## DAG Generation

DAGs are generated dynamically from your spider database. The generator runs on scheduler refresh (every few minutes).

### DAG Naming Convention

Pattern: `{project}_{spider_name}`

Examples:

- `news_bbc_co_uk`
- `climate_team_climate_news`
- `default_example_spider` (if no project is set)
### DAG Configuration

Each DAG includes:

### Task Structure
Each DAG has 2-3 tasks:

1. `crawl_spider`: Runs `./scrapai crawl {spider_name} --timeout 28800`
   - 8-hour graceful timeout
   - 9-hour hard kill as fallback
2. `verify_results`: Runs `./scrapai show {spider_name} --limit 5`
   - Verifies data was extracted
   - Shows a sample of results
3. `upload_to_s3` (optional): Compresses and uploads to S3
   - Only runs if S3 credentials are configured
   - Gzip compression before upload
   - Preserves folder structure
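The two-level timeout can be sketched with plain `subprocess` (the real task is presumably an Airflow operator; the helper name and return value here are illustrative):

```python
import subprocess

def run_with_hard_kill(cmd: list[str], hard_timeout: float) -> int:
    """Run cmd; if it outlives hard_timeout seconds, kill it outright."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=hard_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # hard kill as fallback
        proc.wait()   # reap the killed process
        return -9     # conventional "killed by SIGKILL" code

# The crawl passes the 8-hour graceful timeout to the spider itself and
# keeps the 9-hour hard kill as the outer bound:
# run_with_hard_kill(["./scrapai", "crawl", spider, "--timeout", "28800"],
#                    hard_timeout=9 * 3600)
```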
## Scheduling Spiders

By default, spiders have no schedule (manual triggering only). To add scheduling:

### Option 1: Database Column

Add a `schedule_interval` column to your spiders table:

### Option 2: Edit DAG Generator

Modify `airflow/dags/scrapai_spider_dags.py`:
### Common Schedules

| Interval | Cron Expression | Description |
|---|---|---|
| `@hourly` | `0 * * * *` | Every hour at minute 0 |
| `@daily` | `0 0 * * *` | Daily at midnight |
| `@weekly` | `0 0 * * 0` | Weekly on Sunday |
| Custom | `0 */6 * * *` | Every 6 hours |
| Custom | `0 9 * * 1-5` | Weekdays at 9am |
## Project-Based Organization

### Filtering by Project

1. Go to Airflow UI → DAGs page
2. Click a project tag: `project:your_project_name`
3. See only that project’s spiders

### Environment Variable Filter

Limit which projects appear in Airflow:

## Triggering Crawls
### Via Web UI

1. Go to the DAGs page
2. Find your spider DAG
3. Click the “Play” button (▶)
4. Monitor progress in real-time
### Via CLI
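Airflow’s own CLI can trigger a DAG by id; assuming the stack runs under Docker Compose (the service name here is an assumption), this might look like:

```bash
# Service name "airflow-scheduler" is an assumption -- check your compose file
docker compose -f docker-compose.airflow.yml exec airflow-scheduler \
  airflow dags trigger news_bbc_co_uk
```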
### Via REST API
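A minimal sketch against Airflow’s stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`); the base URL and basic-auth credentials below are assumptions for a default local deployment:

```python
import base64
import json
import urllib.request

def build_trigger_request(base_url: str, dag_id: str,
                          user: str, password: str) -> urllib.request.Request:
    """Build the POST request that triggers a new run of dag_id."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )

# req = build_trigger_request("http://localhost:8080", "news_bbc_co_uk",
#                             "admin", "admin")
# urllib.request.urlopen(req)  # raises on non-2xx responses
```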
## Monitoring

### View Execution Logs

1. Click the DAG name
2. Select a DAG run (date/time)
3. Click a task (green/red box)
4. Click the “Log” button

### Execution History

Each DAG shows:

- Last run status (success/fail)
- Run duration
- Success rate over time
- Task dependencies graph

### Stats Available

- Duration: How long each crawl took
- Records scraped: From the verify task output
- Failures: Which spiders are broken
- Trends: Performance over time
## S3 Integration

Upload crawl results to S3-compatible storage with automatic gzip compression.

### Configuration

Add to `.env`:
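A hypothetical fragment: `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` are the standard AWS credential names, while the bucket and endpoint keys are illustrative assumptions:

```
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
S3_BUCKET=my-crawl-results              # assumed key name
S3_ENDPOINT_URL=https://s3.example.com  # only for S3-compatible stores
```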
### Upload Behavior

From `airflow/dags/scrapai_spider_dags.py:61-139`, results are uploaded to:

`s3://bucket/spider_name/YYYY-MM-DD/crawl_HHMMSS.jsonl.gz`
## Access Control (RBAC)

### Creating Project-Specific Roles

1. Go to Security → List Roles
2. Click “+” to add a new role
3. Name: `project_news_admin`
4. Select permissions:
   - `can_read` on `DAG:news_*`
   - `can_edit` on `DAG:news_*`
   - `can_trigger` on `DAG:news_*`
### Creating Users

1. Go to Security → List Users
2. Click “+” to add a new user
3. Assign role: `project_news_admin`
### Permission Levels
| Role | Can View | Can Trigger | Can Edit | Can Delete |
|---|---|---|---|---|
| Admin | All DAGs | Yes | Yes | Yes |
| Project Admin | Project DAGs | Yes | Yes | Yes |
| Project User | Project DAGs | Yes | Yes | No |
| Viewer | Project DAGs | No | No | No |
### Programmatic Access Control

Uncomment in `airflow/dags/scrapai_spider_dags.py:193-196`:
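Those lines presumably attach an `access_control` mapping to each generated DAG; a hypothetical sketch (the role name and permission set are illustrative, mirroring the RBAC section above):

```python
# Hypothetical: grant a project-scoped role access to this project's DAGs.
# Airflow's DAG(access_control=...) takes a {role_name: {permission}} dict.
project = "news"  # assumed: taken from the spider's project column
access_control = {
    f"project_{project}_admin": {"can_read", "can_edit"},
}
# dag = DAG(dag_id, access_control=access_control, ...)
```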
## Alerting

### Email Notifications

Edit `DEFAULT_DAG_ARGS` in `scrapai_spider_dags.py:50-58`:
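A sketch of the relevant keys (the address is a placeholder, and the real dict at lines 50-58 may carry more entries; `email_on_failure` and friends are standard Airflow `default_args` keys):

```python
# Placeholder address -- substitute your own.
DEFAULT_DAG_ARGS = {
    "email": ["alerts@example.com"],
    "email_on_failure": True,   # mail whenever a task fails
    "email_on_retry": False,    # stay quiet on retries
    "retries": 1,
}
```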
### Configure SMTP

Add to the `docker-compose.airflow.yml` environment:
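Airflow reads SMTP settings from `AIRFLOW__SMTP__*` environment variables; a sketch with placeholder values:

```yaml
environment:
  AIRFLOW__SMTP__SMTP_HOST: smtp.example.com
  AIRFLOW__SMTP__SMTP_PORT: "587"
  AIRFLOW__SMTP__SMTP_STARTTLS: "true"
  AIRFLOW__SMTP__SMTP_USER: alerts@example.com
  AIRFLOW__SMTP__SMTP_PASSWORD: change-me
  AIRFLOW__SMTP__SMTP_MAIL_FROM: alerts@example.com
```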
### Custom Alerts

Add a custom task after verify:

## Management Commands
## Troubleshooting

### DAGs Not Showing Up

Check the DAG file for errors:

### Spider Crawls Failing
Check the task logs in the Airflow UI:

1. Click the failed task (red box)
2. Click the “Log” button
3. Look for error messages

### Database Connection Issues

Use `host.docker.internal` instead of `localhost`:

## Best Practices
### Resource Management

- Set `max_active_runs=1` to prevent concurrent runs
- Use `execution_timeout` to prevent runaway tasks
- Monitor memory usage for large crawls

### Scheduling Strategy

- High-frequency sites (news): `@hourly` or `0 */6 * * *`
- Daily updates: `@daily` (midnight) or `0 9 * * *` (9am)
- Weekly archives: `0 0 * * 0` (Sunday midnight)
- Manual only: `None` (on-demand triggering)
### Monitoring
- Set up email alerts for failures
- Review execution times weekly
- Check success rates to catch broken spiders
- Monitor S3 storage growth
## See Also

- Parallel Crawling: run multiple spiders simultaneously with GNU parallel
- Security: security validation and agent safety features