SmartScraperGraph is the simplest and most powerful way to extract data from a single webpage using natural language prompts.
Overview
This example demonstrates how to:
- Configure a basic scraping graph
- Use natural language to describe what you want to extract
- Process and display the results
- Monitor execution details
Complete Example
The working example for this page extracts an article from Wired.com.
Step-by-Step Breakdown
Configure the graph
- llm: The language model to use (OpenAI GPT-4o-mini in this case)
- verbose: Enable detailed logging
- headless: Set to False to see the browser in action
Create and run the graph
Create a SmartScraperGraph instance with:
- prompt: Natural language description of what to extract
- source: URL of the webpage to scrape
- config: Your configuration dictionary
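The two steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the page's original listing: the prompt text, the `https://www.wired.com` URL, the `OPENAI_API_KEY` environment variable, and the helper names `build_config`, `scrape`, and `main` are all illustrative. The `openai/gpt-4o-mini` model string follows the provider/model naming that recent ScrapeGraphAI releases appear to use, so check it against your installed version.

```python
import json
import os


def build_config(headless: bool = False) -> dict:
    """Return a configuration dictionary with the three keys described above."""
    return {
        "llm": {
            # Assumption: the API key is read from the OPENAI_API_KEY env variable.
            "api_key": os.getenv("OPENAI_API_KEY"),
            "model": "openai/gpt-4o-mini",
        },
        "verbose": True,       # enable detailed logging
        "headless": headless,  # False shows the browser window
    }


def scrape(prompt: str, source: str, config: dict):
    """Create a SmartScraperGraph, run it, and return the extracted data."""
    # Imported lazily so build_config() works without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=source, config=config)
    return graph.run()


def main() -> None:
    result = scrape(
        prompt="Extract the article title, author name, and publication date",
        source="https://www.wired.com",  # hypothetical target URL
        config=build_config(),
    )
    print(json.dumps(result, indent=4))
```

Calling `main()` requires the `scrapegraphai` package, the Playwright browsers, and a valid API key. The configuration helper on its own has no such dependencies.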
Configuration Options
- OpenAI
- Ollama (Local)
- Azure OpenAI
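As an illustration of how the `llm` block might differ across these three providers, here is a hedged sketch. The helper name `llm_config`, the model and deployment names, the environment variable names, and the local Ollama URL are assumptions. The Azure keys in particular (`api_version`, `azure_endpoint`) are modeled on common Azure OpenAI client parameters and have not been confirmed against this library, so verify every key against the ScrapeGraphAI documentation for your version.

```python
import os


def llm_config(provider: str) -> dict:
    """Return an illustrative "llm" block for the given provider name."""
    if provider == "openai":
        return {
            "api_key": os.getenv("OPENAI_API_KEY"),
            "model": "openai/gpt-4o-mini",
        }
    if provider == "ollama":
        return {
            # Local models need no API key; the model name is hypothetical.
            "model": "ollama/llama3.2",
            "base_url": "http://localhost:11434",  # assumed default Ollama address
        }
    if provider == "azure":
        return {
            # Assumed key names; the deployment name and API version are placeholders.
            "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
            "model": "azure_openai/gpt-4o-mini",
            "api_version": os.getenv("AZURE_OPENAI_API_VERSION"),
            "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
        }
    raise ValueError(f"unknown provider: {provider}")
```

The returned dictionary would replace the value of the `llm` key in the graph configuration, leaving `verbose` and `headless` unchanged.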
Expected Output
The script returns structured JSON data.
Common Use Cases
News Articles
Extract headlines, authors, dates, and content from news websites
Product Information
Scrape product names, prices, descriptions, and reviews
Contact Details
Extract emails, phone numbers, and addresses from business websites
Event Data
Gather event names, dates, locations, and descriptions
Tips for Better Results
Be specific in your prompts: Instead of “get data”, use “Extract the article title, author name, publication date, and first paragraph”.
Use headless mode for production: Set "headless": True to run the browser in the background for better performance.
Handle errors gracefully: Wrap your scraping code in try-except blocks to handle network issues and parsing errors.
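The error-handling tip could look something like the sketch below. The helper name `run_safely` is hypothetical, and it catches the broad `Exception` class only because this page does not say which exception types the library raises. Narrow the `except` clause once you know them.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def run_safely(run: Callable[[], Any]) -> Optional[Any]:
    """Call run() and return its result, or None if it raises."""
    try:
        return run()
    except Exception as exc:  # broad on purpose; see the note above
        logger.error("Scraping failed: %s", exc)
        return None
```

With a graph instance named `graph`, this would be called as `run_safely(graph.run)`, and a `None` result signals that the run failed and the error was logged.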
Monitoring Execution
The get_execution_info() method returns details about the run, including:
- Execution time for each node
- Token usage and costs
- Errors or warnings
- Graph traversal path
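To make the list above concrete, here is a small sketch that condenses execution info into a summary. It assumes that `get_execution_info()` returns a list of per-node dictionaries with `node_name`, `exec_time`, and `total_tokens` keys, plus a final row named `TOTAL RESULT`. That shape is an assumption based on commonly seen output, so inspect the real return value first. The helper name `summarize_exec_info` is hypothetical.

```python
def summarize_exec_info(exec_info: list) -> dict:
    """Condense per-node execution rows into a path and simple totals."""
    # Assumption: an aggregate row named "TOTAL RESULT" may be present; skip it
    # so the totals below are not double counted.
    per_node = [row for row in exec_info if row.get("node_name") != "TOTAL RESULT"]
    return {
        "path": [row.get("node_name") for row in per_node],
        "exec_time": sum(row.get("exec_time", 0.0) for row in per_node),
        "total_tokens": sum(row.get("total_tokens", 0) for row in per_node),
    }


# Intended use after a run (requires a SmartScraperGraph instance named graph):
#     exec_info = graph.get_execution_info()
#     print(summarize_exec_info(exec_info))
```

The library also ships a `prettify_exec_info` helper in `scrapegraphai.utils` that formats the same data for printing, which may be enough if you only want to read the numbers.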
Next Steps
Multi-Page Scraping
Learn to scrape multiple URLs at once
Custom Schemas
Define structured output with Pydantic
Troubleshooting
Issue: Browser doesn’t open
- Make sure Playwright is installed: playwright install
- Check if headless is set to False
Issue: Rate limit errors
- Reduce the number of requests
- Add delays between requests
- Use a different model or provider
Issue: Empty or incorrect results
- Make your prompt more specific
- Check if the page requires JavaScript rendering
- Verify the page structure hasn’t changed
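For the advice about reducing requests and adding delays between them, a minimal sketch is shown here. The helper name `scrape_with_delay`, the `scrape_one` callback, and the two-second default are assumptions chosen for illustration. In practice `scrape_one` would build and run a graph for the given URL.

```python
import time
from typing import Any, Callable, Dict, Iterable


def scrape_with_delay(
    urls: Iterable[str],
    scrape_one: Callable[[str], Any],
    delay_seconds: float = 2.0,
) -> Dict[str, Any]:
    """Scrape each URL in turn, pausing between consecutive requests."""
    results: Dict[str, Any] = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results[url] = scrape_one(url)
    return results
```

A fixed pause is the simplest option. If the provider still returns rate limit errors, increasing the delay or retrying with a growing backoff are the usual next steps.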
