Schemas
Schemas in ScrapeGraphAI use Pydantic models to define the structure and validation rules for extracted data. They ensure your scraping results are consistently formatted and type-safe.
Why Use Schemas?
Schemas provide several benefits:
Type Safety - Enforce data types and validation rules
Structure - Define the exact output format
Documentation - Data models document themselves
IDE Support - Autocomplete and type hints
Basic Schema
Schemas are Pydantic BaseModel classes:
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")
Using the Schema
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}
}

scraper = SmartScraperGraph(
    prompt="Extract product information",
    source="https://example.com/product",
    config=graph_config,
    schema=Product,  # Pass the schema class, not an instance
)

result = scraper.run()
print(result)
# Example output (actual values depend on the page):
# {'name': 'Laptop', 'price': 999.99, 'available': True}
The LLM is automatically guided to generate output matching your schema structure.
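Because the result is printed as a plain dictionary, it can be validated back into a typed object on your side. The following is a minimal sketch, assuming Pydantic v2 (`model_validate`); the `result` dictionary is a hand-written stand-in for what `scraper.run()` might return, not real output.

```python
from pydantic import BaseModel, Field


class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")


# Hypothetical stand-in for the dictionary returned by scraper.run()
result = {"name": "Laptop", "price": 999.99, "available": True}

# Re-validate the dictionary to get attribute access and type checking
product = Product.model_validate(result)
print(product.name, product.price, product.available)
```

Validating again on the client side is optional, but it turns a malformed result into an explicit `ValidationError` instead of a silent downstream bug.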
Field Descriptions
Always include descriptions using Field(). These help the LLM understand what to extract:
from pydantic import BaseModel, Field

class Article(BaseModel):
    title: str = Field(
        description="The main headline or title of the article"
    )
    author: str = Field(
        description="Full name of the article author"
    )
    published_date: str = Field(
        description="Publication date in YYYY-MM-DD format"
    )
    summary: str = Field(
        description="A brief 2-3 sentence summary of the article content"
    )
    tags: list[str] = Field(
        description="List of relevant topic tags or categories"
    )
Clear, detailed descriptions significantly improve extraction accuracy.
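To see what a description contributes, you can inspect the JSON Schema that Pydantic generates for a model. This sketch assumes Pydantic v2 (`model_json_schema`) and uses a shortened `Article` model; how exactly ScrapeGraphAI forwards this information to the LLM is not shown here and should be treated as an assumption.

```python
from pydantic import BaseModel, Field


class Article(BaseModel):
    title: str = Field(description="The main headline or title of the article")
    tags: list[str] = Field(description="List of relevant topic tags or categories")


# Pydantic carries each Field description into the generated JSON Schema
schema = Article.model_json_schema()
descriptions = {
    name: spec.get("description")
    for name, spec in schema["properties"].items()
}
print(descriptions)
```

A field declared without `Field(description=...)` would show `None` here, which is a quick way to audit a schema for missing descriptions.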
Complex Schemas
Nested Objects
Create hierarchical data structures:
from pydantic import BaseModel, Field

class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    country: str = Field(description="Country name")
    postal_code: str = Field(description="Postal/ZIP code")

class Contact(BaseModel):
    email: str = Field(description="Email address")
    phone: str = Field(description="Phone number")

class Company(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Company description")
    address: Address = Field(description="Company address")
    contact: Contact = Field(description="Contact information")
    employee_count: int = Field(description="Number of employees")

# Usage (reuses SmartScraperGraph and graph_config from the basic example)
scraper = SmartScraperGraph(
    prompt="Extract company information",
    source="https://example.com/about",
    config=graph_config,
    schema=Company,
)

result = scraper.run()
print(result['address']['city'])  # Access nested data
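Nested dictionaries can also be validated into nested model instances, which replaces chained key lookups with attribute access. A minimal sketch, assuming Pydantic v2 and a trimmed-down version of the models above; the `result` dictionary is hypothetical sample data.

```python
from pydantic import BaseModel, Field


class Address(BaseModel):
    city: str = Field(description="City name")
    country: str = Field(description="Country name")


class Company(BaseModel):
    name: str = Field(description="Company name")
    address: Address = Field(description="Company address")


# Hypothetical stand-in for a scraping result
result = {"name": "Acme", "address": {"city": "Turin", "country": "Italy"}}

# The nested dict is converted into an Address instance automatically
company = Company.model_validate(result)
print(company.address.city)
```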
Lists of Objects
Extract multiple items with a wrapper class:
from pydantic import BaseModel, Field
from typing import List

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Average rating out of 5")
    reviews: int = Field(description="Number of reviews")

class ProductList(BaseModel):
    products: List[Product] = Field(
        description="List of all products found on the page"
    )

# Usage
scraper = SmartScraperGraph(
    prompt="Extract all products from the catalog",
    source="https://example.com/products",
    config=graph_config,
    schema=ProductList,
)

result = scraper.run()
for product in result['products']:
    print(f"{product['name']}: ${product['price']}")
When extracting multiple items, always use a wrapper class with a List field.
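Once the wrapper has been validated, the list behaves like any list of typed objects. The sketch below assumes Pydantic v2 and uses hypothetical sample data in place of a real scraping result.

```python
from typing import List

from pydantic import BaseModel, Field


class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")


class ProductList(BaseModel):
    products: List[Product] = Field(description="List of all products found on the page")


# Hypothetical stand-in for a scraping result
result = {
    "products": [
        {"name": "Mouse", "price": 19.5},
        {"name": "Keyboard", "price": 49.0},
    ]
}

catalog = ProductList.model_validate(result)

# Each element is a Product instance, so ordinary list tools apply
cheapest = min(catalog.products, key=lambda p: p.price)
total = sum(p.price for p in catalog.products)
print(cheapest.name, total)
```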
Optional Fields
Make fields optional when data might not be available:
from pydantic import BaseModel, Field
from typing import Optional

class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary: Optional[str] = Field(
        default=None,
        description="Salary range if available"
    )
    remote: Optional[bool] = Field(
        default=None,
        description="Whether job is remote"
    )
    description: str = Field(description="Job description")
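The effect of an optional field is easiest to see by validating data in which the field is missing. This is a small sketch, assuming Pydantic v2 and a shortened `JobPosting` model; the input dictionary is hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, Field


class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    salary: Optional[str] = Field(default=None, description="Salary range if available")


# The page did not list a salary, so the key is absent
posting = JobPosting.model_validate({"title": "Data Engineer"})

# Downstream code checks for None instead of catching a KeyError
salary_text = posting.salary if posting.salary is not None else "not listed"
print(posting.title, salary_text)
```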
Default Values
Provide defaults for fields:
from pydantic import BaseModel, Field

class Review(BaseModel):
    author: str = Field(description="Reviewer name")
    rating: int = Field(description="Rating from 1-5")
    comment: str = Field(description="Review text")
    verified: bool = Field(
        default=False,
        description="Whether purchase is verified"
    )
    helpful_count: int = Field(
        default=0,
        description="Number of helpful votes"
    )
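One thing to keep in mind with defaults is that a filled-in default looks identical to an extracted value. The sketch below, assuming Pydantic v2, uses `model_fields_set` to tell the two apart; the input data is hypothetical.

```python
from pydantic import BaseModel, Field


class Review(BaseModel):
    author: str = Field(description="Reviewer name")
    rating: int = Field(description="Rating from 1-5")
    comment: str = Field(description="Review text")
    verified: bool = Field(default=False, description="Whether purchase is verified")
    helpful_count: int = Field(default=0, description="Number of helpful votes")


# Hypothetical result that omits the two fields with defaults
review = Review.model_validate({"author": "Sam", "rating": 4, "comment": "Works well"})

# Defaults are applied for the missing keys
print(review.verified, review.helpful_count)

# model_fields_set lists only the fields that were actually provided
defaulted = set(Review.model_fields) - review.model_fields_set
print(sorted(defaulted))
```

If "not found" and "found, and the value is 0" must be distinguishable, an `Optional` field with `default=None` may be a better fit than a concrete default.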
Field Validation
Add custom validation rules:
from pydantic import BaseModel, Field, field_validator
from typing import List

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD", gt=0)  # Must be > 0
    rating: float = Field(
        description="Rating from 0-5",
        ge=0,  # Greater than or equal to 0
        le=5   # Less than or equal to 5
    )
    tags: List[str] = Field(description="Product tags")

    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('tags')
    @classmethod
    def tags_must_not_be_empty(cls, v):
        if not v:
            raise ValueError('At least one tag required')
        return v
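When a constraint is violated, Pydantic raises a `ValidationError` that lists every failing field. A minimal sketch, assuming Pydantic v2 and a reduced `Product` model with only the declarative constraints; `bad_result` is hypothetical data chosen to break two rules.

```python
from pydantic import BaseModel, Field, ValidationError


class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD", gt=0)
    rating: float = Field(description="Rating from 0-5", ge=0, le=5)


# Hypothetical extraction result with an invalid price and rating
bad_result = {"name": "Laptop", "price": -5, "rating": 7}

try:
    Product.model_validate(bad_result)
    failed_fields = []
except ValidationError as exc:
    # Each error dict has a "loc" tuple naming the offending field
    failed_fields = sorted(err["loc"][0] for err in exc.errors())

print(failed_fields)
```

Strict constraints are a trade-off: they catch bad extractions early, but they also reject pages where the data is legitimately unusual.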
Common Patterns
E-commerce Product
from pydantic import BaseModel, Field
from typing import List, Optional

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price in USD")
    original_price: Optional[float] = Field(
        default=None,
        description="Original price if discounted"
    )
    description: str = Field(description="Product description")
    images: List[str] = Field(description="List of product image URLs")
    available: bool = Field(description="Whether in stock")
    rating: Optional[float] = Field(
        default=None,
        description="Average rating 0-5"
    )
    review_count: Optional[int] = Field(
        default=0,
        description="Number of reviews"
    )

class Products(BaseModel):
    products: List[Product]
News Article
from pydantic import BaseModel, Field
from typing import List, Optional

class Article(BaseModel):
    title: str = Field(description="Article headline")
    author: str = Field(description="Author name")
    published_date: str = Field(description="Publication date")
    category: str = Field(description="Article category")
    summary: str = Field(description="Brief summary")
    content: str = Field(description="Full article text")
    tags: List[str] = Field(description="Article tags")
    image_url: Optional[str] = Field(
        default=None,
        description="Featured image URL"
    )

class Articles(BaseModel):
    articles: List[Article]
Job Listings
from pydantic import BaseModel, Field
from typing import List, Optional

class Job(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary_range: Optional[str] = Field(
        default=None,
        description="Salary range"
    )
    job_type: str = Field(description="Full-time, part-time, contract, etc.")
    remote: bool = Field(description="Whether job is remote")
    description: str = Field(description="Job description")
    requirements: List[str] = Field(description="Required qualifications")
    posted_date: str = Field(description="Date posted")

class JobListings(BaseModel):
    jobs: List[Job]
Restaurant Menu
from pydantic import BaseModel, Field
from typing import List, Optional

class MenuItem(BaseModel):
    name: str = Field(description="Dish name")
    description: str = Field(description="Dish description")
    price: float = Field(description="Price in USD")
    category: str = Field(description="Menu category (appetizer, entree, etc.)")
    dietary: Optional[List[str]] = Field(
        default=None,
        description="Dietary tags (vegetarian, vegan, gluten-free, etc.)"
    )

class Menu(BaseModel):
    restaurant_name: str = Field(description="Restaurant name")
    items: List[MenuItem] = Field(description="All menu items")
Schema with Search Graph
Schemas work with all graph types:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import SearchGraph

class Event(BaseModel):
    name: str = Field(description="Event name")
    date: str = Field(description="Event date")
    location: str = Field(description="Event location")
    description: str = Field(description="Event description")

class Events(BaseModel):
    events: List[Event]

search_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
    "verbose": True
}

search_graph = SearchGraph(
    prompt="Find upcoming tech conferences in 2024",
    config=search_config,
    schema=Events,
)

result = search_graph.run()
print(result)
Schema with JSON Sources
Schemas also structure data from JSON scraping:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import JSONScraperGraph

class User(BaseModel):
    id: int = Field(description="User ID")
    name: str = Field(description="Full name")
    email: str = Field(description="Email address")

class Users(BaseModel):
    users: List[User]

json_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
}

json_scraper = JSONScraperGraph(
    prompt="Extract all user information",
    source="users.json",
    config=json_config,
    schema=Users,
)

result = json_scraper.run()
Without Schemas
You can also scrape without defining a schema:
scraper = SmartScraperGraph(
    prompt="Extract product name, price, and description",
    source="https://example.com/product",
    config=graph_config,
    # No schema parameter
)

result = scraper.run()
# The LLM chooses the JSON structure itself
print(result)
Without a schema, output structure is less predictable. Schemas are recommended for production use.
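To illustrate what "less predictable" means in practice, the sketch below shows the kind of defensive code that schema-less results tend to require. The `pick` helper, the key names, and the sample `result` are all hypothetical; they are not part of ScrapeGraphAI.

```python
def pick(result, key, default=None):
    """Return result[key] if result is a dict that contains key, else default."""
    if isinstance(result, dict):
        return result.get(key, default)
    return default


# Hypothetical schema-less result: the key names and the price format
# were chosen by the LLM and may differ from run to run
result = {"product_name": "Laptop", "price": "$999.99"}

# Try several plausible key names, since none is guaranteed
name = pick(result, "name") or pick(result, "product_name")

# The price may arrive as a formatted string rather than a number
price_raw = pick(result, "price")
price = float(str(price_raw).lstrip("$")) if price_raw is not None else None

print(name, price)
```

With a schema, the key names and types are fixed up front, so this guesswork is not needed.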
Best Practices
Always Add Field Descriptions
# Good
name: str = Field(description="Product name as shown on the page")
# Bad
name: str
Descriptions guide the LLM on what to extract.
Use Descriptive Field Names
# Good
published_date: str = Field(description="Publication date")
author_name: str = Field(description="Author full name")
# Bad
dt: str
auth: str
Make Optional Fields Explicit
# Good
salary: Optional[str] = Field(default=None, description="Salary if listed")
# Bad (validation fails when no salary is found)
salary: str = Field(description="Salary")
Use Wrapper Classes for Lists
# Good
class Products(BaseModel):
    products: List[Product]
# Avoid returning List[Product] directly
Keep Schemas Focused
Don’t try to extract everything. Create focused schemas for specific data:
# Good: Focused on products
class Product(BaseModel):
    name: str
    price: float
    available: bool

# Bad: Too many unrelated fields
class Page(BaseModel):
    product_name: str
    product_price: float
    site_title: str
    footer_text: str
    ad_content: str
Troubleshooting
Schema Not Being Followed
Check field descriptions - Make them clear and specific
Simplify the schema - Start simple, add complexity gradually
Verify data exists - Ensure the data is on the page
Use verbose mode - See what’s being sent to the LLM
config = {
    "llm": { ... },
    "verbose": True  # Enable detailed logging
}
Missing Optional Fields
Ensure optional fields have defaults:
# Correct: the field may be omitted
rating: Optional[float] = Field(default=None, description="Rating")
# Still required (Pydantic v2): validation fails if the key is missing
rating: Optional[float] = Field(description="Rating")
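The difference between the two declarations can be checked directly. This sketch assumes Pydantic v2, where `Optional[...]` only allows `None` as a value and does not by itself make a field optional; the model names are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class WithDefault(BaseModel):
    rating: Optional[float] = Field(default=None, description="Rating")


class WithoutDefault(BaseModel):
    rating: Optional[float] = Field(description="Rating")


# Missing key is accepted when a default exists
ok = WithDefault.model_validate({})

# Missing key is rejected when there is no default
try:
    WithoutDefault.model_validate({})
    error_types = []
except ValidationError as exc:
    error_types = [err["type"] for err in exc.errors()]

# An explicit None is still a valid value for the field without a default
explicit_none = WithoutDefault.model_validate({"rating": None})

print(ok.rating, error_types, explicit_none.rating)
```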
Validation Errors
Check field constraints:
price: float = Field(description="Price", gt=0)  # Must be positive
rating: float = Field(description="Rating", ge=0, le=5)  # 0-5 range
Next Steps
Configuration - Learn about graph configuration
Examples - See complete schema examples
Pydantic Docs - Learn more about Pydantic
API Reference - View API documentation