Common Issues
Installation Problems
Playwright Not Installed
Error:
PlaywrightError: Executable doesn't exist at /path/to/playwright
Solution:
# Install Playwright browsers after installing scrapegraphai
pip install scrapegraphai
playwright install
# Or install only the Chromium browser
playwright install chromium
Missing Dependencies
Error:
ImportError: cannot import name 'ChatOpenAI' from 'langchain_openai'
Solution:
# Install in a fresh virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install --upgrade scrapegraphai
Graph Execution Errors
Empty Results
Problem: Graph executes but returns empty or null results.
Check Input Keys
Ensure the initial state contains all required keys:
# Incorrect - missing user_prompt
result = graph.execute({"url": "https://example.com"})
# Correct
result = graph.execute({
    "user_prompt": "Extract product data",
    "url": "https://example.com",
})
Enable Verbose Mode
graph_config = {
    "llm": {...},
    "verbose": True,  # Enable detailed logging
}
Check Node Outputs
Inspect execution info:
state, execution_info = graph.execute(initial_state)
for node_info in execution_info:
    print(f"Node: {node_info['node_name']}")
    print(f"Execution time: {node_info['exec_time']}s")
    print(f"Tokens used: {node_info['total_tokens']}")
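Once the per-node records are available, a short summary can help spot the slow or token-heavy node. The sketch below assumes each record is a dict with the node_name, exec_time, and total_tokens keys printed above. summarize_execution and the sample records are hypothetical and are not part of the library. If your execution_info already ends with an aggregate entry, drop it before summing.

```python
def summarize_execution(execution_info):
    """Return total time, total tokens, and the slowest node name."""
    total_time = sum(info.get("exec_time", 0.0) for info in execution_info)
    total_tokens = sum(info.get("total_tokens", 0) for info in execution_info)
    slowest = max(
        execution_info,
        key=lambda info: info.get("exec_time", 0.0),
        default=None,  # empty input: no slowest node
    )
    return {
        "total_time": total_time,
        "total_tokens": total_tokens,
        "slowest_node": slowest["node_name"] if slowest else None,
    }


# Made-up records, shaped like the ones printed above
sample_info = [
    {"node_name": "Fetch", "exec_time": 1.5, "total_tokens": 0},
    {"node_name": "Parse", "exec_time": 0.25, "total_tokens": 0},
    {"node_name": "GenerateAnswer", "exec_time": 3.0, "total_tokens": 1200},
]
summary = summarize_execution(sample_info)
print(summary)
```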
Node Execution Failures
Error:
ValueError: No state keys matched the expression
Solution:
Check that your node’s input expression matches available state keys:
# If state = {"url": "...", "user_prompt": "..."}
# This will fail
node = FetchNode(
    input="source",  # 'source' not in state
    output=["doc"],
)
# This works
node = FetchNode(
    input="url",  # 'url' is in state
    output=["doc"],
)
LLM Issues
API Key Errors
Error:
AuthenticationError: Invalid API key provided
Solution:
Verify API Key
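import os
api_key = os.getenv("OPENAI_API_KEY")
# Print only a short prefix, and handle the case where the variable is unset
print(f"API Key: {api_key[:10]}..." if api_key else "OPENAI_API_KEY is not set")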
Use Environment Variables
# .env file
OPENAI_API_KEY=sk-proj-...
# In your code
import os
from dotenv import load_dotenv
load_dotenv()  # Reads variables from .env into the environment
graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
}
Test API Connection
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY"),
)
response = llm.invoke("Hello")
print(response.content)  # Should work if API key is valid
Rate Limits
Error:
RateLimitError: You exceeded your current quota
Solution:
from scrapegraphai.graphs import SmartScraperGraph
from tenacity import retry, stop_after_attempt, wait_exponential

# Retry up to 3 times, with exponential backoff clamped between 4 and 10 seconds
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)
def run_scraper(url, prompt):
    scraper = SmartScraperGraph(
        prompt=prompt,
        source=url,
        config=graph_config,
    )
    return scraper.run()

# Use with retry logic
result = run_scraper("https://example.com", "Extract data")
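If you would rather not add a dependency, the same idea can be sketched with the standard library alone. This is a minimal illustration, not the library's mechanism: retry_with_backoff and flaky are hypothetical names, and the delays mirror the 4 to 10 second window used above. In real code, catch only your provider's rate-limit exception instead of every Exception.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=4.0, max_delay=10.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the wait after each failure, capped at max_delay
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))


# Demo with a stand-in function that fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []
outcome = retry_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(outcome, waits)
```

Passing the sleep function as a parameter keeps the helper easy to test without real delays.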
Token Limit Exceeded
Error:
InvalidRequestError: This model's maximum context length is 8192 tokens
Solution:
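One way to stay under a model's context window is to split the text before it reaches the model, or to lower the chunk size used when parsing. The sketch below is a rough illustration that assumes about four characters per token, which is only a heuristic. split_by_token_budget is a hypothetical helper, not part of ScrapeGraphAI, and a real tokenizer would give more accurate counts.

```python
def split_by_token_budget(text, max_tokens=2048, chars_per_token=4):
    """Split text on whitespace into pieces that fit an approximate token budget."""
    max_chars = max_tokens * chars_per_token
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # +1 accounts for the space that joins this word to the previous one
        added = len(word) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(word)
        # Note: a single word longer than the budget is kept whole
        current.append(word)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


document = "word " * 5000  # stand-in for a long scraped page
chunks = split_by_token_budget(document, max_tokens=1000)
print(len(chunks), max(len(chunk) for chunk in chunks))
```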
Scraping Issues
Timeout Errors
Error:
TimeoutError: Page load exceeded timeout of 30 seconds
Solution:
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 60,  # Increase timeout to 60 seconds
        "headless": True,
        "loader_kwargs": {
            "wait_until": "networkidle",  # Wait for network to be idle
        },
    },
)
JavaScript-Heavy Sites
Problem: Content not loading because JavaScript isn’t executed.
Solution:
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "headless": False,  # Use headed browser for debugging
        "loader_kwargs": {
            "wait_until": "networkidle",
            "timeout": 30000,  # 30 seconds in milliseconds
        },
    },
)
Anti-Scraping Measures
Problem: Website blocks or detects the scraper.
Solution:
Check robots.txt
from scrapegraphai.nodes import RobotsNode
robot_node = RobotsNode(
    input="url",
    output=["is_scrapable"],
    node_config={
        "llm_model": llm_model,
        "force_scraping": False,  # Respect robots.txt
    },
)
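For a quick check outside a graph, Python's standard urllib.robotparser can answer the same question. The sketch below is independent of RobotsNode and parses robots.txt content that is already in memory. is_allowed and the sample rules are hypothetical. Against a live site you would call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt content permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content for illustration
example_robots = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(example_robots, "https://example.com/products"))
print(is_allowed(example_robots, "https://example.com/private/data"))
```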
Add Delays
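import time
from scrapegraphai.graphs import SmartScraperGraph

urls = ["url1", "url2", "url3"]
for url in urls:
    # Build a scraper for each URL so the loop actually visits every page
    scraper = SmartScraperGraph(prompt="Extract data", source=url, config=graph_config)
    result = scraper.run()
    time.sleep(2)  # 2-second delay between requests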
Use Browser Profiles
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "storage_state": "./browser_state.json",  # Persist cookies/auth
    },
)
Custom Node Issues
Error:
ValueError: Adjacent state keys found without an operator between them
Solution:
# Incorrect - missing operator
input = "url user_prompt"
# Correct - use & or |
input = "url & user_prompt" # Both required
input = "url | user_prompt" # Either one
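To make the meaning of the two operators concrete, here is a simplified illustration of how such an expression could be resolved against a state dict. It is not the library's parser: select_keys is a hypothetical helper, it ignores parentheses, and it assumes that | picks the first alternative whose keys are all present.

```python
def select_keys(expression, state):
    """Return the keys from the first alternative fully present in state."""
    for alternative in expression.split("|"):
        keys = [key.strip() for key in alternative.split("&")]
        # Two names separated only by whitespace means an operator is missing
        if any(" " in key or not key for key in keys):
            raise ValueError(f"Missing operator in expression: {expression!r}")
        if all(key in state for key in keys):
            return keys
    raise ValueError(f"No state keys matched the expression: {expression!r}")


state = {"url": "https://example.com", "user_prompt": "Extract data"}
print(select_keys("url & user_prompt", state))
print(select_keys("local_dir | url", state))
```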
Output Not Updating State
Problem: Node executes but state doesn’t contain expected keys.
Solution:
class MyCustomNode(BaseNode):
    def execute(self, state: dict) -> dict:
        # Process data
        result = self._process(state)
        # CRITICAL: Update state with output keys
        state.update({self.output[0]: result})
        # Return modified state
        return state
Debugging Techniques
Enable Logging
import logging
from scrapegraphai.utils.logging import get_logger
# Set log level
logger = get_logger()
logger.setLevel(logging.DEBUG)
# Add console handler
handler = logging.StreamHandler()
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)
Inspect State at Each Node
class DebugNode(BaseNode):
    """Debug node to inspect state."""

    def __init__(self, input, output, node_config=None, node_name="Debug"):
        super().__init__(node_name, "node", input, output, 0, node_config)

    def execute(self, state: dict) -> dict:
        print("\n=== State Debug ===")
        for key, value in state.items():
            print(f"{key}: {type(value)} - {str(value)[:100]}...")
        print("==================\n")
        return state

# Insert debug node between nodes
debug_node = DebugNode(input="doc", output=["doc"])  # Passes state through unchanged
graph = BaseGraph(
    nodes=[fetch_node, debug_node, parse_node],
    edges=[
        (fetch_node, debug_node),
        (debug_node, parse_node),
    ],
    entry_point=fetch_node,
)
Test Nodes in Isolation
def test_fetch_node():
    """Test FetchNode independently."""
    fetch_node = FetchNode(
        input="url",
        output=["doc"],
        node_config={"verbose": True},
    )
    state = {"url": "https://example.com"}
    result = fetch_node.execute(state)
    assert "doc" in result
    assert len(result["doc"]) > 0
    print("FetchNode test passed")

test_fetch_node()
Use Try-Except Blocks
import traceback

try:
    result, execution_info = graph.execute(initial_state)
except Exception as e:
    print(f"Error type: {type(e).__name__}")
    print(f"Error message: {e}")
    traceback.print_exc()
    # Inspect state at failure
    print("\nCurrent state:")
    print(initial_state)
Slow Execution
1. Use Faster Models
graph_config = {
    "llm": {
        "model": "openai/gpt-3.5-turbo",  # Faster than GPT-4
    },
}
2. Reduce Chunk Size
parse_node = ParseNode(
    input="doc",
    output=["parsed_doc"],
    node_config={
        "chunk_size": 2048,  # Smaller = faster
    },
)
3. Skip Unnecessary Nodes
# If you don't need RAG, remove it
graph = BaseGraph(
    nodes=[fetch_node, parse_node, generate_node],  # Skip RAG
    edges=[
        (fetch_node, parse_node),
        (parse_node, generate_node),  # Direct connection
    ],
    entry_point=fetch_node,
)
4. Parallelize Multiple Scrapes
from concurrent.futures import ThreadPoolExecutor

def scrape_url(url):
    scraper = SmartScraperGraph(
        prompt="Extract data",
        source=url,
        config=graph_config,
    )
    return scraper.run()

urls = ["url1", "url2", "url3"]
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(scrape_url, urls))
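Note that executor.map re-raises the first exception when its results are consumed, so a single failing URL can hide the results of the others. A possible variant, sketched here with a stand-in scrape function instead of a real scraper, collects successes and failures separately. scrape_all and fake_scrape are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, scrape, max_workers=3):
    """Run scrape(url) for every URL; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going when one URL fails
                errors[url] = exc
    return results, errors


# Stand-in for a function that builds and runs a scraper for one URL
def fake_scrape(url):
    if "bad" in url:
        raise ValueError(f"could not scrape {url}")
    return {"source": url}


results, errors = scrape_all(["url1", "bad-url", "url3"], fake_scrape)
print(sorted(results), sorted(errors))
```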
Memory Issues
Problem: High memory usage with large documents.
Solution:
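import gc
from scrapegraphai.graphs import SmartScraperGraph

for url in urls:
    # Create the scraper inside the loop so each URL is scraped and the
    # previous graph can be released
    scraper = SmartScraperGraph(prompt="Extract data", source=url, config=graph_config)
    result = scraper.run()
    # Process result
    # Force garbage collection
    gc.collect()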
Getting Help
Before Asking for Help
Minimal Reproducible Example
Create a minimal script that reproduces the issue:
# Minimal reproduction
from scrapegraphai.graphs import SmartScraperGraph
scraper = SmartScraperGraph(
    prompt="Extract title",
    source="https://example.com",
    config={
        "llm": {"model": "openai/gpt-4o"},
        "verbose": True,
    },
)
result = scraper.run()
print(result)
Gather Information
Include:
ScrapeGraphAI version: pip show scrapegraphai
Python version: python --version
Operating system
Complete error message and stack trace
Code that reproduces the issue
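The version details in the list above can also be collected with a short script and pasted into a report. This is a minimal sketch using only the standard library. collect_environment is a hypothetical helper, and the default package names are an assumption about what is relevant to your setup.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def collect_environment(packages=("scrapegraphai", "playwright")):
    """Gather version details that are useful in a bug report."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in collect_environment().items():
    print(f"{key}: {value}")
```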
FAQ
Why is my scraper returning null or empty results?
Check that:
Your prompt is clear and specific
The URL is accessible and contains the expected content
JavaScript has time to load (increase timeout)
You’re using verbose: True to see what’s happening
How do I scrape JavaScript-heavy websites?
Use the FetchNode with appropriate wait conditions:
node_config = {
    "loader_kwargs": {
        "wait_until": "networkidle",
        "timeout": 30000,
    },
}
Can I use local LLMs instead of OpenAI?
Yes! Use Ollama or other local models:
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "base_url": "http://localhost:11434",
    },
}
How do I handle CAPTCHAs?
CAPTCHAs typically require manual solving. Consider:
Using authenticated sessions (cookies)
Using the storage_state option to persist auth
Third-party CAPTCHA solving services
Checking if the site offers an API
My graph is slow. How can I speed it up?
Use faster models (gpt-3.5-turbo vs gpt-4)
Reduce chunk sizes
Remove unnecessary nodes
Parallelize multiple scrapes
Use caching for repeated scrapes
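The caching suggestion in the list above could look like the following sketch, which stores each result on disk keyed by a hash of the URL and prompt. It assumes the result is JSON-serializable. cached_scrape and fake_scrape are hypothetical names, and the stand-in function takes the place of a real scraper run.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_scrape(url, prompt, scrape, cache_dir):
    """Return a cached result for (url, prompt), calling scrape only on a miss."""
    key = hashlib.sha256(f"{url}\n{prompt}".encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = scrape(url, prompt)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


# Stand-in scraper that counts how often it is really called
call_count = {"n": 0}


def fake_scrape(url, prompt):
    call_count["n"] += 1
    return {"title": "Example Domain"}


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_scrape("https://example.com", "Extract title", fake_scrape, cache_dir)
    second = cached_scrape("https://example.com", "Extract title", fake_scrape, cache_dir)

print(first == second, call_count["n"])
```

A cache like this never expires on its own, so clear the directory when the target pages change.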