
Overview

Skyvern can extract structured data from web pages during task execution. By providing a data_extraction_schema, you ensure that Skyvern returns data in a consistent, predictable format that matches your application’s requirements.

Basic Data Extraction

The simplest way to extract data is to name the fields you want in your prompt:
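import asyncio

from skyvern import Skyvern


async def main():
    skyvern = Skyvern(api_key="your-api-key")

    # Name the fields you want directly in the prompt
    task = await skyvern.run_task(
        prompt="Find the top post on hackernews today and extract its title, URL, and points",
        url="https://news.ycombinator.com",
        wait_for_completion=True
    )

    # Access extracted data
    print(task.output)


# `await` needs an async context; in a plain script, asyncio.run provides one
asyncio.run(main())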
However, a prompt alone doesn’t guarantee a consistent schema: nothing fixes the field names or types of the output. For production use cases, always specify a data_extraction_schema.

Using JSON Schema for Structured Extraction

Simple Schema Example

Define the exact structure you want using JSON Schema:
from skyvern import Skyvern

skyvern = Skyvern(api_key="your-api-key")

task = await skyvern.run_task(
    prompt="Find the top post on hackernews today",
    url="https://news.ycombinator.com",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "title": {
                "type": "string",
                "description": "The title of the top post"
            },
            "url": {
                "type": "string",
                "description": "The URL of the top post"
            },
            "points": {
                "type": "integer",
                "description": "Number of points the post has received"
            }
        }
    },
    wait_for_completion=True
)

# Extract data with consistent schema
data = task.output
print(f"Title: {data['title']}")
print(f"URL: {data['url']}")
print(f"Points: {data['points']}")
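
The schema above does not mark any property as required, so it seems safest to assume a field can be absent or null when the page does not show it. A minimal sketch of a defensive check, using a hard-coded sample in place of task.output (the sample values and the helper name `missing_fields` are hypothetical, not part of the Skyvern SDK):

```python
# Hypothetical sample standing in for task.output; the values are made up.
sample_output = {
    "title": "Example post",
    "url": "https://example.com/post",
    "points": None,
}

EXPECTED_FIELDS = ("title", "url", "points")


def missing_fields(data, expected=EXPECTED_FIELDS):
    """Return the expected field names that are absent or null in data."""
    if not isinstance(data, dict):
        # No usable output at all: every field counts as missing
        return list(expected)
    return [name for name in expected if data.get(name) is None]


missing = missing_fields(sample_output)
if missing:
    print(f"Missing fields: {missing}")
```

Running a check like this before indexing into the output avoids a KeyError on pages where a field could not be found.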

Complex Schema with Nested Objects

Extract more complex, nested data structures:
task = await skyvern.run_task(
    prompt="Extract product details from this page",
    url="https://example.com/product/12345",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "product_name": {
                "type": "string",
                "description": "Name of the product"
            },
            "price": {
                "type": "number",
                "description": "Price in USD"
            },
            "availability": {
                "type": "string",
                "description": "Stock status (in_stock, out_of_stock, limited)"
            },
            "specifications": {
                "type": "object",
                "properties": {
                    "weight": {"type": "string"},
                    "dimensions": {"type": "string"},
                    "color": {"type": "string"}
                }
            },
            "reviews": {
                "type": "object",
                "properties": {
                    "average_rating": {"type": "number"},
                    "total_reviews": {"type": "integer"}
                }
            }
        }
    },
    wait_for_completion=True
)
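
Reading nested fields from a result like this can fail if an inner object is missing or null. The sketch below shows one way to walk the structure safely; `get_nested` is a hypothetical helper and the sample dict stands in for task.output with made-up values:

```python
def get_nested(data, *keys, default=None):
    """Walk nested dicts key by key, returning default if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical sample shaped like the schema above; "reviews" came back null
sample_output = {
    "product_name": "Example Widget",
    "price": 19.99,
    "availability": "in_stock",
    "specifications": {"weight": "1.2 kg", "color": "black"},
    "reviews": None,
}

color = get_nested(sample_output, "specifications", "color")
rating = get_nested(sample_output, "reviews", "average_rating")
dimensions = get_nested(sample_output, "specifications", "dimensions", default="unknown")
print(color, rating, dimensions)
```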

Extracting Arrays of Data

Extract lists of items from a page:
task = await skyvern.run_task(
    prompt="Extract all products from the search results",
    url="https://example.com/search?q=laptop",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "description": "List of all products on the page",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                        "rating": {"type": "number"},
                        "url": {"type": "string"}
                    }
                }
            },
            "total_results": {
                "type": "integer",
                "description": "Total number of results found"
            }
        }
    },
    wait_for_completion=True
)

# Process extracted products
for product in task.output['products']:
    print(f"{product['name']}: ${product['price']}")
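
Once a list of products has been extracted, it is often exported for further analysis. A minimal sketch that writes such a list to CSV using only the standard library; the sample rows are made up and `products_to_csv` is a hypothetical helper:

```python
import csv
import io

# Hypothetical rows standing in for task.output['products']
sample_products = [
    {"name": "Laptop A", "price": 999.0, "rating": 4.5, "url": "https://example.com/a"},
    {"name": "Laptop B", "price": None, "rating": 4.1, "url": "https://example.com/b"},
]


def products_to_csv(products):
    """Serialize product dicts to CSV text; null values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["name", "price", "rating", "url"],
        extrasaction="ignore",  # skip any keys outside the schema
    )
    writer.writeheader()
    for product in products:
        writer.writerow(product)
    return buffer.getvalue()


print(products_to_csv(sample_products))
```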

Real-World Examples

Example 1: E-Commerce Price Extraction

task = await skyvern.run_task(
    prompt="Get the current price and availability for this product",
    url="https://www.example-store.com/products/laptop-abc123",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "product_id": {"type": "string"},
            "current_price": {
                "type": "number",
                "description": "Current price in USD"
            },
            "original_price": {
                "type": "number",
                "description": "Original price before discount"
            },
            "discount_percentage": {"type": "number"},
            "in_stock": {"type": "boolean"},
            "shipping_cost": {"type": "number"},
            "estimated_delivery": {"type": "string"}
        }
    },
    wait_for_completion=True
)
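
Because this schema extracts current_price, original_price, and discount_percentage together, the discount can be recomputed as (original - current) / original × 100 and compared with the extracted value as a sanity check. A sketch under that assumption; the helper names, the tolerance, and the sample values are hypothetical:

```python
def computed_discount_percentage(original_price, current_price):
    """Recompute the discount from the two prices, or None if it cannot be computed."""
    if not original_price or current_price is None:
        return None
    return round((original_price - current_price) / original_price * 100, 2)


def discount_is_consistent(output, tolerance=1.0):
    """Compare the extracted discount with the recomputed one.

    Returns None when either value is unavailable.
    """
    expected = computed_discount_percentage(
        output.get("original_price"), output.get("current_price")
    )
    reported = output.get("discount_percentage")
    if expected is None or reported is None:
        return None
    return abs(expected - reported) <= tolerance


# Hypothetical sample standing in for task.output
sample_output = {"current_price": 850.0, "original_price": 1000.0, "discount_percentage": 15}
print(discount_is_consistent(sample_output))
```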

Example 2: Invoice Data Extraction

task = await skyvern.run_task(
    prompt="Extract invoice details from this page",
    url="https://vendor-portal.com/invoice/INV-2024-001",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "invoice_date": {"type": "string"},
            "due_date": {"type": "string"},
            "total_amount": {"type": "number"},
            "currency": {"type": "string"},
            "vendor_name": {"type": "string"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"}
                    }
                }
            },
            "payment_terms": {"type": "string"},
            "status": {"type": "string"}
        }
    },
    wait_for_completion=True
)
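
With both line_items and total_amount extracted, the two can be cross-checked. The sketch below sums quantity × unit_price using Decimal to avoid floating-point drift. It is only an illustration: real invoices may add tax or shipping, so a mismatch does not necessarily mean the extraction was wrong. The helper names and sample values are hypothetical:

```python
from decimal import Decimal


def line_items_total(invoice):
    """Sum quantity * unit_price over all line items as a Decimal."""
    total = Decimal("0")
    for item in invoice.get("line_items") or []:
        total += Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
    return total


def totals_match(invoice, tolerance=Decimal("0.01")):
    """Check whether the line items add up to the reported total_amount."""
    reported = Decimal(str(invoice["total_amount"]))
    return abs(line_items_total(invoice) - reported) <= tolerance


# Hypothetical sample standing in for task.output
sample_invoice = {
    "invoice_number": "INV-2024-001",
    "total_amount": 44.98,
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 19.99, "total": 39.98},
        {"description": "Shipping label", "quantity": 1, "unit_price": 5.0, "total": 5.0},
    ],
}
print(line_items_total(sample_invoice), totals_match(sample_invoice))
```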

Example 3: Job Listing Extraction

task = await skyvern.run_task(
    prompt="Extract all job listings from this page",
    url="https://careers.example.com/jobs",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "jobs": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "department": {"type": "string"},
                        "location": {"type": "string"},
                        "employment_type": {"type": "string"},
                        "experience_level": {"type": "string"},
                        "salary_range": {"type": "string"},
                        "posted_date": {"type": "string"},
                        "application_url": {"type": "string"}
                    }
                }
            }
        }
    },
    wait_for_completion=True
)
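
A flat list of jobs is often easier to work with once grouped. A small sketch that groups extracted job titles by department; `group_by_department` and the "Unspecified" fallback label are hypothetical choices, and the sample rows are made up:

```python
from collections import defaultdict


def group_by_department(jobs):
    """Map each department to the list of job titles found in it."""
    grouped = defaultdict(list)
    for job in jobs:
        # Fall back to a placeholder when the department is missing or null
        department = job.get("department") or "Unspecified"
        grouped[department].append(job.get("title"))
    return dict(grouped)


# Hypothetical rows standing in for task.output['jobs']
sample_jobs = [
    {"title": "Backend Engineer", "department": "Engineering"},
    {"title": "Data Analyst", "department": None},
    {"title": "Frontend Engineer", "department": "Engineering"},
]
print(group_by_department(sample_jobs))
```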

Example 4: Insurance Quote Extraction

From the README, here’s a real example of extracting insurance quote data:
task = await skyvern.run_task(
    prompt="Get a quote for car insurance",
    url="https://www.geico.com",
    data_extraction_schema={
        "type": "object",
        "properties": {
            "quote_amount": {
                "type": "number",
                "description": "The quoted premium amount"
            },
            "coverage_type": {
                "type": "string",
                "description": "Type of coverage (liability, comprehensive, collision)"
            },
            "deductible": {
                "type": "number",
                "description": "Deductible amount"
            },
            "policy_term": {
                "type": "string",
                "description": "Policy term (6 months, 1 year)"
            },
            "quote_id": {
                "type": "string",
                "description": "Reference ID for the quote"
            }
        }
    },
    wait_for_completion=True
)

Data Types Supported

Skyvern’s data extraction supports all standard JSON Schema types:
| Type | Description | Example |
| --- | --- | --- |
| string | Text data | "Hello World" |
| integer | Whole numbers | 42 |
| number | Floating point numbers | 3.14 |
| boolean | True/false values | true |
| array | Lists of items | [1, 2, 3] |
| object | Nested structures | {"key": "value"} |
| null | Null values | null |
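
Assuming the extracted output arrives as plain Python values parsed from JSON, the types above map onto Python types roughly as sketched below. This is a simplified check, not a full JSON Schema validator, and `matches_type` is a hypothetical helper. One detail worth noting is that Python treats bool as a subclass of int, so booleans are excluded from the numeric types explicitly:

```python
# Rough mapping from JSON Schema type names to Python types
JSON_TO_PYTHON = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
    "null": type(None),
}


def matches_type(value, json_type):
    """Check a value against a type name or a list of names like ["number", "null"]."""
    names = [json_type] if isinstance(json_type, str) else list(json_type)
    for name in names:
        # True/False would otherwise pass as integers in Python
        if name in ("integer", "number") and isinstance(value, bool):
            continue
        if isinstance(value, JSON_TO_PYTHON[name]):
            return True
    return False


print(matches_type(42, "integer"), matches_type(True, "integer"))
```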

Best Practices

1. Always Include Descriptions

Provide clear descriptions for each field to help Skyvern understand what to extract:
{
    "title": {
        "type": "string",
        "description": "The main heading or title of the article"  # Good
    }
}
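
To apply this practice across a larger schema, a small lint step can list the fields that still lack a description. The sketch below walks nested objects and array items; `fields_without_description` and the sample schema are hypothetical and only cover the schema keywords used in this guide:

```python
def fields_without_description(schema, prefix=""):
    """Return dotted paths of properties that have no description."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if not spec.get("description"):
            missing.append(path)
        if spec.get("type") == "object":
            # Recurse into nested objects
            missing.extend(fields_without_description(spec, prefix=f"{path}."))
        elif spec.get("type") == "array" and isinstance(spec.get("items"), dict):
            # Recurse into the item schema of arrays
            missing.extend(fields_without_description(spec["items"], prefix=f"{path}[]."))
    return missing


sample_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "The main heading of the article"},
        "products": {
            "type": "array",
            "description": "List of all products on the page",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number", "description": "Price in USD"},
                },
            },
        },
    },
}
print(fields_without_description(sample_schema))
```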

2. Use Specific Field Names

Use descriptive, unambiguous field names:
# Good
"product_price_usd": {"type": "number"}

# Less clear
"price": {"type": "number"}

3. Validate Data Types

Use appropriate data types to ensure correct parsing:
# Correct
"quantity": {"type": "integer"}  # For whole numbers
"price": {"type": "number"}      # For decimals
"in_stock": {"type": "boolean"}  # For yes/no

4. Handle Missing Data

Plan for fields that might not always be present:
{
    "type": "object",
    "properties": {
        "price": {"type": "number"},
        "discount_price": {
            "type": ["number", "null"],  # Allow null if no discount
            "description": "Discounted price, null if not on sale"
        }
    }
}
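
On the consuming side, a nullable field like discount_price needs an explicit None check. A minimal sketch, where `effective_price` is a hypothetical helper and the sample dicts stand in for task.output:

```python
def effective_price(output):
    """Prefer the discounted price when present, otherwise fall back to price."""
    discount = output.get("discount_price")
    # Compare against None explicitly so a legitimate 0 is not treated as missing
    return discount if discount is not None else output.get("price")


on_sale = {"price": 100.0, "discount_price": 80.0}
not_on_sale = {"price": 100.0, "discount_price": None}
print(effective_price(on_sale), effective_price(not_on_sale))
```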

5. Keep Schemas Focused

Extract only what you need. Overly complex schemas can reduce accuracy:
# Good - focused extraction
data_extraction_schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number"},
        "availability": {"type": "string"}
    }
}

# Avoid - too many fields may reduce accuracy
data_extraction_schema = {
    "type": "object",
    "properties": {
        # 20+ fields...
    }
}

Accessing Extracted Data

Python SDK

task = await skyvern.run_task(
    prompt="Extract product data",
    data_extraction_schema={...},
    wait_for_completion=True
)

# Access the extracted data
if task.output:
    price = task.output.get('price')
    title = task.output.get('title')
    print(f"{title}: ${price}")
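
For larger applications it can help to convert the output dict into a typed object at the boundary. The sketch below assumes task.output is either a dict or None, as the check above suggests; the `Product` dataclass is hypothetical and not part of the Skyvern SDK:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    title: Optional[str]
    price: Optional[float]

    @classmethod
    def from_output(cls, output):
        """Build a Product from an output dict, tolerating None and missing keys."""
        output = output or {}
        price = output.get("price")
        return cls(
            title=output.get("title"),
            price=float(price) if price is not None else None,
        )


# Hypothetical sample standing in for task.output
product = Product.from_output({"title": "Example Product", "price": 29.99})
print(product)
```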

TypeScript SDK

const task = await skyvern.runTask({
  prompt: "Extract product data",
  dataExtractionSchema: {...}
});

if (task.output) {
  console.log(`Price: ${task.output.price}`);
  console.log(`Title: ${task.output.title}`);
}

REST API

After creating a task, poll for completion and retrieve the output:
# Get task results
curl -X GET "https://api.skyvern.com/api/v1/runs/tsk_123456" \
  -H "Authorization: Bearer YOUR_API_KEY"
Response:
{
  "run_id": "tsk_123456",
  "status": "completed",
  "output": {
    "title": "Example Product",
    "price": 29.99,
    "availability": "in_stock"
  }
}
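
The polling step can be sketched as a small loop. To keep the example independent of any HTTP client, the function below takes a `fetch_run` callable that is expected to return the parsed JSON response shown above. Only the "completed" status appears in this guide; the other terminal statuses listed are assumptions, as are the helper name and its defaults:

```python
import time

# "completed" comes from the response above; the remaining names are assumptions
TERMINAL_STATUSES = {"completed", "failed", "terminated", "canceled", "timed_out"}


def poll_until_done(fetch_run, run_id, interval_seconds=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_run(run_id) until the run reaches a terminal status."""
    for attempt in range(max_attempts):
        run = fetch_run(run_id)
        if run.get("status") in TERMINAL_STATUSES:
            return run
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Run {run_id} did not finish after {max_attempts} attempts")


# Usage with a fake fetcher that finishes on the second call
fake_responses = iter([
    {"run_id": "tsk_123456", "status": "running"},
    {"run_id": "tsk_123456", "status": "completed", "output": {"price": 29.99}},
])
result = poll_until_done(lambda run_id: next(fake_responses), "tsk_123456", sleep=lambda s: None)
print(result["output"])
```

In real use, `fetch_run` would wrap the GET request shown above and return its JSON body.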

Next Steps

Task Parameters

Learn about all available task configuration options

Monitoring Runs

Monitor task execution and view results
