The Extract API pulls specific structured data out of documents, guided by either a schema or natural language instructions.

Basic Usage

from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/invoice.pdf",
    instructions={
        "schema": {
            "invoice_number": "string",
            "date": "string",
            "total_amount": "number",
            "line_items": "array"
        }
    }
)
print(response)

Method Signature

client.extract.run(
    input: str | list[str],
    instructions: dict | None = None,
    parsing: ParseOptions | None = None,
    settings: dict | None = None,
    async_: ConfigV3AsyncConfig | None = None
) -> ExtractRunResponse

Parameters

input
string or list of strings
required
The document (or documents) to extract from, given by URL. You can provide:
  • A publicly available URL
  • A presigned S3 URL
  • A reducto:// prefixed URL from the /upload endpoint
  • A jobid:// prefixed URL from a previous parse invocation
  • A list of URLs (for multi-document pipelines, V3 API only)
instructions
object
Instructions for data extraction. Can be either:
  • A schema object defining the structure to extract, passed under the "schema" key
  • Natural language instructions describing what to extract, passed under the "prompt" key
parsing
ParseOptions
Configuration options for parsing the document. Ignored when input is a jobid:// URL, because that document has already been parsed.
settings
object
Settings to control the extraction process.
async_
ConfigV3AsyncConfig
Configuration for asynchronous processing. When provided, the call returns immediately with a job ID instead of waiting for the extraction to finish.
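
The accepted forms of input listed above can be checked locally before a request is sent. The following is a minimal sketch, not part of the Reducto SDK: classify_input and its return labels are hypothetical names, and the only rules it encodes are the URL prefixes named in the list.

```python
from urllib.parse import urlparse


def classify_input(value):
    """Label an `input` value by the forms listed above, or raise an error.

    Hypothetical helper for illustration; the SDK does not provide it.
    """
    if isinstance(value, list):
        # Multi-document input: every entry must be a valid single URL.
        if not value:
            raise ValueError("input list is empty")
        for item in value:
            if isinstance(item, list):
                raise ValueError("nested lists are not supported")
            classify_input(item)
        return "url_list"
    if not isinstance(value, str):
        raise TypeError(f"input must be a str or list, got {type(value).__name__}")

    scheme = urlparse(value).scheme.lower()
    if scheme == "reducto":
        return "upload"      # returned by the /upload endpoint
    if scheme == "jobid":
        return "parse_job"   # refers to a previous parse invocation
    if scheme in ("http", "https"):
        return "url"         # public or presigned URL
    raise ValueError(f"unsupported input: {value!r}")
```

A "parse_job" result is a useful signal that any parsing options passed alongside it will have no effect.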

Schema-Based Extraction

Define a schema to extract specific fields:

from reducto import Reducto

client = Reducto()

# Define extraction schema
schema = {
    "company_name": "string",
    "revenue": "number",
    "employees": "number",
    "founded_date": "string",
    "headquarters": {
        "city": "string",
        "country": "string"
    }
}

response = client.extract.run(
    input="https://example.com/company-report.pdf",
    instructions={"schema": schema}
)

# Access extracted data
print(response.data)
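
Extracted values are worth checking against the schema before they are used downstream. The sketch below assumes the extracted payload is available as a plain dict shaped like the schema, and that "string", "number", and "array" are the only type names in play, as in the examples on this page. find_mismatches is a hypothetical helper, not an SDK function.

```python
# Type names as used in the schemas on this page (an assumption, not an exhaustive list).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
}


def find_mismatches(schema, data, prefix=""):
    """Return a list of dotted paths that are missing or have the wrong type."""
    problems = []
    for key, expected in schema.items():
        path = f"{prefix}{key}"
        if not isinstance(data, dict) or key not in data:
            problems.append(f"{path}: missing")
        elif isinstance(expected, dict):
            # Nested object: recurse with the path extended.
            problems.extend(find_mismatches(expected, data[key], prefix=f"{path}."))
        elif expected not in TYPE_CHECKS:
            problems.append(f"{path}: unknown schema type {expected!r}")
        elif not TYPE_CHECKS[expected](data[key]):
            actual = type(data[key]).__name__
            problems.append(f"{path}: expected {expected}, got {actual}")
    return problems
```

An empty list means every field in the schema was present with the expected type.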

Natural Language Instructions

Use natural language to describe what to extract:

from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/contract.pdf",
    instructions={
        "prompt": "Extract all parties involved in the contract, the contract start date, end date, and key obligations for each party."
    }
)

print(response.data)
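
Since instructions takes either a schema or a prompt, a small builder can keep call sites consistent. This is a sketch under assumptions: build_instructions is a hypothetical helper, the "schema" and "prompt" keys are taken from the examples on this page, and because the page does not say whether both keys may be combined, the helper accepts exactly one.

```python
def build_instructions(schema=None, prompt=None):
    """Build the `instructions` dict from exactly one of schema or prompt.

    Hypothetical helper; key names follow the examples on this page.
    """
    if (schema is None) == (prompt is None):
        raise ValueError("pass exactly one of schema or prompt")

    if schema is not None:
        if not isinstance(schema, dict) or not schema:
            raise ValueError("schema must be a non-empty dict")
        return {"schema": schema}

    # Collapse newlines and repeated spaces so multi-line prompts stay readable.
    text = " ".join(prompt.split())
    if not text:
        raise ValueError("prompt must not be blank")
    return {"prompt": text}
```

The result can be passed directly as the instructions argument of client.extract.run.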

Extract with Custom Parsing

Combine extraction with custom parsing options:

from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/document.pdf",
    instructions={
        "schema": {
            "section_titles": "array",
            "key_figures": "array"
        }
    },
    parsing={
        "enhance": {
            "summarize_figures": True
        },
        "formatting": {
            "add_page_markers": True
        }
    }
)

Async Job Processing

For large documents or batch processing, use async jobs:

from reducto import Reducto

client = Reducto()

# Start an async extraction job
job = client.extract.run_job(
    input="https://example.com/large-document.pdf",
    instructions={
        "schema": {
            "field1": "string",
            "field2": "number"
        }
    },
    async_={
        "webhook": {"url": "https://example.com/webhook"}
    }
)

print(f"Job ID: {job.job_id}")

# Fetch the job's current state; repeat this call until the job has finished
result = client.job.get(job.job_id)
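
A single call to client.job.get returns only the job's state at that moment, so polling needs a loop with a delay and a timeout. The sketch below is one way to write it and rests on assumptions: wait_for_job is a hypothetical helper, the job object is assumed to expose a status attribute, and the terminal values "Completed" and "Failed" are guesses that should be checked against the SDK's job response type.

```python
import time

# Assumed terminal status values; verify against the SDK before relying on them.
TERMINAL_STATUSES = frozenset({"Completed", "Failed"})


def wait_for_job(fetch, job_id, interval=2.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch(job_id) until it reports a terminal status or time runs out.

    `fetch` is any callable that returns an object with a `status` attribute,
    for example client.job.get. `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        result = fetch(job_id)
        if getattr(result, "status", None) in TERMINAL_STATUSES:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

With the client from the example above, this would be called as wait_for_job(client.job.get, job.job_id). When a webhook is configured, polling may be unnecessary.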

Reusing Parsed Documents

Extract from a document that was previously parsed:

from reducto import Reducto

client = Reducto()

# First parse the document
parse_response = client.parse.run(
    input="https://example.com/document.pdf"
)

# Then extract using the job ID (no re-parsing needed)
extract_response = client.extract.run(
    input=f"jobid://{parse_response.job_id}",
    instructions={
        "schema": {"key_data": "string"}
    }
)
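
Reusing a parse job is most useful when several extractions run against the same document. As a rough sketch, the hypothetical helper below builds the keyword arguments for one extract call per named schema, all pointing at the same jobid:// URL; it makes no network calls itself.

```python
def extraction_requests(parse_job_id, schemas):
    """Return {name: kwargs} with one set of extract arguments per schema.

    Hypothetical helper: `schemas` maps a label of your choosing to a schema dict.
    """
    if not parse_job_id:
        raise ValueError("parse_job_id is required")
    source = f"jobid://{parse_job_id}"
    return {
        name: {"input": source, "instructions": {"schema": schema}}
        for name, schema in schemas.items()
    }
```

Each value can then be unpacked into a call such as client.extract.run(**kwargs), so the document is parsed once and extracted from as many times as needed.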

Complex Schema Example

from reducto import Reducto

client = Reducto()

# Extract structured data from financial statements
schema = {
    "company_info": {
        "name": "string",
        "ticker": "string",
        "fiscal_year": "string"
    },
    "financial_metrics": {
        "revenue": "number",
        "net_income": "number",
        "eps": "number",
        "operating_expenses": "number"
    },
    "balance_sheet": {
        "total_assets": "number",
        "total_liabilities": "number",
        "shareholders_equity": "number"
    },
    "key_risks": "array"
}

response = client.extract.run(
    input="https://example.com/10k-filing.pdf",
    instructions={"schema": schema}
)

print(response.data)
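
Nested results like the one this schema describes are often easier to export or compare once flattened. The following sketch assumes the extracted payload is a plain dict mirroring the schema; flatten is a hypothetical helper that joins nested keys with dots and leaves arrays untouched.

```python
def flatten(data, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects, carrying the path so far.
            flat.update(flatten(value, path))
        else:
            # Scalars and lists are kept as-is.
            flat[path] = value
    return flat
```

For example, a value stored under company_info and then ticker would end up under the key "company_info.ticker".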
