The Extract API allows you to extract specific structured data from documents using natural language instructions or schemas.
Basic Usage
from reducto import Reducto
client = Reducto()
response = client.extract.run(
    input="https://example.com/invoice.pdf",
    instructions={
        "schema": {
            "invoice_number": "string",
            "date": "string",
            "total_amount": "number",
            "line_items": "array"
        }
    }
)
print(response)
The same call with the async client:
import asyncio
from reducto import AsyncReducto
client = AsyncReducto()
async def main():
    response = await client.extract.run(
        input="https://example.com/invoice.pdf",
        instructions={
            "schema": {
                "invoice_number": "string",
                "date": "string",
                "total_amount": "number",
                "line_items": "array"
            }
        }
    )
    print(response)
asyncio.run(main())
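Because the async client returns awaitables, several documents can be extracted concurrently. The following is a minimal sketch, not part of the SDK: `extract_many` and `fake_extract` are hypothetical names, and `fake_extract` stands in for `client.extract.run` so the example runs without network access or an API key.

```python
import asyncio


async def extract_many(extract, urls, instructions, max_concurrency=4):
    """Call `extract` once per URL, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def extract_one(url):
        async with semaphore:
            return await extract(input=url, instructions=instructions)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(extract_one(url) for url in urls))


async def fake_extract(input, instructions):
    """Stand-in for an async `client.extract.run` (assumption, for illustration)."""
    await asyncio.sleep(0)
    return {"source": input, "fields": sorted(instructions["schema"])}


urls = [
    "https://example.com/invoice-1.pdf",
    "https://example.com/invoice-2.pdf",
]
instructions = {"schema": {"invoice_number": "string", "total_amount": "number"}}
results = asyncio.run(extract_many(fake_extract, urls, instructions))
```

With the real SDK, the `extract` argument would presumably be `client.extract.run` from an `AsyncReducto` instance, called inside your own `async def main()`.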
Method Signature
client.extract.run(
    input: str,
    instructions: dict | None = None,
    parsing: ParseOptions | None = None,
    settings: dict | None = None,
    async_: ConfigV3AsyncConfig | None = None
) -> ExtractRunResponse
Parameters
input: The URL of the document to extract from. You can provide:
- A publicly available URL
- A presigned S3 URL
- A reducto:// prefixed URL from the /upload endpoint
- A jobid:// prefixed URL from a previous parse invocation
- A list of URLs (for multi-document pipelines, V3 API only)
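The accepted input forms can be told apart by their URL scheme. As a rough sketch, the hypothetical helper below (`classify_input` is not an SDK function) labels a value before it is sent, which can be handy for logging or early validation. A public URL and a presigned S3 URL both use `https`, so this sketch does not distinguish them.

```python
from urllib.parse import urlparse


def classify_input(value):
    """Return a label for the kind of input reference (hypothetical helper)."""
    if isinstance(value, (list, tuple)):
        return "url_list"
    scheme = urlparse(value).scheme.lower()
    if scheme == "reducto":
        return "uploaded_file"
    if scheme == "jobid":
        return "previous_parse_job"
    if scheme in ("http", "https"):
        return "remote_url"
    raise ValueError(f"Unrecognized input reference: {value!r}")


labels = [
    classify_input("https://example.com/invoice.pdf"),
    classify_input("reducto://file-123"),
    classify_input("jobid://job-123"),
]
```

The identifiers `file-123` and `job-123` are placeholders, not real upload or job IDs.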
instructions: What to extract from the document. Provide either:
- A schema object defining the structure to extract
- Natural language instructions describing what to extract
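A small wrapper can make the two instruction styles explicit. This is only a sketch: `build_instructions` is a hypothetical helper, and the `"schema"` and `"prompt"` keys are taken from the examples on this page. Whether the API accepts both keys together is not stated here, so the sketch conservatively requires exactly one.

```python
def build_instructions(schema=None, prompt=None):
    """Build the `instructions` dict from a schema or a prompt (hypothetical helper)."""
    if (schema is None) == (prompt is None):
        raise ValueError("Provide exactly one of `schema` or `prompt`")
    if schema is not None:
        return {"schema": schema}
    return {"prompt": prompt}


by_schema = build_instructions(schema={"invoice_number": "string"})
by_prompt = build_instructions(prompt="Extract the invoice number.")
```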
parsing: Configuration options for parsing the document. Ignored when input is a jobid:// URL.
settings: Settings to control the extraction process.
async_: Configuration for asynchronous processing. When provided, the call returns immediately with a job ID.
Schema-Based Extraction
Define a schema to extract specific fields:
from reducto import Reducto
client = Reducto()
# Define extraction schema
schema = {
    "company_name": "string",
    "revenue": "number",
    "employees": "number",
    "founded_date": "string",
    "headquarters": {
        "city": "string",
        "country": "string"
    }
}
response = client.extract.run(
    input="https://example.com/company-report.pdf",
    instructions={"schema": schema}
)
# Access extracted data
print(response.data)
The same extraction with the async client:
import asyncio
from reducto import AsyncReducto
client = AsyncReducto()
async def main():
    # Define extraction schema
    schema = {
        "company_name": "string",
        "revenue": "number",
        "employees": "number",
        "founded_date": "string",
        "headquarters": {
            "city": "string",
            "country": "string"
        }
    }
    response = await client.extract.run(
        input="https://example.com/company-report.pdf",
        instructions={"schema": schema}
    )
    # Access extracted data
    print(response.data)
asyncio.run(main())
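Once data comes back, it can be useful to check it against the schema that was requested. The sketch below assumes the extracted data is available as a plain dict that mirrors the schema's nesting; the exact shape of `response.data` is not specified on this page, so treat that as an assumption. `check_against_schema` is a hypothetical helper, and the sample values are made up.

```python
def check_against_schema(data, schema, path=""):
    """Return a list of mismatches between `data` and a type-map schema."""
    type_checks = {
        "string": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "array": lambda v: isinstance(v, list),
    }
    problems = []
    for key, expected in schema.items():
        where = f"{path}{key}"
        if key not in data:
            problems.append(f"{where}: missing")
            continue
        value = data[key]
        if isinstance(expected, dict):
            if isinstance(value, dict):
                problems.extend(check_against_schema(value, expected, where + "."))
            else:
                problems.append(f"{where}: expected object")
        elif expected not in type_checks:
            problems.append(f"{where}: unknown type {expected!r}")
        elif not type_checks[expected](value):
            problems.append(f"{where}: expected {expected}")
    return problems


schema = {
    "company_name": "string",
    "revenue": "number",
    "headquarters": {"city": "string", "country": "string"},
}
sample = {
    "company_name": "Acme",
    "revenue": "1.2M",
    "headquarters": {"city": "Paris"},
}
problems = check_against_schema(sample, schema)
```

Here `problems` lists the `revenue` type mismatch and the missing `headquarters.country` field; an empty list means every field matched.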
Natural Language Instructions
Use natural language to describe what to extract:
from reducto import Reducto
client = Reducto()
response = client.extract.run(
    input="https://example.com/contract.pdf",
    instructions={
        "prompt": "Extract all parties involved in the contract, the contract start date, end date, and key obligations for each party."
    }
)
print(response.data)
Parsing Options
Combine extraction with custom parsing options:
from reducto import Reducto
client = Reducto()
response = client.extract.run(
    input="https://example.com/document.pdf",
    instructions={
        "schema": {
            "section_titles": "array",
            "key_figures": "array"
        }
    },
    parsing={
        "enhance": {
            "summarize_figures": True
        },
        "formatting": {
            "add_page_markers": True
        }
    }
)
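Parsing options are ignored when the input is a jobid:// URL, so a caller may prefer to drop them up front rather than send settings that have no effect. The helper below is a hypothetical sketch of that idea (`build_extract_kwargs` is not an SDK function); it only assembles keyword arguments and makes no API call.

```python
def build_extract_kwargs(input, instructions, parsing=None):
    """Assemble kwargs for an extract call, omitting `parsing` for jobid:// inputs."""
    kwargs = {"input": input, "instructions": instructions}
    reuses_parse_job = isinstance(input, str) and input.startswith("jobid://")
    if parsing is not None and not reuses_parse_job:
        kwargs["parsing"] = parsing
    return kwargs


parsing = {"formatting": {"add_page_markers": True}}
instructions = {"schema": {"section_titles": "array"}}

fresh = build_extract_kwargs("https://example.com/document.pdf", instructions, parsing)
reused = build_extract_kwargs("jobid://job-123", instructions, parsing)
```

The result could then be passed along as `client.extract.run(**fresh)`. The job ID `job-123` is a placeholder.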
Async Job Processing
For large documents or batch processing, use async jobs:
from reducto import Reducto
client = Reducto()
# Start an async extraction job
job = client.extract.run_job(
    input="https://example.com/large-document.pdf",
    instructions={
        "schema": {
            "field1": "string",
            "field2": "number"
        }
    },
    async_={
        "webhook": {"url": "https://example.com/webhook"}
    }
)
print(f"Job ID: {job.job_id}")
# Fetch the job's current state; repeat this call to poll until it finishes
result = client.job.get(job.job_id)
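A single `client.job.get` call returns the job's state at that moment, so polling needs a loop. The sketch below is one possible shape for it. It assumes the returned object has a `status` attribute with the values `"Pending"`, `"Completed"`, and `"Failed"`; those names are assumptions, not confirmed by this page. `wait_for_job` and `fake_get_job` are hypothetical, and the fake stands in for `client.job.get` so the example runs offline.

```python
import time
from types import SimpleNamespace


def wait_for_job(get_job, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call `get_job(job_id)` until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        status = getattr(job, "status", None)  # assumed attribute name
        if status == "Completed":
            return job
        if status == "Failed":
            raise RuntimeError(f"Job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")
        sleep(interval)


# Offline stand-in for `client.job.get`: pending twice, then completed.
_statuses = iter(["Pending", "Pending", "Completed"])
polls = []


def fake_get_job(job_id):
    polls.append(job_id)
    return SimpleNamespace(job_id=job_id, status=next(_statuses))


finished = wait_for_job(fake_get_job, "job-123", sleep=lambda seconds: None)
```

With the real client this would presumably be called as `wait_for_job(client.job.get, job.job_id)`. If a webhook is configured, polling may be unnecessary.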
Reusing Parsed Documents
Extract from a document that was previously parsed:
from reducto import Reducto
client = Reducto()
# First parse the document
parse_response = client.parse.run(
    input="https://example.com/document.pdf"
)
# Then extract using the job ID (no re-parsing needed)
extract_response = client.extract.run(
    input=f"jobid://{parse_response.job_id}",
    instructions={
        "schema": {"key_data": "string"}
    }
)
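The same pattern extends to running several extractions against one parse job. The following sketch uses only the two calls shown above (`client.parse.run` and `client.extract.run`); `extract_with_schemas` is a hypothetical helper, and `stub_client` is a stand-in that records calls so the example runs without the SDK or network access.

```python
from types import SimpleNamespace


def extract_with_schemas(client, document_url, schemas):
    """Parse once, then run one extraction per named schema on the parse job."""
    parse_response = client.parse.run(input=document_url)
    job_input = f"jobid://{parse_response.job_id}"
    return {
        name: client.extract.run(input=job_input, instructions={"schema": schema})
        for name, schema in schemas.items()
    }


# Stand-in client that records calls instead of contacting the API.
calls = []


def _fake_parse(input):
    calls.append(("parse", input))
    return SimpleNamespace(job_id="job-123")


def _fake_extract(input, instructions):
    calls.append(("extract", input))
    return SimpleNamespace(data={key: None for key in instructions["schema"]})


stub_client = SimpleNamespace(
    parse=SimpleNamespace(run=_fake_parse),
    extract=SimpleNamespace(run=_fake_extract),
)

results = extract_with_schemas(
    stub_client,
    "https://example.com/document.pdf",
    {"summary": {"key_data": "string"}, "totals": {"total_amount": "number"}},
)
```

The document is parsed once and each schema reuses the resulting jobid:// reference, which is the point of this section.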
Complex Schema Example
from reducto import Reducto
client = Reducto()
# Extract structured data from financial statements
schema = {
    "company_info": {
        "name": "string",
        "ticker": "string",
        "fiscal_year": "string"
    },
    "financial_metrics": {
        "revenue": "number",
        "net_income": "number",
        "eps": "number",
        "operating_expenses": "number"
    },
    "balance_sheet": {
        "total_assets": "number",
        "total_liabilities": "number",
        "shareholders_equity": "number"
    },
    "key_risks": "array"
}
response = client.extract.run(
    input="https://example.com/10k-filing.pdf",
    instructions={"schema": schema}
)
print(response.data)
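Nested results like these are often easier to load into a spreadsheet or database row once flattened. As a sketch, assuming the extracted data is a plain nested dict that mirrors the schema (an assumption about the response shape), the hypothetical `flatten` helper below turns it into dotted keys. The sample values are invented for illustration.

```python
def flatten(data, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            # Lists and scalars are kept as-is.
            flat[name] = value
    return flat


sample = {
    "company_info": {"name": "Acme", "ticker": "ACME"},
    "financial_metrics": {"revenue": 10.5},
    "key_risks": ["supply chain", "currency"],
}
row = flatten(sample)
```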