
Overview

The extract() function is the core API for extracting structured information from text using language models. It processes text through an LLM based on your instructions and examples, returning annotated documents with extracted entities.
import langextract as lx

result = lx.extract(
    text_or_documents="John Smith works at Google in Mountain View.",
    prompt_description="Extract people and their employers",
    examples=[...],
    model_id="gemini-2.5-flash"
)

Function Signature

def extract(
    text_or_documents: typing.Any,
    prompt_description: str | None = None,
    examples: typing.Sequence[typing.Any] | None = None,
    model_id: str = "gemini-2.5-flash",
    api_key: str | None = None,
    language_model_type: typing.Type[typing.Any] | None = None,
    format_type: typing.Any = None,
    max_char_buffer: int = 1000,
    temperature: float | None = None,
    fence_output: bool | None = None,
    use_schema_constraints: bool = True,
    batch_length: int = 10,
    max_workers: int = 10,
    additional_context: str | None = None,
    resolver_params: dict | None = None,
    language_model_params: dict | None = None,
    debug: bool = False,
    model_url: str | None = None,
    extraction_passes: int = 1,
    context_window_chars: int | None = None,
    config: typing.Any = None,
    model: typing.Any = None,
    *,
    fetch_urls: bool = True,
    prompt_validation_level: pv.PromptValidationLevel = pv.PromptValidationLevel.WARNING,
    prompt_validation_strict: bool = False,
    show_progress: bool = True,
    tokenizer: tokenizer_lib.Tokenizer | None = None,
) -> list[data.AnnotatedDocument] | data.AnnotatedDocument

Parameters

Required Parameters

text_or_documents
str | URL | Iterable[Document]
required
The source text to extract information from. Can be:
  • A string of text to analyze
  • A URL starting with http:// or https:// (when fetch_urls=True)
  • An iterable of Document objects for batch processing
examples
Sequence[ExampleData]
required
List of ExampleData objects that guide the extraction. These few-shot examples show the model what information to extract and how to structure it.
This parameter is required. The function will raise a ValueError if not provided.

Core Parameters

prompt_description
str | None
default:"None"
Instructions for what information to extract from the text. This natural language description guides the model's extraction behavior. Example: "Extract people and their employers"
model_id
str
default:"gemini-2.5-flash"
The model ID to use for extraction (e.g., 'gemini-2.5-flash', 'gpt-4'). If your model ID is not recognized or you need a custom provider, use the config parameter with factory.ModelConfig to specify the provider explicitly.
api_key
str | None
default:"None"
API key for Gemini or other LLM services. Can also be set via the LANGEXTRACT_API_KEY environment variable.
Cost Considerations: Most APIs charge by token volume. Smaller max_char_buffer values increase the number of API calls, while extraction_passes > 1 reprocesses tokens multiple times. Note that max_workers improves processing speed without additional token costs. Monitor usage with small test runs to estimate costs.
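To see how these settings interact, here is a back-of-envelope call-count estimate. This helper is not part of langextract and ignores prompt, example, and context-window overhead; it simply counts one call per chunk, per pass:

```python
import math

def estimate_llm_calls(text_length: int,
                       max_char_buffer: int = 1000,
                       extraction_passes: int = 1) -> int:
    """Rough upper bound: one API call per chunk, per extraction pass."""
    chunks = math.ceil(text_length / max_char_buffer)
    return chunks * extraction_passes

# A 50,000-character document, default chunking, 2 passes -> 100 calls
print(estimate_llm_calls(50_000, max_char_buffer=1000, extraction_passes=2))
```

Halving max_char_buffer roughly doubles the call count, while raising max_workers changes wall-clock time only, not token spend.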

Model Configuration

config
ModelConfig | None
default:"None"
Model configuration to use for extraction. Takes precedence over model_id, api_key, and language_model_type parameters. When both model and config are provided, model takes precedence.
model
BaseLanguageModel | None
default:"None"
Pre-configured language model instance to use for extraction. Takes precedence over all other parameters including config.
language_model_type
Type[Any] | None
default:"None"
[DEPRECATED] This parameter is deprecated and will be removed in v2.0.0. Use model, config, or model_id parameters instead.
The type of language model to use for inference. A deprecation warning is emitted when the value differs from the legacy default.
model_url
str | None
default:"None"
Endpoint URL for self-hosted or on-premises models. Only forwarded when the selected language_model_type accepts this argument.
language_model_params
dict | None
default:"None"
Additional parameters to pass to the language model constructor.

Output Format

format_type
FormatType | None
default:"None"
The format type for the output (JSON or YAML). When None, defaults to FormatType.JSON.
fence_output
bool | None
default:"None"
Whether to expect/generate fenced output (```json or ```yaml).
  • When True: model generates fenced output and resolver expects it
  • When False: raw JSON/YAML is expected
  • When None (default): automatically determined based on provider schema capabilities
If your model utilizes schema constraints, this can generally be set to False unless the constraint also accounts for code fence delimiters.
use_schema_constraints
bool
default:"True"
Whether to generate schema constraints for models. For supported models, this enables structured outputs.

Generation Parameters

temperature
float | None
default:"None"
The sampling temperature for generation.
  • When None (default): uses the model’s default temperature
  • Set to 0.0 for deterministic output
  • Higher values produce more variation

Processing Parameters

max_char_buffer
int
default:"1000"
Maximum number of characters for inference per chunk. Larger values mean fewer API calls but may exceed model context limits.
batch_length
int
default:"10"
Number of text chunks processed per batch. Higher values enable greater parallelization, up to the limit set by max_workers.
max_workers
int
default:"10"
Maximum parallel workers for concurrent processing. Effective parallelization is limited by min(batch_length, max_workers). Supported by Gemini models.
If batch_length < max_workers, only batch_length workers will be used. Set batch_length >= max_workers for optimal parallelization.
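The interaction above reduces to a one-line rule. A small sketch (this function is illustrative, not part of the library API):

```python
def effective_parallelism(batch_length: int, max_workers: int) -> int:
    # Chunks are dispatched batch_length at a time, so at most
    # batch_length workers can ever be busy simultaneously.
    return min(batch_length, max_workers)

print(effective_parallelism(batch_length=4, max_workers=10))   # only 4 workers used
print(effective_parallelism(batch_length=20, max_workers=10))  # all 10 workers used
```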
extraction_passes
int
default:"1"
Number of sequential extraction attempts to improve recall and find additional entities. When > 1, the system performs multiple independent extractions and merges non-overlapping results (first extraction wins for overlaps).
Cost Warning: Each additional pass reprocesses tokens, potentially increasing API costs. For example, extraction_passes=3 reprocesses tokens 3x.
context_window_chars
int | None
default:"None"
Number of characters from the previous chunk to include as context for the current chunk. This helps with coreference resolution across chunk boundaries (e.g., resolving “She” to a person mentioned in the previous chunk).
additional_context
str | None
default:"None"
Additional context to be added to the prompt during inference.

Resolver Parameters

resolver_params
dict | None
default:"None"
Parameters for the resolver.Resolver, which parses the raw language model output string into structured Extraction objects. This dictionary overrides default settings. Available keys:
  • extraction_index_suffix (str | None): Suffix for keys indicating extraction order. Default is None (order by appearance)
  • enable_fuzzy_alignment (bool): Whether to use fuzzy matching if exact matching fails. Default is True
  • fuzzy_alignment_threshold (float): Minimum token overlap ratio for fuzzy match (0.0-1.0). Default is 0.75
  • accept_match_lesser (bool): Whether to accept partial exact matches. Default is True
  • suppress_parse_errors (bool): Whether to suppress parsing errors and continue pipeline. Default is False
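For example, an override that loosens fuzzy matching and tolerates parse errors might look like this (the values here are illustrative, not recommendations):

```python
# Override a subset of resolver defaults; omitted keys keep their defaults
resolver_params = {
    "enable_fuzzy_alignment": True,
    "fuzzy_alignment_threshold": 0.6,  # below the 0.75 default
    "suppress_parse_errors": True,     # continue past malformed chunks
}
```

Pass it through as lx.extract(..., resolver_params=resolver_params).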

Validation Parameters

prompt_validation_level
PromptValidationLevel
default:"WARNING"
Controls pre-flight alignment checks on few-shot examples:
  • OFF: Skips validation
  • WARNING: Logs issues but continues
  • ERROR: Raises on failures
prompt_validation_strict
bool
default:"False"
When True and prompt_validation_level is ERROR, raises on non-exact matches (MATCH_FUZZY, MATCH_LESSER).

Utility Parameters

fetch_urls
bool
default:"True"
Whether to automatically download content when the input is a URL string.
  • When True (default): strings starting with http:// or https:// are fetched
  • When False: all strings are treated as literal text to analyze
This is a keyword-only parameter.
show_progress
bool
default:"True"
Whether to show progress bar during extraction.
debug
bool
default:"False"
Whether to enable debug logging. When True, enables detailed logging of function calls, arguments, return values, and timing for the langextract namespace.
Debug logging remains enabled for the process once activated.
tokenizer
Tokenizer | None
default:"None"
Optional Tokenizer instance to use for chunking and alignment. If None, defaults to RegexTokenizer.

Returns

result
AnnotatedDocument | list[AnnotatedDocument]
Returns a single AnnotatedDocument when input is a string or URL, or a list of AnnotatedDocument objects when input is an iterable of Document objects. Each AnnotatedDocument contains:
  • text: The original text
  • extractions: List of Extraction objects with extracted entities
  • Each Extraction includes:
    • extraction_class: The entity type
    • extraction_text: The extracted text span
    • attributes: Dictionary of extracted attributes
    • char_interval: Character positions in the original text
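Because the return type depends on the input type, downstream code often normalizes it before iterating. A minimal helper (not part of langextract):

```python
def as_document_list(result):
    """Wrap a single AnnotatedDocument in a list; pass lists through."""
    return result if isinstance(result, list) else [result]

# Then iterate uniformly, whatever you passed to extract():
# for doc in as_document_list(result):
#     for extraction in doc.extractions:
#         print(extraction.extraction_class, extraction.extraction_text)
```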

Exceptions

ValueError
exception
Raised when:
  • examples is None or empty
  • No API key is provided or found in environment variables
requests.RequestException
exception
Raised when URL download fails (when fetch_urls=True and input is a URL).
PromptAlignmentError
exception
Raised when validation fails in ERROR mode (when prompt_validation_level is set to ERROR).

Examples

Basic Usage

import langextract as lx

# Define examples
examples = [
    lx.data.ExampleData(
        text="Jane Doe works at Apple.",
        extractions=[
            lx.data.Extraction(
                extraction_class="person",
                extraction_text="Jane Doe",
                attributes={"name": "Jane Doe"},
                char_interval=lx.data.CharInterval(0, 8)
            ),
            lx.data.Extraction(
                extraction_class="company",
                extraction_text="Apple",
                attributes={"name": "Apple"},
                char_interval=lx.data.CharInterval(18, 23)
            )
        ]
    )
]

# Extract from text
result = lx.extract(
    text_or_documents="John Smith works at Google in Mountain View.",
    prompt_description="Extract people and companies",
    examples=examples,
    api_key="your-api-key"
)

print(result.extractions)

Extract from URL

import langextract as lx

result = lx.extract(
    text_or_documents="https://example.com/article.html",
    prompt_description="Extract key facts",
    examples=examples,
    fetch_urls=True  # Automatically downloads content
)

Multiple Extraction Passes

import langextract as lx

# Perform 3 extraction passes to improve recall
result = lx.extract(
    text_or_documents=long_document,
    prompt_description="Extract all mentions of products",
    examples=examples,
    extraction_passes=3  # Will make 3 passes and merge results
)

Custom Model Configuration

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="gpt-4",
    provider_kwargs={
        "api_key": "your-openai-key",
        "temperature": 0.0
    }
)

result = lx.extract(
    text_or_documents=text,
    prompt_description="Extract entities",
    examples=examples,
    config=config
)

Batch Processing

import langextract as lx

# Create documents
documents = [
    lx.data.Document(text="Document 1 text..."),
    lx.data.Document(text="Document 2 text..."),
    lx.data.Document(text="Document 3 text...")
]

# Process all documents
results = lx.extract(
    text_or_documents=documents,
    prompt_description="Extract entities",
    examples=examples,
    max_workers=10  # Process in parallel
)

for doc in results:
    print(f"Found {len(doc.extractions)} extractions")

With Context Windows

import langextract as lx

# Use context windows to resolve references across chunks
result = lx.extract(
    text_or_documents=long_text,
    prompt_description="Extract people and their actions",
    examples=examples,
    max_char_buffer=500,
    context_window_chars=100  # Include 100 chars from previous chunk
)
