
Overview

The extract() function is the core API for extracting structured information from text using language models. It processes text through an LLM based on your instructions and examples, returning annotated documents with extracted entities.
import langextract as lx

result = lx.extract(
    text_or_documents="John Smith works at Google in Mountain View.",
    prompt_description="Extract people and their employers",
    examples=[...],
    model_id="gemini-2.5-flash"
)

Function Signature

def extract(
    text_or_documents: typing.Any,
    prompt_description: str | None = None,
    examples: typing.Sequence[typing.Any] | None = None,
    model_id: str = "gemini-2.5-flash",
    api_key: str | None = None,
    language_model_type: typing.Type[typing.Any] | None = None,
    format_type: typing.Any = None,
    max_char_buffer: int = 1000,
    temperature: float | None = None,
    fence_output: bool | None = None,
    use_schema_constraints: bool = True,
    batch_length: int = 10,
    max_workers: int = 10,
    additional_context: str | None = None,
    resolver_params: dict | None = None,
    language_model_params: dict | None = None,
    debug: bool = False,
    model_url: str | None = None,
    extraction_passes: int = 1,
    context_window_chars: int | None = None,
    config: typing.Any = None,
    model: typing.Any = None,
    *,
    fetch_urls: bool = True,
    prompt_validation_level: pv.PromptValidationLevel = pv.PromptValidationLevel.WARNING,
    prompt_validation_strict: bool = False,
    show_progress: bool = True,
    tokenizer: tokenizer_lib.Tokenizer | None = None,
) -> list[data.AnnotatedDocument] | data.AnnotatedDocument

Parameters

Required Parameters

text_or_documents
str | URL | Iterable[Document]
required
The source text to extract information from. Can be:
  • A string of text to analyze
  • A URL starting with http:// or https:// (when fetch_urls=True)
  • An iterable of Document objects for batch processing
examples
Sequence[ExampleData]
required
List of ExampleData objects that guide the extraction. These few-shot examples show the model what information to extract and how to structure it.
This parameter is required. The function will raise a ValueError if not provided.

Core Parameters

prompt_description
str | None
default:"None"
Instructions for what information to extract from the text. This natural language description guides the model's extraction behavior. Example: "Extract people and their employers"
model_id
str
default:"gemini-2.5-flash"
The model ID to use for extraction (e.g., 'gemini-2.5-flash', 'gpt-4'). If your model ID is not recognized or you need a custom provider, use the config parameter with factory.ModelConfig to specify the provider explicitly.
api_key
str | None
default:"None"
API key for Gemini or other LLM services. Can also be set via the LANGEXTRACT_API_KEY environment variable.
Cost Considerations: Most APIs charge by token volume. Smaller max_char_buffer values increase the number of API calls, while extraction_passes > 1 reprocesses tokens multiple times. Note that max_workers improves processing speed without additional token costs. Monitor usage with small test runs to estimate costs.
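To see how these settings interact, here is a back-of-envelope call-count estimate. This helper is not part of langextract and ignores prompt, example, and context-window overhead; it simply counts one call per chunk, per pass:

```python
import math

def estimate_llm_calls(text_length: int,
                       max_char_buffer: int = 1000,
                       extraction_passes: int = 1) -> int:
    """Rough upper bound: one API call per chunk, per extraction pass."""
    chunks = math.ceil(text_length / max_char_buffer)
    return chunks * extraction_passes

# A 50,000-character document, default chunking, 2 passes -> 100 calls
print(estimate_llm_calls(50_000, max_char_buffer=1000, extraction_passes=2))
```

Halving max_char_buffer roughly doubles the call count, while raising max_workers changes wall-clock time only, not token spend.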

Model Configuration

config
ModelConfig | None
default:"None"
Model configuration to use for extraction. Takes precedence over model_id, api_key, and language_model_type parameters. When both model and config are provided, model takes precedence.
model
BaseLanguageModel | None
default:"None"
Pre-configured language model instance to use for extraction. Takes precedence over all other parameters including config.
language_model_type
Type[Any] | None
default:"None"
[DEPRECATED] This parameter is deprecated and will be removed in v2.0.0. Use model, config, or model_id parameters instead.
The type of language model to use for inference. A deprecation warning is emitted when the value differs from the legacy default.
model_url
str | None
default:"None"
Endpoint URL for self-hosted or on-premises models. Only forwarded when the selected language_model_type accepts this argument.
language_model_params
dict | None
default:"None"
Additional parameters to pass to the language model constructor.

Output Format

format_type
FormatType | None
default:"None"
The format type for the output (JSON or YAML). When None, defaults to FormatType.JSON.
fence_output
bool | None
default:"None"
Whether to expect/generate fenced output (```json or ```yaml).
  • When True: model generates fenced output and resolver expects it
  • When False: raw JSON/YAML is expected
  • When None (default): automatically determined based on provider schema capabilities
If your model utilizes schema constraints, this can generally be set to False unless the constraint also accounts for code fence delimiters.
use_schema_constraints
bool
default:"True"
Whether to generate schema constraints for models. For supported models, this enables structured outputs.

Generation Parameters

temperature
float | None
default:"None"
The sampling temperature for generation.
  • When None (default): uses the model’s default temperature
  • Set to 0.0 for deterministic output
  • Higher values produce more variation

Processing Parameters

max_char_buffer
int
default:"1000"
Maximum number of characters for inference per chunk. Larger values mean fewer API calls but may exceed model context limits.
batch_length
int
default:"10"
Number of text chunks processed per batch. Higher values enable greater parallelization, up to the limit set by max_workers.
max_workers
int
default:"10"
Maximum parallel workers for concurrent processing. Effective parallelization is limited by min(batch_length, max_workers). Supported by Gemini models.
If batch_length < max_workers, only batch_length workers will be used. Set batch_length >= max_workers for optimal parallelization.
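The interaction above reduces to a one-line rule. A small sketch (this function is illustrative, not part of the library API):

```python
def effective_parallelism(batch_length: int, max_workers: int) -> int:
    # Chunks are dispatched batch_length at a time, so at most
    # batch_length workers can ever be busy simultaneously.
    return min(batch_length, max_workers)

print(effective_parallelism(batch_length=4, max_workers=10))   # only 4 workers used
print(effective_parallelism(batch_length=20, max_workers=10))  # all 10 workers used
```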
extraction_passes
int
default:"1"
Number of sequential extraction attempts to improve recall and find additional entities. When > 1, the system performs multiple independent extractions and merges non-overlapping results (first extraction wins for overlaps).
Cost Warning: Each additional pass reprocesses tokens, potentially increasing API costs. For example, extraction_passes=3 reprocesses tokens 3x.
context_window_chars
int | None
default:"None"
Number of characters from the previous chunk to include as context for the current chunk. This helps with coreference resolution across chunk boundaries (e.g., resolving “She” to a person mentioned in the previous chunk).
additional_context
str | None
default:"None"
Additional context to be added to the prompt during inference.

Resolver Parameters

resolver_params
dict | None
default:"None"
Parameters for the resolver.Resolver, which parses the raw language model output string into structured Extraction objects. This dictionary overrides default settings. Available keys:
  • extraction_index_suffix (str | None): Suffix for keys indicating extraction order. Default is None (order by appearance)
  • enable_fuzzy_alignment (bool): Whether to use fuzzy matching if exact matching fails. Default is True
  • fuzzy_alignment_threshold (float): Minimum token overlap ratio for fuzzy match (0.0-1.0). Default is 0.75
  • accept_match_lesser (bool): Whether to accept partial exact matches. Default is True
  • suppress_parse_errors (bool): Whether to suppress parsing errors and continue pipeline. Default is False
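For example, an override that loosens fuzzy matching and tolerates parse errors might look like this (the values here are illustrative, not recommendations):

```python
# Override a subset of resolver defaults; omitted keys keep their defaults
resolver_params = {
    "enable_fuzzy_alignment": True,
    "fuzzy_alignment_threshold": 0.6,  # below the 0.75 default
    "suppress_parse_errors": True,     # continue past malformed chunks
}
```

Pass it through as lx.extract(..., resolver_params=resolver_params).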

Validation Parameters

prompt_validation_level
PromptValidationLevel
default:"WARNING"
Controls pre-flight alignment checks on few-shot examples:
  • OFF: Skips validation
  • WARNING: Logs issues but continues
  • ERROR: Raises on failures
prompt_validation_strict
bool
default:"False"
When True and prompt_validation_level is ERROR, raises on non-exact matches (MATCH_FUZZY, MATCH_LESSER).

Utility Parameters

fetch_urls
bool
default:"True"
Whether to automatically download content when the input is a URL string.
  • When True (default): strings starting with http:// or https:// are fetched
  • When False: all strings are treated as literal text to analyze
This is a keyword-only parameter.
show_progress
bool
default:"True"
Whether to show progress bar during extraction.
debug
bool
default:"False"
Whether to enable debug logging. When True, enables detailed logging of function calls, arguments, return values, and timing for the langextract namespace.
Debug logging remains enabled for the process once activated.
tokenizer
Tokenizer | None
default:"None"
Optional Tokenizer instance to use for chunking and alignment. If None, defaults to RegexTokenizer.

Returns

result
AnnotatedDocument | list[AnnotatedDocument]
Returns a single AnnotatedDocument when input is a string or URL, or a list of AnnotatedDocument objects when input is an iterable of Document objects. Each AnnotatedDocument contains:
  • text: The original text
  • extractions: List of Extraction objects with extracted entities
  • Each Extraction includes:
    • extraction_class: The entity type
    • extraction_text: The extracted text span
    • attributes: Dictionary of extracted attributes
    • char_interval: Character positions in the original text
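Because the return type depends on the input type, downstream code often normalizes it before iterating. A minimal helper (not part of langextract):

```python
def as_document_list(result):
    """Wrap a single AnnotatedDocument in a list; pass lists through."""
    return result if isinstance(result, list) else [result]

# Then iterate uniformly, whatever you passed to extract():
# for doc in as_document_list(result):
#     for extraction in doc.extractions:
#         print(extraction.extraction_class, extraction.extraction_text)
```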

Exceptions

ValueError
exception
Raised when:
  • examples is None or empty
  • No API key is provided or found in environment variables
requests.RequestException
exception
Raised when URL download fails (when fetch_urls=True and input is a URL).
PromptAlignmentError
exception
Raised when validation fails in ERROR mode (when prompt_validation_level is set to ERROR).

Examples

Basic Usage

import langextract as lx

# Define examples
examples = [
    lx.data.ExampleData(
        text="Jane Doe works at Apple.",
        extractions=[
            lx.data.Extraction(
                extraction_class="person",
                extraction_text="Jane Doe",
                attributes={"name": "Jane Doe"},
                char_interval=lx.data.CharInterval(0, 8)
            ),
            lx.data.Extraction(
                extraction_class="company",
                extraction_text="Apple",
                attributes={"name": "Apple"},
                char_interval=lx.data.CharInterval(18, 23)
            )
        ]
    )
]

# Extract from text
result = lx.extract(
    text_or_documents="John Smith works at Google in Mountain View.",
    prompt_description="Extract people and companies",
    examples=examples,
    api_key="your-api-key"
)

print(result.extractions)

Extract from URL

import langextract as lx

result = lx.extract(
    text_or_documents="https://example.com/article.html",
    prompt_description="Extract key facts",
    examples=examples,
    fetch_urls=True  # Automatically downloads content
)

Multiple Extraction Passes

import langextract as lx

# Perform 3 extraction passes to improve recall
result = lx.extract(
    text_or_documents=long_document,
    prompt_description="Extract all mentions of products",
    examples=examples,
    extraction_passes=3  # Will make 3 passes and merge results
)

Custom Model Configuration

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="gpt-4",
    provider_kwargs={
        "api_key": "your-openai-key",
        "temperature": 0.0
    }
)

result = lx.extract(
    text_or_documents=text,
    prompt_description="Extract entities",
    examples=examples,
    config=config
)

Batch Processing

import langextract as lx

# Create documents
documents = [
    lx.data.Document(text="Document 1 text..."),
    lx.data.Document(text="Document 2 text..."),
    lx.data.Document(text="Document 3 text...")
]

# Process all documents
results = lx.extract(
    text_or_documents=documents,
    prompt_description="Extract entities",
    examples=examples,
    max_workers=10  # Process in parallel
)

for doc in results:
    print(f"Found {len(doc.extractions)} extractions")

With Context Windows

import langextract as lx

# Use context windows to resolve references across chunks
result = lx.extract(
    text_or_documents=long_text,
    prompt_description="Extract people and their actions",
    examples=examples,
    max_char_buffer=500,
    context_window_chars=100  # Include 100 chars from previous chunk
)
