Overview
The extract() function is the core API for extracting structured information from text using language models. It processes text through an LLM according to your instructions and examples, and returns annotated documents containing the extracted entities.
Function Signature
Parameters
Required Parameters
The source text to extract information from. Can be:
- A string of text to analyze
- A URL starting with http:// or https:// (when fetch_urls=True)
- An iterable of Document objects for batch processing
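The three accepted input forms can be told apart with a small dispatch function. The following is a minimal sketch of that decision, not the library's code: classify_input is a hypothetical helper, and the only behavior taken from this page is that strings starting with http:// or https:// count as URLs when fetch_urls is True.

```python
from collections.abc import Iterable


def classify_input(value, fetch_urls=True):
    """Return 'url', 'text', or 'documents' for an extract()-style input.

    Hypothetical helper for illustration; not part of the library.
    """
    if isinstance(value, str):
        # Strings are only treated as URLs when fetching is enabled.
        if fetch_urls and value.startswith(("http://", "https://")):
            return "url"
        return "text"
    if isinstance(value, Iterable):
        # Any other iterable is assumed to yield Document objects.
        return "documents"
    raise TypeError(f"Unsupported input type: {type(value).__name__}")
```

Checking for str first matters because strings are themselves iterable and would otherwise be mistaken for a batch.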
List of ExampleData objects that guide the extraction. These few-shot examples show the model what information to extract and how to structure it.
Core Parameters
Instructions for what information to extract from the text. This natural language description guides the model’s extraction behavior.
Example: "Extract people and their employers"
The model ID to use for extraction (e.g., 'gemini-2.5-flash', 'gpt-4'). If your model ID is not recognized or you need a custom provider, use the config parameter with factory.ModelConfig to specify the provider explicitly.
API key for Gemini or other LLM services. Can also be set via the LANGEXTRACT_API_KEY environment variable.
Cost Considerations: Most APIs charge by token volume. Smaller max_char_buffer values increase the number of API calls, while extraction_passes > 1 reprocesses tokens multiple times. Note that max_workers improves processing speed without additional token costs. Monitor usage with small test runs to estimate costs.
Model Configuration
Model configuration to use for extraction. Takes precedence over the model_id, api_key, and language_model_type parameters. When both model and config are provided, model takes precedence.
Pre-configured language model instance to use for extraction. Takes precedence over all other parameters, including config.
The type of language model to use for inference. A warning is emitted when the value differs from the legacy default.
Endpoint URL for self-hosted or on-premises models. Only forwarded when the selected language_model_type accepts this argument.
Additional parameters to pass to the language model constructor.
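The precedence rules in this section (model over config, config over model_id and related parameters) can be summarized as a short decision function. This is a sketch only: resolve_model_source is a hypothetical name, and it reports which argument would win rather than building a real model.

```python
def resolve_model_source(model=None, config=None, model_id=None):
    """Report which argument determines the model, per the stated precedence.

    Hypothetical helper for illustration; not part of the library.
    """
    if model is not None:
        # A pre-configured model instance overrides everything else.
        return "model"
    if config is not None:
        # config overrides model_id, api_key, and language_model_type.
        return "config"
    # Otherwise the individual parameters (model_id and friends) apply.
    return "model_id"
```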
Output Format
The format type for the output (JSON or YAML). When None, defaults to FormatType.JSON.
Whether to expect/generate fenced output (```json or ```yaml).
- When True: the model generates fenced output and the resolver expects it
- When False: raw JSON/YAML is expected
- When None (default): automatically determined based on provider schema capabilities
With schema constraints, this can generally be set to False unless the constraint also accounts for code fence delimiters.
Whether to generate schema constraints for models. For supported models, this enables structured outputs.
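To make the fenced versus raw distinction concrete, here is a minimal sketch of how a parser might treat model output under each setting. It is an illustration under stated assumptions, not the resolver's implementation: parse_json_output is a hypothetical function, only JSON is handled, and the fence delimiter is built programmatically to keep the example readable.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick code fence delimiter

# Matches an optional language tag after the opening fence, then the payload.
_FENCED = re.compile(FENCE + r"(?:json|yaml)?\s*\n(.*?)" + FENCE, re.DOTALL)


def parse_json_output(raw, fence_output):
    """Parse model output as JSON, unwrapping a code fence when expected.

    Hypothetical helper for illustration; not part of the library.
    """
    if fence_output:
        match = _FENCED.search(raw)
        if match is None:
            raise ValueError("Expected fenced output but found no code fence")
        raw = match.group(1)
    return json.loads(raw)
```

With fence_output set to False, the same function passes the raw string straight to the JSON parser, which is why fenced text would fail to parse in that mode.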
Generation Parameters
The sampling temperature for generation.
- When None (default): uses the model’s default temperature
- Set to 0.0 for deterministic output
- Higher values produce more variation
Processing Parameters
Maximum number of characters for inference per chunk. Larger values mean fewer API calls but may exceed model context limits.
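A rough way to see how the chunk size drives the number of API calls is to divide the text length by the buffer size. The sketch below is an estimate only: estimate_chunk_calls is a hypothetical helper, and it assumes fixed-size chunks, whereas a real chunker is likely to respect token or sentence boundaries and so may produce a different count.

```python
import math


def estimate_chunk_calls(text_length, max_char_buffer, extraction_passes=1):
    """Estimate model calls for a text of text_length characters.

    Hypothetical helper; assumes fixed-size chunks and one call per
    chunk per pass.
    """
    if max_char_buffer <= 0:
        raise ValueError("max_char_buffer must be positive")
    chunks = math.ceil(text_length / max_char_buffer)
    return chunks * extraction_passes
```

For a 10,000-character document, halving the buffer from 1,000 to 500 doubles the estimate from 10 to 20 calls, and three passes at the larger buffer give 30.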
Number of text chunks processed per batch. Higher values enable greater parallelization when batch_length >= max_workers.
Maximum parallel workers for concurrent processing. Effective parallelization is limited by min(batch_length, max_workers). Supported by Gemini models.
Number of sequential extraction attempts to improve recall and find additional entities. When greater than 1, the system performs multiple independent extractions and merges non-overlapping results (first extraction wins for overlaps).
Number of characters from the previous chunk to include as context for the current chunk. This helps with coreference resolution across chunk boundaries (e.g., resolving “She” to a person mentioned in the previous chunk).
Additional context to be added to the prompt during inference.
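The merge rule for multiple extraction passes described in this section (keep non-overlapping results, first extraction wins for overlaps) can be sketched over plain character spans. This is an illustrative approximation, not the library's merge routine: merge_passes is a hypothetical function, and extractions are simplified to (start, end, label) tuples with half-open intervals.

```python
def merge_passes(passes):
    """Merge extraction passes, keeping the earliest span on overlap.

    passes: list of passes, earliest first; each pass is a list of
    (start, end, label) tuples with half-open character intervals.
    Hypothetical helper for illustration; not part of the library.
    """
    merged = []
    for extractions in passes:
        for start, end, label in extractions:
            # Two half-open intervals overlap when each starts before
            # the other ends.
            overlaps = any(
                start < kept_end and kept_start < end
                for kept_start, kept_end, _ in merged
            )
            if not overlaps:
                merged.append((start, end, label))
    # Return spans in document order.
    return sorted(merged)
```

Note that this sketch also drops overlapping spans within a single pass, which may or may not match the real behavior.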
Resolver Parameters
Parameters for the resolver.Resolver, which parses the raw language model output string into structured Extraction objects. This dictionary overrides default settings.
Available keys:
- extraction_index_suffix (str | None): Suffix for keys indicating extraction order. Default is None (order by appearance)
- enable_fuzzy_alignment (bool): Whether to use fuzzy matching if exact matching fails. Default is True
- fuzzy_alignment_threshold (float): Minimum token overlap ratio for fuzzy match (0.0-1.0). Default is 0.75
- accept_match_lesser (bool): Whether to accept partial exact matches. Default is True
- suppress_parse_errors (bool): Whether to suppress parsing errors and continue pipeline. Default is False
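To give a feel for what a token overlap threshold of 0.75 means, the sketch below scores how many tokens of an extraction appear in a candidate span. It is a simplified stand-in and an assumption about the general idea, not the resolver's alignment algorithm: both function names are hypothetical, and tokenization is plain whitespace splitting.

```python
def token_overlap_ratio(extraction_text, candidate_text):
    """Fraction of extraction tokens that also occur in the candidate.

    Hypothetical helper; uses lowercase whitespace tokens for simplicity.
    """
    extraction_tokens = extraction_text.lower().split()
    if not extraction_tokens:
        return 0.0
    candidate_tokens = set(candidate_text.lower().split())
    matched = sum(1 for token in extraction_tokens if token in candidate_tokens)
    return matched / len(extraction_tokens)


def accepts_fuzzy_match(extraction_text, candidate_text, threshold=0.75):
    """Accept the candidate when the overlap ratio meets the threshold."""
    return token_overlap_ratio(extraction_text, candidate_text) >= threshold
```

Under this scoring, an extraction that shares four of its five tokens with the source span scores 0.8 and passes the default threshold, while one sharing two of four tokens scores 0.5 and is rejected.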
Validation Parameters
Controls pre-flight alignment checks on few-shot examples:
- OFF: Skips validation
- WARNING: Logs issues but continues
- ERROR: Raises on failures
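The three levels can be pictured as one check with three possible reactions. The following sketch uses a deliberately simple check (does each example's extraction text appear verbatim in the example text?) and stand-in types: ValidationLevel and validate_examples are hypothetical names, and ValueError stands in for whatever exception the library actually raises.

```python
import enum
import logging


class ValidationLevel(enum.Enum):
    """Stand-in for the validation level setting (hypothetical)."""
    OFF = "off"
    WARNING = "warning"
    ERROR = "error"


def validate_examples(examples, level):
    """Check that each extraction span appears verbatim in its example text.

    examples: list of (text, [extraction_text, ...]) pairs.
    Returns the list of (example_index, span) issues found.
    """
    if level is ValidationLevel.OFF:
        return []  # validation skipped entirely
    issues = [
        (index, span)
        for index, (text, spans) in enumerate(examples)
        for span in spans
        if span not in text
    ]
    if issues and level is ValidationLevel.ERROR:
        # ValueError is an assumption made for this sketch.
        raise ValueError(f"Example alignment failed: {issues}")
    for index, span in issues:
        logging.warning("Example %d: %r not found in text", index, span)
    return issues
```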
When True and prompt_validation_level is ERROR, raises on non-exact matches (MATCH_FUZZY, MATCH_LESSER).
Utility Parameters
Whether to automatically download content when the input is a URL string.
- When True (default): strings starting with http:// or https:// are fetched
- When False: all strings are treated as literal text to analyze
This is a keyword-only parameter.
Whether to show a progress bar during extraction.
Whether to enable debug logging. When True, enables detailed logging of function calls, arguments, return values, and timing for the langextract namespace.
Optional Tokenizer instance to use for chunking and alignment. If None, defaults to RegexTokenizer.
Returns
Returns a single AnnotatedDocument when input is a string or URL, or a list of AnnotatedDocument objects when input is an iterable of Document objects.
Each AnnotatedDocument contains:
- text: The original text
- extractions: List of Extraction objects with extracted entities
- Each Extraction includes:
  - extraction_class: The entity type
  - extraction_text: The extracted text span
  - attributes: Dictionary of extracted attributes
  - char_interval: Character positions in the original text
Exceptions
Raised when:
- examples is None or empty
- No API key is provided or found in environment variables
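The API key condition above amounts to a lookup with a fallback. Below is a minimal sketch of that lookup, assuming the explicit argument is checked before the LANGEXTRACT_API_KEY environment variable. resolve_api_key is a hypothetical helper, and the use of ValueError is an assumption, since this page does not show the exception type.

```python
import os


def resolve_api_key(api_key=None, env=None):
    """Return the API key from the argument or the environment.

    Hypothetical helper; env defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = api_key or env.get("LANGEXTRACT_API_KEY")
    if not key:
        # Exception type is an assumption made for this sketch.
        raise ValueError(
            "No API key provided; pass api_key or set LANGEXTRACT_API_KEY"
        )
    return key
```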
Raised when URL download fails (when fetch_urls=True and input is a URL).
Raised when validation fails in ERROR mode (when prompt_validation_level is set to ERROR).
Examples
Basic Usage
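A basic call might look like the sketch below. It is not the page's original sample, which is missing here. The keyword names text_or_documents and prompt_description, the lx.data.ExampleData and lx.data.Extraction constructors, and the sample sentences are assumptions to verify against your installed version. The import is deferred so the argument builder can be inspected without the package or an API key.

```python
def build_basic_call(text, examples):
    """Collect keyword arguments for a basic extract() call.

    Keyword names text_or_documents and prompt_description are
    assumptions; check them against the installed library.
    """
    return {
        "text_or_documents": text,
        "prompt_description": "Extract people and their employers",
        "examples": examples,
        "model_id": "gemini-2.5-flash",
    }


def run_basic_example():
    """Run the extraction; requires langextract and a valid API key."""
    import langextract as lx  # deferred import, see note above

    # One few-shot example showing the desired output structure.
    examples = [
        lx.data.ExampleData(
            text="Jane Doe is a software engineer at Acme Corp.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="person",
                    extraction_text="Jane Doe",
                    attributes={"employer": "Acme Corp"},
                ),
            ],
        )
    ]
    result = lx.extract(
        **build_basic_call("John Smith joined Globex as an analyst.", examples)
    )
    # A single string input yields a single AnnotatedDocument.
    for extraction in result.extractions:
        print(extraction.extraction_class, extraction.extraction_text,
              extraction.attributes)
    return result
```

The API key is picked up from the LANGEXTRACT_API_KEY environment variable when it is not passed explicitly.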
Extract from URL
Multiple Extraction Passes
Custom Model Configuration
Batch Processing
With Context Windows
See Also
- visualize() - Visualize extraction results
- ExampleData - Learn about creating examples
- AnnotatedDocument - Understanding the return type