Overview
The resolver module provides functionality for parsing language model outputs into structured extractions and aligning them with the source text. It handles JSON/YAML parsing, fuzzy matching, and extraction alignment with token and character positions.
Module
from langextract import resolver
Classes
AbstractResolver
Abstract base class for resolvers.
class AbstractResolver(abc.ABC):
def __init__(
self,
fence_output: bool = True,
constraint: schema.Constraint = schema.Constraint(),
format_type: data.FormatType = data.FormatType.JSON
)
fence_output
bool
default:"True"
Whether to expect fenced output (json or yaml). When True, the resolver expects code fences. When False, raw JSON/YAML is expected.
constraint
Constraint
default:"Constraint()"
Applies constraint when decoding the output.
format_type
FormatType
default:"FormatType.JSON"
The format type for the output (JSON or YAML).
Abstract Methods:
resolve(): Parse input text into extractions
align(): Align extractions with source text positions
Resolver
Concrete resolver implementation for YAML/JSON-based extraction.
class Resolver(AbstractResolver):
def __init__(
self,
format_handler: FormatHandler | None = None,
extraction_index_suffix: str | None = None,
**kwargs
)
format_handler
FormatHandler | None
default:"None"
The format handler that knows how to parse output. If None, a default handler is created from kwargs.
extraction_index_suffix
str | None
default:"None"
Suffix identifying index keys that determine the ordering of extractions. For example, "_index" will sort by fields like "entity_index". If None, extractions are returned in appearance order.
**kwargs
Legacy parameters (fence_output, format_type, etc.) for backward compatibility. These create a FormatHandler if one is not provided.
Methods
resolve()
Parses LLM output text into structured extractions.
def resolve(
self,
input_text: str,
suppress_parse_errors: bool = False,
**kwargs
) -> Sequence[data.Extraction]
input_text
str
The input text to be processed (LLM output).
suppress_parse_errors
bool
default:"False"
If True, log errors and return an empty list instead of raising exceptions.
**kwargs
Additional keyword arguments.
Returns: Sequence of Extraction objects parsed from the input.
Raises: ResolverParsingError if the content cannot be parsed (unless suppress_parse_errors=True).
align()
Aligns extractions with source text, setting token/char intervals and alignment status.
def align(
self,
extractions: Sequence[data.Extraction],
source_text: str,
token_offset: int,
char_offset: int | None = None,
enable_fuzzy_alignment: bool = True,
fuzzy_alignment_threshold: float = 0.75,
accept_match_lesser: bool = True,
tokenizer_inst: Tokenizer | None = None,
**kwargs
) -> Iterator[data.Extraction]
extractions
Sequence[Extraction]
Annotated extractions to align with the source text.
source_text
str
The text in which to align the extractions.
token_offset
int
The token offset corresponding to the starting token index of the chunk.
char_offset
int | None
default:"None"
The char offset corresponding to the starting character index of the chunk.
enable_fuzzy_alignment
bool
default:"True"
Whether to use fuzzy alignment when exact matching fails.
fuzzy_alignment_threshold
float
default:"0.75"
Minimum token overlap ratio for fuzzy alignment (0-1).
accept_match_lesser
bool
default:"True"
Whether to accept partial exact matches (MATCH_LESSER status).
tokenizer_inst
Tokenizer | None
default:"None"
Optional tokenizer instance.
Returns: Iterator yielding aligned extractions with updated intervals and alignment status.
Alignment Status Values:
MATCH_EXACT: Perfect token-level match
MATCH_LESSER: Partial exact match (extraction longer than matched text)
MATCH_FUZZY: Best overlap window meets threshold (≥ fuzzy_alignment_threshold)
None: No alignment found
extract_ordered_extractions()
Extracts and orders extraction data based on associated indexes.
def extract_ordered_extractions(
self,
extraction_data: Sequence[Mapping[str, ExtractionValueType]]
) -> Sequence[data.Extraction]
extraction_data
Sequence[Mapping[str, ExtractionValueType]]
A list of dictionaries containing extraction class keys and their values, along with optional index keys.
Returns: Extractions sorted by the index attribute or by order of appearance.
Raises: ValueError if extraction text is not a string/integer/float, or if index is not an integer.
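The ordering rule can be sketched in a few lines of standalone Python. This is an illustration only, not the library's implementation; order_by_index is a hypothetical helper:

```python
# Illustrative sketch of index-based ordering: the key ending in the index
# suffix supplies the sort key; rows without one keep appearance order.

def order_by_index(rows: list[dict], suffix: str = "_index") -> list[tuple[str, object]]:
    keyed = []
    for pos, row in enumerate(rows):
        # Explicit index when present, else position of appearance.
        index = next((v for k, v in row.items() if k.endswith(suffix)), pos)
        cls, text = next((k, v) for k, v in row.items() if not k.endswith(suffix))
        keyed.append((index, cls, text))
    keyed.sort(key=lambda item: item[0])
    return [(cls, text) for _, cls, text in keyed]

rows = [
    {"entity": "second", "entity_index": 2},
    {"entity": "first", "entity_index": 1},
]
print(order_by_index(rows))  # [('entity', 'first'), ('entity', 'second')]
```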
Helper Classes
WordAligner
Aligns words between two sequences of tokens using Python’s difflib.
class WordAligner:
def align_extractions(
self,
extraction_groups: Sequence[Sequence[data.Extraction]],
source_text: str,
token_offset: int = 0,
char_offset: int = 0,
delim: str = "\u241F",
enable_fuzzy_alignment: bool = True,
fuzzy_alignment_threshold: float = 0.75,
accept_match_lesser: bool = True,
tokenizer_impl: Tokenizer | None = None
) -> Sequence[Sequence[data.Extraction]]
extraction_groups
Sequence[Sequence[Extraction]]
A sequence of sequences, where each inner sequence contains Extraction objects.
source_text
str
The source text against which extractions are aligned.
token_offset
int
default:"0"
Offset to add to token interval indices.
char_offset
int
default:"0"
Offset to add to character interval positions.
delim
str
default:"\u241F"
Token used to separate multi-token extractions (Unicode unit separator).
enable_fuzzy_alignment
bool
default:"True"
Whether to use fuzzy alignment when exact matching fails.
fuzzy_alignment_threshold
float
default:"0.75"
Minimum token overlap ratio for fuzzy alignment.
accept_match_lesser
bool
default:"True"
Whether to accept partial exact matches.
tokenizer_impl
Tokenizer | None
default:"None"
Optional tokenizer instance.
return
Sequence[Sequence[Extraction]]
Sequence of extractions aligned with the source text, including token intervals.
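The exact-match step builds on difflib's SequenceMatcher over word sequences. The following is a simplified standalone sketch of that mechanism, not the WordAligner's actual code (the real aligner also produces token and character intervals and falls back to fuzzy matching):

```python
# Sketch: exact token-level matching with difflib's SequenceMatcher.
from difflib import SequenceMatcher

source_tokens = "John Smith founded Acme Corp in 2020".split()
extraction_tokens = "Acme Corp".split()

matcher = SequenceMatcher(a=source_tokens, b=extraction_tokens, autojunk=False)
match = matcher.find_longest_match(0, len(source_tokens), 0, len(extraction_tokens))

if match.size == len(extraction_tokens):
    # All extraction tokens matched contiguously: an exact token-level match.
    print(f"MATCH_EXACT at source tokens {match.a}..{match.a + match.size - 1}")
```

Here "Acme Corp" matches source tokens 3..4; a shorter contiguous match would correspond to the MATCH_LESSER case.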
Usage Examples
Basic Resolve and Align
from langextract.resolver import Resolver
from langextract.core.data import FormatType
# Create resolver
resolver = Resolver(format_type=FormatType.YAML)
# Parse LLM output
llm_output = """
extractions:
- person: John Smith
person_index: 1
- organization: Acme Corp
organization_index: 2
"""
extractions = resolver.resolve(llm_output)
for extraction in extractions:
print(f"{extraction.extraction_class}: {extraction.extraction_text}")
# Align with source text
source_text = "John Smith founded Acme Corp in 2020."
aligned = resolver.align(
extractions,
source_text,
token_offset=0,
char_offset=0
)
for extraction in aligned:
if extraction.char_interval:
start = extraction.char_interval.start_pos
end = extraction.char_interval.end_pos
print(f"{extraction.extraction_class}: '{source_text[start:end]}'")
print(f" Position: {start}-{end}")
print(f" Alignment: {extraction.alignment_status}")
Handling Parse Errors
from langextract.resolver import Resolver
from langextract.core.data import FormatType
resolver = Resolver(format_type=FormatType.JSON)
# Invalid JSON
invalid_output = "{'invalid': json}"
# Suppress errors and continue
extractions = resolver.resolve(
invalid_output,
suppress_parse_errors=True
)
print(f"Extracted {len(extractions)} items") # Returns empty list
# Or catch the exception (note: the Resolver instance above shadows the
# resolver module, so import the exception class directly)
from langextract.resolver import ResolverParsingError
try:
    extractions = resolver.resolve(invalid_output)
except ResolverParsingError as e:
print(f"Parse error: {e}")
Fuzzy Alignment
from langextract.resolver import Resolver
from langextract.core.data import Extraction, FormatType
resolver = Resolver(format_type=FormatType.YAML)
# Extraction text doesn't exactly match source
extractions = [
Extraction(
extraction_class="person",
extraction_text="Sarah Johnson", # Missing "Dr."
extraction_index=1
)
]
source_text = "Dr. Sarah Johnson is the lead researcher."
# Fuzzy alignment will find best match
aligned = list(resolver.align(
extractions,
source_text,
token_offset=0,
char_offset=0,
enable_fuzzy_alignment=True,
fuzzy_alignment_threshold=0.8
))
for extraction in aligned:
print(f"Status: {extraction.alignment_status}")
if extraction.char_interval:
start = extraction.char_interval.start_pos
end = extraction.char_interval.end_pos
print(f"Matched: '{source_text[start:end]}'")
Custom Index Suffix
from langextract.resolver import Resolver
from langextract.core.format_handler import FormatHandler
from langextract.core.data import FormatType
# Use custom index suffix for ordering
format_handler = FormatHandler(
format_type=FormatType.JSON,
use_wrapper=True,
wrapper_key="extractions"
)
resolver = Resolver(
format_handler=format_handler,
extraction_index_suffix="_order" # Use _order instead of _index
)
llm_output = '''
{
"extractions": [
{"entity": "second", "entity_order": 2},
{"entity": "first", "entity_order": 1}
]
}
'''
extractions = resolver.resolve(llm_output)
for extraction in extractions:
print(f"{extraction.extraction_index}: {extraction.extraction_text}")
# Output:
# 1: first
# 2: second
Disable Fuzzy Alignment
from langextract.resolver import Resolver
from langextract.core.data import Extraction, FormatType
resolver = Resolver(format_type=FormatType.YAML)
extractions = [
Extraction(
extraction_class="entity",
extraction_text="inexact match",
extraction_index=1
)
]
source_text = "This is an exact match."
# Only exact matches will be aligned
aligned = list(resolver.align(
extractions,
source_text,
token_offset=0,
enable_fuzzy_alignment=False
))
for extraction in aligned:
if extraction.alignment_status is None:
print("No alignment found (fuzzy disabled)")
Working with Attributes
from langextract.resolver import Resolver
from langextract.core.format_handler import FormatHandler
from langextract.core.data import FormatType
format_handler = FormatHandler(
format_type=FormatType.YAML,
attribute_suffix="_attrs"
)
resolver = Resolver(format_handler=format_handler)
llm_output = """
extractions:
- person: Dr. Smith
person_index: 1
person_attrs:
title: Doctor
specialty: Cardiology
"""
extractions = resolver.resolve(llm_output)
for extraction in extractions:
print(f"{extraction.extraction_class}: {extraction.extraction_text}")
if extraction.attributes:
print(f" Attributes: {extraction.attributes}")
Token Offset Example
from langextract.resolver import Resolver
from langextract.core.data import Extraction, FormatType
resolver = Resolver(format_type=FormatType.YAML)
# This is chunk 2 of a larger document
chunk_text = "is the lead researcher."
token_offset = 5 # Chunk starts at token 5 in full document
char_offset = 15 # Chunk starts at char 15 in full document
extractions = [
Extraction(
extraction_class="role",
extraction_text="lead researcher",
extraction_index=1
)
]
aligned = list(resolver.align(
extractions,
chunk_text,
token_offset=token_offset,
char_offset=char_offset
))
for extraction in aligned:
# Token and char intervals are relative to full document
print(f"Token interval: {extraction.token_interval}")
print(f"Char interval: {extraction.char_interval}")
Notes
- The resolver uses difflib’s SequenceMatcher for exact alignment
- Fuzzy alignment scans all candidate windows for best overlap ratio
- Token normalization applies light stemming (removes trailing ‘s’) to improve matching
- Use extraction_index_suffix to control extraction ordering
- Set suppress_parse_errors=True to continue processing despite parse failures
- Alignment status helps identify extraction quality for filtering or validation
- WordAligner uses a delimiter (Unicode unit separator) to separate extractions
- Character and token offsets allow mapping chunks back to original document positions
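The light stemming mentioned above can be sketched as follows. This is a hypothetical helper illustrating the idea; the library's exact normalization rules may differ:

```python
# Sketch of light token normalization: lowercase and strip a trailing "s"
# so plural/singular forms compare equal during matching.

def normalize(token: str) -> str:
    token = token.lower()
    if token.endswith("s") and len(token) > 1:
        return token[:-1]
    return token

print(normalize("Researchers") == normalize("researcher"))  # True
```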