Overview

The resolver module provides functionality for parsing language model outputs into structured extractions and aligning them with the source text. It handles JSON/YAML parsing, fuzzy matching, and extraction alignment with token and character positions.

Module

from langextract import resolver

Classes

AbstractResolver

Abstract base class for resolvers.
class AbstractResolver(abc.ABC):
    def __init__(
        self,
        fence_output: bool = True,
        constraint: schema.Constraint = schema.Constraint(),
        format_type: data.FormatType = data.FormatType.JSON
    )
fence_output
bool
default:"True"
Whether to expect fenced output (json or yaml). When True, the resolver expects code fences. When False, raw JSON/YAML is expected.
constraint
Constraint
default:"Constraint()"
Applies constraint when decoding the output.
format_type
FormatType
default:"FormatType.JSON"
The format type for the output (JSON or YAML).
Abstract Methods:
  • resolve(): Parse input text into extractions
  • align(): Align extractions with source text positions
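The contract can be pictured with a small self-contained sketch. This is an illustrative stand-in with simplified types, not the real `AbstractResolver` signatures: a subclass implements `resolve()` to parse model output and `align()` to attach source positions.

```python
import abc

class MiniResolver(abc.ABC):
    """Toy mirror of the resolver contract (simplified, illustrative only)."""

    @abc.abstractmethod
    def resolve(self, input_text: str) -> list:
        """Parse model output into extraction records."""

    @abc.abstractmethod
    def align(self, extractions: list, source_text: str) -> list:
        """Attach source positions to each extraction."""

class EchoResolver(MiniResolver):
    def resolve(self, input_text):
        # Treat each non-empty line of model output as one extraction.
        return [line.strip() for line in input_text.splitlines() if line.strip()]

    def align(self, extractions, source_text):
        # Pair each extraction with its character offset in the source.
        return [(e, source_text.find(e)) for e in extractions]

r = EchoResolver()
print(r.align(r.resolve("Acme Corp\n"), "John founded Acme Corp."))
# → [('Acme Corp', 13)]
```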

Resolver

Concrete resolver implementation for YAML/JSON-based extraction.
class Resolver(AbstractResolver):
    def __init__(
        self,
        format_handler: FormatHandler | None = None,
        extraction_index_suffix: str | None = None,
        **kwargs
    )
format_handler
FormatHandler | None
default:"None"
The format handler that knows how to parse output. If None, a default handler is created from kwargs.
extraction_index_suffix
str | None
default:"None"
Suffix identifying index keys that determine the ordering of extractions. For example, "_index" will sort by fields like "entity_index". If None, extractions are returned in appearance order.
**kwargs
Any
Legacy parameters (fence_output, format_type, etc.) for backward compatibility. These create a FormatHandler if one is not provided.

Methods

resolve()

Parses LLM output text into structured extractions.
def resolve(
    self,
    input_text: str,
    suppress_parse_errors: bool = False,
    **kwargs
) -> Sequence[data.Extraction]
input_text
str
required
The input text to be processed (LLM output).
suppress_parse_errors
bool
default:"False"
If True, log errors and return empty list instead of raising exceptions.
**kwargs
Any
Additional keyword arguments.
return
Sequence[Extraction]
Sequence of Extraction objects parsed from the input.
Raises: ResolverParsingError if the content cannot be parsed (unless suppress_parse_errors=True).

align()

Aligns extractions with source text, setting token/char intervals and alignment status.
def align(
    self,
    extractions: Sequence[data.Extraction],
    source_text: str,
    token_offset: int,
    char_offset: int | None = None,
    enable_fuzzy_alignment: bool = True,
    fuzzy_alignment_threshold: float = 0.75,
    accept_match_lesser: bool = True,
    tokenizer_inst: Tokenizer | None = None,
    **kwargs
) -> Iterator[data.Extraction]
extractions
Sequence[Extraction]
required
Annotated extractions to align with the source text.
source_text
str
required
The text in which to align the extractions.
token_offset
int
required
The token offset corresponding to the starting token index of the chunk.
char_offset
int | None
default:"None"
The char offset corresponding to the starting character index of the chunk.
enable_fuzzy_alignment
bool
default:"True"
Whether to use fuzzy alignment when exact matching fails.
fuzzy_alignment_threshold
float
default:"0.75"
Minimum token overlap ratio for fuzzy alignment (0-1).
accept_match_lesser
bool
default:"True"
Whether to accept partial exact matches (MATCH_LESSER status).
tokenizer_inst
Tokenizer | None
default:"None"
Optional tokenizer instance.
**kwargs
Any
Additional parameters.
return
Iterator[Extraction]
Iterator yielding aligned extractions with updated intervals and alignment status.
Alignment Status Values:
  • MATCH_EXACT: Perfect token-level match
  • MATCH_LESSER: Partial exact match (extraction longer than matched text)
  • MATCH_FUZZY: Best overlap window meets threshold (≥ fuzzy_alignment_threshold)
  • None: No alignment found
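The fuzzy path can be sketched in standalone Python (an illustration of the idea, not the library's implementation): slide a window of the extraction's length over the source tokens, score each window by token-overlap ratio, and accept the best window only if it meets the threshold.

```python
from difflib import SequenceMatcher

def fuzzy_window_ratio(extraction_tokens, source_tokens, threshold=0.75):
    """Return (start, end, ratio) of the best candidate window, or None.

    Illustrative sketch only; the library's WordAligner has its own logic.
    """
    n = len(extraction_tokens)
    best = None
    for start in range(len(source_tokens) - n + 1):
        window = source_tokens[start:start + n]
        matcher = SequenceMatcher(a=extraction_tokens, b=window, autojunk=False)
        # Fraction of extraction tokens matched within this window.
        matched = sum(block.size for block in matcher.get_matching_blocks())
        ratio = matched / n
        if best is None or ratio > best[2]:
            best = (start, start + n, ratio)
    if best and best[2] >= threshold:
        return best
    return None

print(fuzzy_window_ratio(
    ["sarah", "johnson"],
    ["dr", "sarah", "johnson", "is", "the", "lead", "researcher"],
))
# → (1, 3, 1.0)
```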

extract_ordered_extractions()

Extracts and orders extraction data based on associated indexes.
def extract_ordered_extractions(
    self,
    extraction_data: Sequence[Mapping[str, ExtractionValueType]]
) -> Sequence[data.Extraction]
extraction_data
Sequence[Mapping[str, ExtractionValueType]]
required
A list of dictionaries containing extraction class keys and their values, along with optional index keys.
return
Sequence[Extraction]
Extractions sorted by the index attribute or by order of appearance.
Raises: ValueError if extraction text is not a string/integer/float, or if index is not an integer.
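One plausible reading of the ordering rule can be sketched in plain Python (a simplified stand-in, not the library's code): rows carrying a key ending in the index suffix are sorted by that integer index, and rows without one keep their order of appearance.

```python
def order_by_index_suffix(rows, suffix="_index"):
    """Sort extraction dicts by their '<class><suffix>' key when present.

    Simplified illustration of the extract_ordered_extractions ordering.
    """
    def sort_key(pair):
        position, row = pair
        for key, value in row.items():
            if key.endswith(suffix):
                if not isinstance(value, int):
                    raise ValueError(f"Index {key!r} must be an integer")
                return (0, value)
        # No index key: preserve appearance order.
        return (1, position)

    return [row for _, row in sorted(enumerate(rows), key=sort_key)]

rows = [
    {"entity": "second", "entity_index": 2},
    {"entity": "first", "entity_index": 1},
]
print(order_by_index_suffix(rows))
# → [{'entity': 'first', 'entity_index': 1}, {'entity': 'second', 'entity_index': 2}]
```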

Helper Classes

WordAligner

Aligns words between two sequences of tokens using Python’s difflib.
class WordAligner:
    def align_extractions(
        self,
        extraction_groups: Sequence[Sequence[data.Extraction]],
        source_text: str,
        token_offset: int = 0,
        char_offset: int = 0,
        delim: str = "\u241F",
        enable_fuzzy_alignment: bool = True,
        fuzzy_alignment_threshold: float = 0.75,
        accept_match_lesser: bool = True,
        tokenizer_impl: Tokenizer | None = None
    ) -> Sequence[Sequence[data.Extraction]]
extraction_groups
Sequence[Sequence[Extraction]]
required
A sequence of sequences, where each inner sequence contains Extraction objects.
source_text
str
required
The source text against which extractions are aligned.
token_offset
int
default:"0"
Offset to add to token interval indices.
char_offset
int
default:"0"
Offset to add to character interval positions.
delim
str
default:"\\u241F"
Token used to separate multi-token extractions (Unicode unit separator).
enable_fuzzy_alignment
bool
default:"True"
Whether to use fuzzy alignment when exact matching fails.
fuzzy_alignment_threshold
float
default:"0.75"
Minimum token overlap ratio for fuzzy alignment.
accept_match_lesser
bool
default:"True"
Whether to accept partial exact matches.
tokenizer_impl
Tokenizer | None
default:"None"
Optional tokenizer instance.
return
Sequence[Sequence[Extraction]]
Sequence of extractions aligned with the source text, including token intervals.
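The Notes below mention that exact alignment is built on difflib's SequenceMatcher. The core step can be sketched standalone (a simplified illustration; the real WordAligner additionally joins multiple extractions with a delimiter and handles partial matches):

```python
from difflib import SequenceMatcher

def exact_token_span(extraction_tokens, source_tokens):
    """Locate extraction_tokens in source_tokens via SequenceMatcher.

    Returns a (start, end) token interval, or None if no full match exists.
    """
    matcher = SequenceMatcher(
        a=extraction_tokens, b=source_tokens, autojunk=False
    )
    match = matcher.find_longest_match(
        0, len(extraction_tokens), 0, len(source_tokens)
    )
    # Only accept a match covering the whole extraction.
    if match.size == len(extraction_tokens):
        return (match.b, match.b + match.size)
    return None

tokens = "john smith founded acme corp in 2020".split()
print(exact_token_span(["acme", "corp"], tokens))  # → (3, 5)
```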

Usage Examples

Basic Resolve and Align

from langextract.resolver import Resolver
from langextract.core.data import FormatType

# Create resolver
resolver = Resolver(format_type=FormatType.YAML)

# Parse LLM output
llm_output = """
extractions:
  - person: John Smith
    person_index: 1
  - organization: Acme Corp
    organization_index: 2
"""

extractions = resolver.resolve(llm_output)

for extraction in extractions:
    print(f"{extraction.extraction_class}: {extraction.extraction_text}")

# Align with source text
source_text = "John Smith founded Acme Corp in 2020."
aligned = resolver.align(
    extractions,
    source_text,
    token_offset=0,
    char_offset=0
)

for extraction in aligned:
    if extraction.char_interval:
        start = extraction.char_interval.start_pos
        end = extraction.char_interval.end_pos
        print(f"{extraction.extraction_class}: '{source_text[start:end]}'")
        print(f"  Position: {start}-{end}")
        print(f"  Alignment: {extraction.alignment_status}")

Handling Parse Errors

from langextract import resolver
from langextract.core.data import FormatType

json_resolver = resolver.Resolver(format_type=FormatType.JSON)

# Invalid JSON
invalid_output = "{'invalid': json}"

# Suppress errors and continue with an empty result
extractions = json_resolver.resolve(
    invalid_output,
    suppress_parse_errors=True
)
print(f"Extracted {len(extractions)} items")  # Returns empty list

# Or catch the exception
try:
    extractions = json_resolver.resolve(invalid_output)
except resolver.ResolverParsingError as e:
    print(f"Parse error: {e}")

Fuzzy Alignment

from langextract.resolver import Resolver
from langextract.core.data import Extraction, FormatType

resolver = Resolver(format_type=FormatType.YAML)

# Extraction text doesn't exactly match source
extractions = [
    Extraction(
        extraction_class="person",
        extraction_text="Sarah Johnson",  # Missing "Dr."
        extraction_index=1
    )
]

source_text = "Dr. Sarah Johnson is the lead researcher."

# Fuzzy alignment will find best match
aligned = list(resolver.align(
    extractions,
    source_text,
    token_offset=0,
    char_offset=0,
    enable_fuzzy_alignment=True,
    fuzzy_alignment_threshold=0.8
))

for extraction in aligned:
    print(f"Status: {extraction.alignment_status}")
    if extraction.char_interval:
        start = extraction.char_interval.start_pos
        end = extraction.char_interval.end_pos
        print(f"Matched: '{source_text[start:end]}'")

Custom Index Suffix

from langextract.resolver import Resolver
from langextract.core.format_handler import FormatHandler
from langextract.core.data import FormatType

# Use custom index suffix for ordering
format_handler = FormatHandler(
    format_type=FormatType.JSON,
    use_wrapper=True,
    wrapper_key="extractions"
)

resolver = Resolver(
    format_handler=format_handler,
    extraction_index_suffix="_order"  # Use _order instead of _index
)

llm_output = '''
{
  "extractions": [
    {"entity": "second", "entity_order": 2},
    {"entity": "first", "entity_order": 1}
  ]
}
'''

extractions = resolver.resolve(llm_output)

for extraction in extractions:
    print(f"{extraction.extraction_index}: {extraction.extraction_text}")
# Output:
# 1: first
# 2: second

Disable Fuzzy Alignment

from langextract.resolver import Resolver
from langextract.core.data import Extraction, FormatType

resolver = Resolver(format_type=FormatType.YAML)

extractions = [
    Extraction(
        extraction_class="entity",
        extraction_text="inexact match",
        extraction_index=1
    )
]

source_text = "This is an exact match."

# Only exact matches will be aligned
aligned = list(resolver.align(
    extractions,
    source_text,
    token_offset=0,
    enable_fuzzy_alignment=False
))

for extraction in aligned:
    if extraction.alignment_status is None:
        print("No alignment found (fuzzy disabled)")

Working with Attributes

from langextract.resolver import Resolver
from langextract.core.format_handler import FormatHandler
from langextract.core.data import FormatType

format_handler = FormatHandler(
    format_type=FormatType.YAML,
    attribute_suffix="_attrs"
)

resolver = Resolver(format_handler=format_handler)

llm_output = """
extractions:
  - person: Dr. Smith
    person_index: 1
    person_attrs:
      title: Doctor
      specialty: Cardiology
"""

extractions = resolver.resolve(llm_output)

for extraction in extractions:
    print(f"{extraction.extraction_class}: {extraction.extraction_text}")
    if extraction.attributes:
        print(f"  Attributes: {extraction.attributes}")

Token Offset Example

from langextract.resolver import Resolver
from langextract.core.data import Extraction, FormatType

resolver = Resolver(format_type=FormatType.YAML)

# This is chunk 2 of a larger document
chunk_text = "is the lead researcher."
token_offset = 5  # Chunk starts at token 5 in full document
char_offset = 15  # Chunk starts at char 15 in full document

extractions = [
    Extraction(
        extraction_class="role",
        extraction_text="lead researcher",
        extraction_index=1
    )
]

aligned = list(resolver.align(
    extractions,
    chunk_text,
    token_offset=token_offset,
    char_offset=char_offset
))

for extraction in aligned:
    # Token and char intervals are relative to full document
    print(f"Token interval: {extraction.token_interval}")
    print(f"Char interval: {extraction.char_interval}")

Notes

  • The resolver uses difflib’s SequenceMatcher for exact alignment
  • Fuzzy alignment scans all candidate windows for best overlap ratio
  • Token normalization applies light stemming (removes trailing ‘s’) to improve matching
  • Use extraction_index_suffix to control extraction ordering
  • Set suppress_parse_errors=True to continue processing despite parse failures
  • Alignment status helps identify extraction quality for filtering or validation
  • WordAligner uses a delimiter (Unicode unit separator) to separate extractions
  • Character and token offsets allow mapping chunks back to original document positions
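The light-stemming note above can be illustrated with a hypothetical helper (the library's tokenizer may normalize differently; this only mirrors the "lowercase and strip a trailing 's'" description):

```python
def normalize_token(token):
    """Lowercase a token and strip one trailing 's' (illustrative only)."""
    token = token.lower()
    if len(token) > 1 and token.endswith("s"):
        token = token[:-1]
    return token

print([normalize_token(t) for t in ["Researchers", "Corp"]])
# → ['researcher', 'corp']
```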
