The AnnotatedDocument class represents a document that has been annotated with extractions. It pairs the document's raw text with the list of Extraction objects found in that text.

Constructor

AnnotatedDocument(
    *,
    document_id: str | None = None,
    extractions: list[Extraction] | None = None,
    text: str | None = None
)
document_id
str | None
A unique identifier for the document. If not provided, a unique ID is generated the first time document_id is accessed.
extractions
list[Extraction] | None
A list of Extraction objects representing information extracted from the document.
text
str | None
The raw text representation of the document.

Attributes

document_id
str
A unique identifier for the document. Auto-generated in the format doc_XXXXXXXX if not explicitly set.
extractions
list[Extraction] | None
A list of Extraction objects extracted from the document.
text
str | None
The raw text representation of the document.
tokenized_text
TokenizedText | None
The tokenized representation of the document text. Automatically computed from text when first accessed (if text is available).

Example

Basic Usage

from langextract.core.data import AnnotatedDocument, Extraction

# Create an annotated document
annotated_doc = AnnotatedDocument(
    text="Apple Inc. was founded by Steve Jobs in Cupertino.",
    extractions=[
        Extraction(
            extraction_class="ORGANIZATION",
            extraction_text="Apple Inc."
        ),
        Extraction(
            extraction_class="PERSON",
            extraction_text="Steve Jobs"
        ),
        Extraction(
            extraction_class="LOCATION",
            extraction_text="Cupertino"
        )
    ]
)

print(f"Document ID: {annotated_doc.document_id}")
print(f"Found {len(annotated_doc.extractions)} extractions")

for extraction in annotated_doc.extractions:
    print(f"{extraction.extraction_class}: {extraction.extraction_text}")

With Custom Document ID

from langextract.core.data import AnnotatedDocument, Extraction

annotated_doc = AnnotatedDocument(
    document_id="tech_company_bio_001",
    text="Microsoft was founded in 1975.",
    extractions=[
        Extraction(extraction_class="ORG", extraction_text="Microsoft"),
        Extraction(extraction_class="DATE", extraction_text="1975")
    ]
)

print(annotated_doc.document_id)  # "tech_company_bio_001"

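Processing Extraction Results

import langextract as lx

# extract() needs a prompt and at least one worked example
examples = [
    lx.data.ExampleData(
        text="Marie Curie was born in Warsaw in 1867.",
        extractions=[
            lx.data.Extraction(extraction_class="PERSON", extraction_text="Marie Curie"),
            lx.data.Extraction(extraction_class="LOCATION", extraction_text="Warsaw"),
            lx.data.Extraction(extraction_class="DATE", extraction_text="1867"),
        ],
    )
]

# For a single input string, extract() returns an AnnotatedDocument.
# Model access (for example, an API key) must be configured separately.
result = lx.extract(
    text_or_documents="Barack Obama was born in Hawaii on August 4, 1961.",
    prompt_description="Extract people, locations, and dates.",
    examples=examples,
    model_id="gemini-2.5-flash",
)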

# Access the annotated document
print(f"Document: {result.text}")
print(f"Extractions: {len(result.extractions)}")

for extraction in result.extractions:
    print(f"  - {extraction.extraction_class}: {extraction.extraction_text}")

Filtering Extractions

from langextract.core.data import AnnotatedDocument, Extraction

annotated_doc = AnnotatedDocument(
    text="Visit Paris, London, or Tokyo for vacation.",
    extractions=[
        Extraction(extraction_class="LOCATION", extraction_text="Paris"),
        Extraction(extraction_class="LOCATION", extraction_text="London"),
        Extraction(extraction_class="LOCATION", extraction_text="Tokyo")
    ]
)

# Filter extractions by class
locations = [
    e for e in annotated_doc.extractions 
    if e.extraction_class == "LOCATION"
]
print(f"Found {len(locations)} locations")
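
When a document mixes several classes, grouping the extractions can be more convenient than filtering one class at a time. The sketch below is one possible approach. It uses a small stand-in Extraction dataclass so that it runs without langextract installed (an assumption for illustration), and group_by_class is a hypothetical helper, not part of the library. It relies only on the extraction_class and extraction_text fields shown above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Extraction:
    """Stand-in for the library's Extraction (assumption: only two fields needed)."""
    extraction_class: str
    extraction_text: str


def group_by_class(extractions):
    """Hypothetical helper: map each extraction_class to its extraction texts."""
    grouped = defaultdict(list)
    # `or []` guards against extractions being None
    for e in extractions or []:
        grouped[e.extraction_class].append(e.extraction_text)
    return dict(grouped)


extractions = [
    Extraction("LOCATION", "Paris"),
    Extraction("PERSON", "Marie Curie"),
    Extraction("LOCATION", "Tokyo"),
]
grouped = group_by_class(extractions)
print(grouped)
```

Because the helper treats None as an empty list, it can be applied to annotated_doc.extractions directly even when no extractions were set.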

Creating from Scratch

from langextract.core.data import AnnotatedDocument, Extraction, CharInterval

# Create an annotated document with detailed extraction information
annotated_doc = AnnotatedDocument(
    document_id="article_001",
    text="Amazon CEO Jeff Bezos announced new initiatives.",
    extractions=[
        Extraction(
            extraction_class="ORG",
            extraction_text="Amazon",
            char_interval=CharInterval(start_pos=0, end_pos=6),
            description="E-commerce company"
        ),
        Extraction(
            extraction_class="PERSON",
            extraction_text="Jeff Bezos",
            char_interval=CharInterval(start_pos=11, end_pos=21),
            attributes={"role": "CEO"}
        )
    ]
)
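
Hand-written character offsets are easy to get wrong, so it can help to check them against the document text. The sketch below assumes end_pos is exclusive, which is consistent with the example above ("Amazon" spans 0 to 6). The dataclasses are stand-ins so that the snippet runs without langextract installed, and interval_matches is a hypothetical helper, not a library function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharInterval:
    """Stand-in for the library's CharInterval (assumption: end_pos is exclusive)."""
    start_pos: Optional[int] = None
    end_pos: Optional[int] = None


@dataclass
class Extraction:
    """Stand-in carrying only the fields this check needs."""
    extraction_class: str
    extraction_text: str
    char_interval: Optional[CharInterval] = None


def interval_matches(text, extraction):
    """Hypothetical check: does text[start_pos:end_pos] equal extraction_text?"""
    ci = extraction.char_interval
    if ci is None or ci.start_pos is None or ci.end_pos is None:
        return False
    return text[ci.start_pos:ci.end_pos] == extraction.extraction_text


text = "Amazon CEO Jeff Bezos announced new initiatives."
org = Extraction("ORG", "Amazon", CharInterval(0, 6))
person = Extraction("PERSON", "Jeff Bezos", CharInterval(11, 21))
shifted = Extraction("PERSON", "Jeff Bezos", CharInterval(10, 20))  # off by one

print(interval_matches(text, org))      # True
print(interval_matches(text, person))   # True
print(interval_matches(text, shifted))  # False
```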

Properties

document_id

The document_id property automatically generates a unique identifier if one wasn’t provided during initialization.
doc = AnnotatedDocument(text="Some text")
print(doc.document_id)  # Auto-generated: "doc_a1b2c3d4"
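
To illustrate the behavior described above, here is a minimal sketch of a lazily generated identifier. DocSketch is a hypothetical stand-in, not the library's class, and the use of eight hexadecimal characters from a UUID is an assumption chosen to match the doc_XXXXXXXX format.

```python
import uuid
from typing import Optional


class DocSketch:
    """Illustrative stand-in, not the library's AnnotatedDocument."""

    def __init__(self, *, document_id: Optional[str] = None,
                 text: Optional[str] = None):
        self._document_id = document_id
        self.text = text

    @property
    def document_id(self) -> str:
        # Generate the ID once, on first access, then reuse the stored value
        if self._document_id is None:
            self._document_id = f"doc_{uuid.uuid4().hex[:8]}"
        return self._document_id


doc = DocSketch(text="Some text")
first = doc.document_id
second = doc.document_id
print(first, first == second)
```

Under this design the generated ID is stable for the lifetime of the object, and an explicitly supplied ID is never overwritten.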

tokenized_text

The tokenized_text property lazily computes and caches the tokenized representation of the document text when first accessed.
doc = AnnotatedDocument(text="Hello world")
if doc.text:
    tokens = doc.tokenized_text  # Computed on first access
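
The compute-once behavior described above can be sketched with a cached property. LazyTokensSketch is a hypothetical stand-in, and str.split() is only a placeholder for the library's tokenizer. The call counter exists solely to show that tokenization runs a single time.

```python
from typing import List, Optional


class LazyTokensSketch:
    """Illustrative stand-in, not the library's AnnotatedDocument."""

    def __init__(self, text: Optional[str] = None):
        self.text = text
        self._tokens: Optional[List[str]] = None
        self.tokenize_calls = 0  # counts how often the tokenizer ran

    @property
    def tokenized_text(self) -> Optional[List[str]]:
        # Tokenize only if nothing is cached yet and text is available
        if self._tokens is None and self.text is not None:
            self.tokenize_calls += 1
            self._tokens = self.text.split()  # placeholder tokenizer
        return self._tokens


doc = LazyTokensSketch("Hello world")
tokens_first = doc.tokenized_text
tokens_again = doc.tokenized_text
print(tokens_first, doc.tokenize_calls)

empty = LazyTokensSketch()
print(empty.tokenized_text)  # None, because there is no text to tokenize
```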

Related

  • Extraction - Represents individual extractions
  • Document - Input document without annotations
  • extract() - Main function that returns AnnotatedDocument as output
  • CharInterval - Represents the character span of an extraction within the document text (set via Extraction.char_interval)
