The AnnotatedDocument class represents a document that has been annotated with extractions. It pairs the document text with a list of Extraction objects, each describing one entity or piece of information found in that text.
Constructor
AnnotatedDocument(
    *,
    document_id: str | None = None,
    extractions: list[Extraction] | None = None,
    text: str | None = None
)
- document_id - A unique identifier for the document. If not provided, a unique ID is auto-generated when the attribute is first accessed.
- extractions - A list of Extraction objects representing information extracted from the document.
- text - The raw text representation of the document.
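The bare `*` in the signature means every argument must be passed by keyword. The sketch below illustrates that behavior with a stand-in class; `AnnotatedDocumentSketch` and `accepts_positional` are hypothetical names introduced here for illustration and are not part of the library.

```python
from typing import Optional


class AnnotatedDocumentSketch:
    """Hypothetical stand-in that mirrors the keyword-only signature."""

    def __init__(self, *, document_id: Optional[str] = None,
                 extractions: Optional[list] = None,
                 text: Optional[str] = None):
        self.document_id = document_id
        self.extractions = extractions
        self.text = text


def accepts_positional() -> bool:
    """Return True if the constructor accepts a positional argument."""
    try:
        AnnotatedDocumentSketch("Some text")  # positional: rejected by `*`
    except TypeError:
        return False
    return True


# Keyword form works; all fields default to None when omitted.
doc = AnnotatedDocumentSketch(text="Some text")
```

Passing the text positionally raises `TypeError`, so call sites always name the field they are setting.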
Attributes
- document_id - A unique identifier for the document. Auto-generated in the format doc_XXXXXXXX if not explicitly set.
- extractions - A list of Extraction objects extracted from the document.
- text - The raw text representation of the document.
- tokenized_text - The tokenized representation of the document text. Computed automatically from text on first access (if text is available).
Example
Basic Usage
from langextract.core.data import AnnotatedDocument, Extraction

# Create an annotated document
annotated_doc = AnnotatedDocument(
    text="Apple Inc. was founded by Steve Jobs in Cupertino.",
    extractions=[
        Extraction(
            extraction_class="ORGANIZATION",
            extraction_text="Apple Inc."
        ),
        Extraction(
            extraction_class="PERSON",
            extraction_text="Steve Jobs"
        ),
        Extraction(
            extraction_class="LOCATION",
            extraction_text="Cupertino"
        )
    ]
)

print(f"Document ID: {annotated_doc.document_id}")
print(f"Found {len(annotated_doc.extractions)} extractions")
for extraction in annotated_doc.extractions:
    print(f"{extraction.extraction_class}: {extraction.extraction_text}")
With Custom Document ID
from langextract.core.data import AnnotatedDocument, Extraction

annotated_doc = AnnotatedDocument(
    document_id="tech_company_bio_001",
    text="Microsoft was founded in 1975.",
    extractions=[
        Extraction(extraction_class="ORG", extraction_text="Microsoft"),
        Extraction(extraction_class="DATE", extraction_text="1975")
    ]
)

print(annotated_doc.document_id)  # "tech_company_bio_001"
From Extraction Results
from langextract import Extractor
from langextract.core.data import Document

# Create and run extractor
extractor = Extractor(
    extraction_classes=["PERSON", "LOCATION", "DATE"]
)
doc = Document(text="Barack Obama was born in Hawaii on August 4, 1961.")
result = extractor.run(doc)  # Returns AnnotatedDocument

# Access the annotated document
print(f"Document: {result.text}")
print(f"Extractions: {len(result.extractions)}")
for extraction in result.extractions:
    print(f"  - {extraction.extraction_class}: {extraction.extraction_text}")
Filtering Extractions
from langextract.core.data import AnnotatedDocument, Extraction

annotated_doc = AnnotatedDocument(
    text="Visit Paris, London, or Tokyo for vacation.",
    extractions=[
        Extraction(extraction_class="LOCATION", extraction_text="Paris"),
        Extraction(extraction_class="LOCATION", extraction_text="London"),
        Extraction(extraction_class="LOCATION", extraction_text="Tokyo")
    ]
)

# Filter extractions by class
locations = [
    e for e in annotated_doc.extractions
    if e.extraction_class == "LOCATION"
]
print(f"Found {len(locations)} locations")
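When a document contains several extraction classes, filtering one class at a time gets repetitive. A possible alternative is to group all extractions by class in a single pass. The sketch below uses a stand-in dataclass so it runs without the library; `ExtractionSketch` and `group_by_class` are hypothetical names, and only the `extraction_class` and `extraction_text` fields shown above are assumed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ExtractionSketch:
    """Hypothetical stand-in with the two fields used in the examples."""
    extraction_class: str
    extraction_text: str


def group_by_class(extractions: Iterable[ExtractionSketch]) -> Dict[str, List[str]]:
    """Map each extraction class to the texts extracted under it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for e in extractions:
        grouped[e.extraction_class].append(e.extraction_text)
    return dict(grouped)


grouped = group_by_class([
    ExtractionSketch("LOCATION", "Paris"),
    ExtractionSketch("PERSON", "Steve Jobs"),
    ExtractionSketch("LOCATION", "Tokyo"),
])
```

Because the helper only reads two attributes, the same function should also work on a list of real Extraction objects, such as `annotated_doc.extractions`.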
Creating from Scratch
from langextract.core.data import AnnotatedDocument, Extraction, CharInterval

# Create an annotated document with detailed extraction information
annotated_doc = AnnotatedDocument(
    document_id="article_001",
    text="Amazon CEO Jeff Bezos announced new initiatives.",
    extractions=[
        Extraction(
            extraction_class="ORG",
            extraction_text="Amazon",
            char_interval=CharInterval(start_pos=0, end_pos=6),
            description="E-commerce company"
        ),
        Extraction(
            extraction_class="PERSON",
            extraction_text="Jeff Bezos",
            char_interval=CharInterval(start_pos=11, end_pos=21),
            attributes={"role": "CEO"}
        )
    ]
)
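Hand-written character intervals are easy to get wrong by one position. The sketch below shows one way to sanity-check them, assuming `end_pos` is exclusive (which matches the example above, where "Amazon" spans 0 to 6). `SpanSketch` and `misaligned` are hypothetical helpers that flatten the interval into the same object for brevity; they are not library classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanSketch:
    """Hypothetical stand-in: extraction text plus its character interval."""
    extraction_text: str
    start_pos: int
    end_pos: int  # assumed exclusive, as in Python slicing


def misaligned(text: str, spans: List[SpanSketch]) -> List[SpanSketch]:
    """Return the spans whose interval does not slice out their own text."""
    return [
        s for s in spans
        if text[s.start_pos:s.end_pos] != s.extraction_text
    ]


text = "Amazon CEO Jeff Bezos announced new initiatives."
spans = [
    SpanSketch("Amazon", 0, 6),
    SpanSketch("Jeff Bezos", 11, 21),
]
bad = misaligned(text, spans)  # empty when every interval matches
```

An empty result means each interval slices out exactly the text it claims to cover.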
Properties
document_id
The document_id property automatically generates a unique identifier if one wasn’t provided during initialization.
from langextract.core.data import AnnotatedDocument

doc = AnnotatedDocument(text="Some text")
print(doc.document_id)  # Auto-generated, e.g. "doc_a1b2c3d4"
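The generate-on-first-access behavior can be pictured as a property that fills in a private field the first time it is read. The sketch below is a minimal illustration, assuming the eight-character suffix comes from a random UUID; `LazyIdSketch` is a hypothetical stand-in, not the library's implementation.

```python
import uuid
from typing import Optional


class LazyIdSketch:
    """Hypothetical stand-in showing lazy ID generation."""

    def __init__(self, *, document_id: Optional[str] = None):
        self._document_id = document_id

    @property
    def document_id(self) -> str:
        # Generate once, then keep returning the same value.
        if self._document_id is None:
            # Assumption: 8 hex characters taken from a random UUID.
            self._document_id = f"doc_{uuid.uuid4().hex[:8]}"
        return self._document_id


auto = LazyIdSketch()
first_read = auto.document_id
second_read = auto.document_id  # same value as first_read
explicit = LazyIdSketch(document_id="tech_company_bio_001")
```

Under this design the ID is stable after the first read, and an explicitly supplied ID is never overwritten.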
tokenized_text
The tokenized_text property lazily computes and caches the tokenized representation of the document text when first accessed.
from langextract.core.data import AnnotatedDocument

doc = AnnotatedDocument(text="Hello world")
if doc.text:
    tokens = doc.tokenized_text  # Computed on first access, then cached
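The compute-once-and-cache pattern described above can be sketched as follows. This is an illustration only: `LazyTokensSketch` is a hypothetical stand-in, whitespace splitting replaces the library's real tokenizer, and the `tokenize_calls` counter exists purely to make the caching visible.

```python
from typing import List, Optional


class LazyTokensSketch:
    """Hypothetical stand-in showing lazy, cached tokenization."""

    def __init__(self, *, text: Optional[str] = None):
        self.text = text
        self._tokens: Optional[List[str]] = None
        self.tokenize_calls = 0  # instrumentation for this sketch only

    @property
    def tokenized_text(self) -> Optional[List[str]]:
        # Tokenize only when text exists and no cached result is stored.
        if self._tokens is None and self.text is not None:
            self.tokenize_calls += 1
            # Assumption: whitespace split stands in for the real tokenizer.
            self._tokens = self.text.split()
        return self._tokens


doc = LazyTokensSketch(text="Hello world")
tokens = doc.tokenized_text        # tokenizer runs here
tokens_again = doc.tokenized_text  # served from the cache
```

Deferring the work this way means documents whose tokens are never read pay no tokenization cost.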
Related
- Extraction - Represents individual extractions
- Document - Input document without annotations
- extract() - Main function that returns AnnotatedDocument as output
- CharInterval - Represents character positions in extractions (nested class)