The Document class represents an input document to be processed for information extraction. It contains the raw text and optional additional context.

Constructor

Document(
    text: str,
    *,
    document_id: str | None = None,
    additional_context: str | None = None
)
text (str, required)
    The raw text representation of the document to be processed.

document_id (str | None, optional)
    A unique identifier for the document. If not provided, a unique ID will be auto-generated when accessed.

additional_context (str | None, optional)
    Additional context to supplement prompt instructions and provide background information for extraction.

Attributes

text (str)
    The raw text representation of the document.

document_id (str)
    A unique identifier for the document. Auto-generated in the format doc_XXXXXXXX if not explicitly set.

additional_context (str | None)
    Additional context to supplement prompt instructions.

tokenized_text (TokenizedText)
    The tokenized representation of the document text. Automatically computed from text when first accessed.

Example

Basic Usage

from langextract.core.data import Document

# Create a document with just text
doc = Document(text="Apple Inc. was founded in 1976 in Cupertino, California.")

print(doc.text)
print(doc.document_id)  # Auto-generated: e.g., "doc_a3b4c5d6"

With Additional Context

from langextract.core.data import Document

# Create a document with additional context
doc = Document(
    text="The company announced Q3 earnings of $2.5B.",
    document_id="earnings_report_2024_q3",
    additional_context="This is a financial earnings report from a technology company."
)

print(doc.document_id)  # "earnings_report_2024_q3"
print(doc.additional_context)

Using with Extractor

from langextract import Extractor
from langextract.core.data import Document

# Create an extractor
extractor = Extractor(
    extraction_classes=["COMPANY", "DATE", "LOCATION"]
)

# Create a document
doc = Document(
    text="Microsoft was founded by Bill Gates and Paul Allen in Albuquerque in 1975.",
    additional_context="Biography of a major technology company"
)

# Run extraction
result = extractor.run(doc)
print(result.extractions)

Accessing Tokenized Text

from langextract.core.data import Document

doc = Document(text="Natural language processing is fascinating.")

# Access tokenized representation
tokenized = doc.tokenized_text
print(tokenized.tokens)  # List of tokens

Properties

document_id

The document_id property automatically generates a unique identifier if one wasn’t provided during initialization. This ensures every document has a unique ID for tracking and reference.
doc1 = Document(text="Some text")
print(doc1.document_id)  # Auto-generated: "doc_a1b2c3d4"

doc2 = Document(text="Some text", document_id="my_custom_id")
print(doc2.document_id)  # "my_custom_id"
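Auto-generated IDs use the doc_ prefix followed by a short hexadecimal suffix. If you need similar identifiers for your own bookkeeping, a minimal sketch using only the standard library could look like this (an illustration of the format, not langextract's actual implementation — the make_doc_id helper is hypothetical):

```python
import uuid

def make_doc_id() -> str:
    """Hypothetical helper mimicking the doc_XXXXXXXX format."""
    # uuid4().hex is 32 lowercase hex characters; take the first 8.
    return f"doc_{uuid.uuid4().hex[:8]}"

print(make_doc_id())  # e.g. "doc_3f9a1c2e"
```

Because the suffix is drawn from a random UUID, collisions between documents are vanishingly unlikely in practice.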

tokenized_text

The tokenized_text property lazily computes and caches the tokenized representation of the document text. It’s automatically created the first time it’s accessed.
doc = Document(text="Hello world")
tokens = doc.tokenized_text  # Computed on first access
tokens_again = doc.tokenized_text  # Returns cached value
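The compute-once, cache-forever behavior described above is a common Python pattern. As a rough sketch of how such a property behaves (illustrative only — LazyDocument is not langextract's Document, and str.split() stands in for real tokenization):

```python
from functools import cached_property

class LazyDocument:
    """Illustrative stand-in showing lazy, cached property access."""

    def __init__(self, text: str):
        self.text = text

    @cached_property
    def tokenized_text(self) -> list[str]:
        # Real tokenization would happen here; split() is a placeholder.
        return self.text.split()

doc = LazyDocument("Hello world")
first = doc.tokenized_text   # computed on first access
second = doc.tokenized_text  # served from the cache
assert first is second       # cached_property returns the same object
```

cached_property stores the computed value on the instance after the first access, so subsequent reads skip the computation entirely.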
