The Document class represents an input document to be processed for information extraction. It contains the raw text and optional additional context.

Constructor

Document(
    text: str,
    *,
    document_id: str | None = None,
    additional_context: str | None = None
)
text (str, required)
    The raw text representation of the document to be processed.

document_id (str | None, optional)
    A unique identifier for the document. If not provided, a unique ID will be auto-generated when accessed.

additional_context (str | None, optional)
    Additional context to supplement prompt instructions and provide background information for extraction.

Attributes

text (str)
    The raw text representation of the document.

document_id (str)
    A unique identifier for the document. Auto-generated in the format doc_XXXXXXXX if not explicitly set.

additional_context (str | None)
    Additional context to supplement prompt instructions.

tokenized_text (TokenizedText)
    The tokenized representation of the document text. Automatically computed from text when first accessed.

Example

Basic Usage

from langextract.core.data import Document

# Create a document with just text
doc = Document(text="Apple Inc. was founded in 1976 in Cupertino, California.")

print(doc.text)
print(doc.document_id)  # Auto-generated: e.g., "doc_a3b4c5d6"

With Additional Context

from langextract.core.data import Document

# Create a document with additional context
doc = Document(
    text="The company announced Q3 earnings of $2.5B.",
    document_id="earnings_report_2024_q3",
    additional_context="This is a financial earnings report from a technology company."
)

print(doc.document_id)  # "earnings_report_2024_q3"
print(doc.additional_context)

Using with Extractor

from langextract import Extractor
from langextract.core.data import Document

# Create an extractor
extractor = Extractor(
    extraction_classes=["COMPANY", "DATE", "LOCATION"]
)

# Create a document
doc = Document(
    text="Microsoft was founded by Bill Gates and Paul Allen in Albuquerque in 1975.",
    additional_context="Biography of a major technology company"
)

# Run extraction
result = extractor.run(doc)
print(result.extractions)

Accessing Tokenized Text

from langextract.core.data import Document

doc = Document(text="Natural language processing is fascinating.")

# Access tokenized representation
tokenized = doc.tokenized_text
print(tokenized.tokens)  # List of tokens

Properties

document_id

The document_id property automatically generates a unique identifier if one wasn’t provided during initialization. This ensures every document has a unique ID for tracking and reference.
doc1 = Document(text="Some text")
print(doc1.document_id)  # Auto-generated: "doc_a1b2c3d4"

doc2 = Document(text="Some text", document_id="my_custom_id")
print(doc2.document_id)  # "my_custom_id"
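Auto-generated IDs use the doc_ prefix followed by a short hexadecimal suffix. If you need similar identifiers for your own bookkeeping, a minimal sketch using only the standard library could look like this (an illustration of the format, not langextract's actual implementation — the make_doc_id helper is hypothetical):

```python
import uuid

def make_doc_id() -> str:
    """Hypothetical helper mimicking the doc_XXXXXXXX format."""
    # uuid4().hex is 32 lowercase hex characters; take the first 8.
    return f"doc_{uuid.uuid4().hex[:8]}"

print(make_doc_id())  # e.g. "doc_3f9a1c2e"
```

Because the suffix is drawn from a random UUID, collisions between documents are vanishingly unlikely in practice.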

tokenized_text

The tokenized_text property lazily computes and caches the tokenized representation of the document text. It’s automatically created the first time it’s accessed.
doc = Document(text="Hello world")
tokens = doc.tokenized_text  # Computed on first access
tokens_again = doc.tokenized_text  # Returns cached value
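The compute-once, cache-forever behavior described above is a common Python pattern. As a rough sketch of how such a property behaves (illustrative only — LazyDocument is not langextract's Document, and str.split() stands in for real tokenization):

```python
from functools import cached_property

class LazyDocument:
    """Illustrative stand-in showing lazy, cached property access."""

    def __init__(self, text: str):
        self.text = text

    @cached_property
    def tokenized_text(self) -> list[str]:
        # Real tokenization would happen here; split() is a placeholder.
        return self.text.split()

doc = LazyDocument("Hello world")
first = doc.tokenized_text   # computed on first access
second = doc.tokenized_text  # served from the cache
assert first is second       # cached_property returns the same object
```

cached_property stores the computed value on the instance after the first access, so subsequent reads skip the computation entirely.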
