The Document class represents an input document to be processed for information extraction. It contains the raw text and optional additional context.
Constructor
The raw text representation of the document to be processed
A unique identifier for the document. If not provided, a unique ID will be auto-generated when accessed.
Additional context to supplement prompt instructions and provide background information for extraction
Attributes
The raw text representation of the document
A unique identifier for the document. Auto-generated in the format doc_XXXXXXXX if not explicitly set.
Additional context to supplement prompt instructions
The tokenized representation of the document text. Automatically computed from text when first accessed.
Example
Basic Usage
With Additional Context
Using with Extractor
Accessing Tokenized Text
Properties
document_id
The document_id property automatically generates a unique identifier if one wasn’t provided during initialization. This ensures every document has a unique ID for tracking and reference.
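The lazy ID behavior described here can be illustrated with a minimal sketch. LazyIdDocument is a hypothetical stand-in written for this example, not the library's Document class, and the use of eight hexadecimal characters from a UUID to fill the doc_XXXXXXXX pattern is an assumption.

```python
import uuid


class LazyIdDocument:
    """Hypothetical stand-in that mimics the described document_id behavior."""

    def __init__(self, text, *, document_id=None):
        self.text = text
        # None means "no ID supplied yet; generate one on first access".
        self._document_id = document_id

    @property
    def document_id(self):
        # Generate the ID once, then return the same value on every later read.
        if self._document_id is None:
            # Assumption: the X's in doc_XXXXXXXX are 8 hex characters.
            self._document_id = f"doc_{uuid.uuid4().hex[:8]}"
        return self._document_id


auto_doc = LazyIdDocument("Patient reports a mild headache.")
explicit_doc = LazyIdDocument("Follow-up visit notes.", document_id="report-001")

print(auto_doc.document_id)      # a generated ID such as doc_ followed by 8 characters
print(explicit_doc.document_id)  # the ID passed to the constructor
```

Generating the ID on first access rather than in the constructor means documents that never need an ID never pay for one, while repeated reads still return a stable value.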
tokenized_text
The tokenized_text property lazily computes the tokenized representation of the document text the first time it’s accessed, then caches the result for later reads.
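A compute-once, cache-afterwards property of this kind can be sketched as follows. LazyTokensDocument is a hypothetical class for illustration only, and whitespace splitting stands in for whatever tokenizer the library actually uses; the tokenize_calls counter exists only to make the caching visible.

```python
from functools import cached_property


class LazyTokensDocument:
    """Hypothetical stand-in that mimics lazy, cached tokenization."""

    def __init__(self, text):
        self.text = text
        self.tokenize_calls = 0  # counts how often tokenization really runs

    @cached_property
    def tokenized_text(self):
        # Runs only on first access; the result is then stored on the instance.
        self.tokenize_calls += 1
        # Assumption: a plain whitespace split replaces the real tokenizer.
        return self.text.split()


tokens_doc = LazyTokensDocument("Aspirin 100 mg daily")
first_read = tokens_doc.tokenized_text
second_read = tokens_doc.tokenized_text

print(first_read)
print(tokens_doc.tokenize_calls)
```

Because the cached value is stored on the instance, the second read returns the same list object without re-running the tokenizer.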
Related Classes
- AnnotatedDocument - Document with extractions
- ExampleData - Training examples for few-shot learning
- extract() - Main function that processes documents to extract information