Function Signature

def load_annotated_documents_jsonl(
    jsonl_path: pathlib.Path,
    show_progress: bool = True,
) -> Iterator[data.AnnotatedDocument]

Description

Loads annotated documents from a JSON Lines file. Each line in the file is parsed as a separate JSON object and converted to an AnnotatedDocument. This function yields documents incrementally, making it memory-efficient for large datasets.
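For reference, a JSON Lines file stores exactly one JSON object per line. The following standard-library sketch shows the general format (the field names here are illustrative, not necessarily the exact schema save_annotated_documents() writes):

```python
import json
import pathlib
import tempfile

# Write a small JSONL file: one JSON object per line.
tmp = pathlib.Path(tempfile.mkdtemp()) / "docs.jsonl"
records = [
    {"document_id": "doc-1", "text": "Alice met Bob."},
    {"document_id": "doc-2", "text": "Bob met Carol."},
]
with tmp.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back incrementally, skipping empty lines.
loaded = []
with tmp.open("r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            loaded.append(json.loads(line))
```

Parsing line by line is what makes incremental loading possible: only one record needs to be in memory at a time.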

Parameters

jsonl_path
pathlib.Path
required
The file path to the JSON Lines file containing annotated documents. This should be a file previously saved using save_annotated_documents() or following the same format.
show_progress
bool
default: True
Whether to show a progress bar during the loading operation. The progress bar tracks bytes read and provides an estimate of completion time.

Returns

documents
Iterator[AnnotatedDocument]
An iterator that yields AnnotatedDocument objects. Each document contains the original text, document ID, and extracted annotations.
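Because the return value is an iterator, it can be consumed only once; a second pass over the same object yields nothing. A quick illustration with a plain generator:

```python
def numbers():
    # A generator, like the documents iterator, is single-pass.
    yield from (1, 2, 3)

it = numbers()
first_pass = list(it)   # consumes the iterator: [1, 2, 3]
second_pass = list(it)  # already exhausted: []
```

Call load_annotated_documents_jsonl() again if you need to iterate over the file a second time.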

Exceptions

  • IOError: If the file does not exist or cannot be read.
  • json.JSONDecodeError: If a line in the file contains invalid JSON.
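Note that because loading is lazy, a json.JSONDecodeError surfaces while iterating, not at the call site. The sketch below shows tolerant line-by-line parsing with the standard library alone (independent of langextract), counting bad lines instead of aborting:

```python
import json
import pathlib
import tempfile

# A file with one valid line and one malformed line.
path = pathlib.Path(tempfile.mkdtemp()) / "broken.jsonl"
path.write_text('{"document_id": "doc-1"}\nnot valid json\n', encoding="utf-8")

docs, errors = [], 0
with path.open("r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip empty lines
        try:
            docs.append(json.loads(line))
        except json.JSONDecodeError:
            errors += 1  # log and skip, or re-raise as appropriate
```

Wrap the for-loop over the documents iterator in a try/except the same way if you want partial results from a corrupted file.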

Usage Examples

from langextract import io
from pathlib import Path

# Load documents from a JSONL file
jsonl_file = Path("test_output/data.jsonl")
documents = io.load_annotated_documents_jsonl(jsonl_file)

# Process documents incrementally
for doc in documents:
    print(f"Document ID: {doc.document_id}")
    print(f"Text: {doc.text}")
    print(f"Annotations: {doc.annotations}")
    print("-" * 50)

# Load without progress bar
documents = io.load_annotated_documents_jsonl(
    Path("results/annotations.jsonl"),
    show_progress=False
)

# Convert to list (loads all into memory)
all_docs = list(io.load_annotated_documents_jsonl(jsonl_file))
print(f"Loaded {len(all_docs)} documents")

# Process in batches
def process_batch(docs, batch_size=100):
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

for batch in process_batch(io.load_annotated_documents_jsonl(jsonl_file)):
    # Process a batch of up to 100 documents (the final batch may be smaller)
    print(f"Processing batch of {len(batch)} documents")

Notes

  • The function reads files with UTF-8 encoding.
  • Empty lines in the file are automatically skipped.
  • Progress tracking is based on file size (bytes read), providing accurate progress estimates.
  • The function is memory-efficient as it yields documents one at a time rather than loading all into memory.
  • Progress information includes the total number of documents loaded and the file path.
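The memory-efficiency note above can be illustrated with a minimal generator-based JSONL reader. This is a sketch of the general pattern using only the standard library, not langextract's actual implementation:

```python
import json
import pathlib
import tempfile
from typing import Iterator

def iter_jsonl(path: pathlib.Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, one record in memory at a time."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():          # skip empty lines
                yield json.loads(line)

# Demo on a temporary file with an empty line in the middle.
p = pathlib.Path(tempfile.mkdtemp()) / "demo.jsonl"
p.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
ids = [obj["id"] for obj in iter_jsonl(p)]
```

Since the function body only holds the current line, peak memory stays flat regardless of file size.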