Function Signature

def load_annotated_documents_jsonl(
    jsonl_path: pathlib.Path,
    show_progress: bool = True,
) -> Iterator[data.AnnotatedDocument]

Description

Loads annotated documents from a JSON Lines file. Each line in the file is parsed as a separate JSON object and converted to an AnnotatedDocument. This function yields documents incrementally, making it memory-efficient for large datasets.
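For reference, a JSON Lines file stores exactly one JSON object per line. The following standard-library sketch shows the general format (the field names here are illustrative, not necessarily the exact schema save_annotated_documents() writes):

```python
import json
import pathlib
import tempfile

# Write a small JSONL file: one JSON object per line.
tmp = pathlib.Path(tempfile.mkdtemp()) / "docs.jsonl"
records = [
    {"document_id": "doc-1", "text": "Alice met Bob."},
    {"document_id": "doc-2", "text": "Bob met Carol."},
]
with tmp.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back incrementally, skipping empty lines.
loaded = []
with tmp.open("r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            loaded.append(json.loads(line))
```

Parsing line by line is what makes incremental loading possible: only one record needs to be in memory at a time.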

Parameters

jsonl_path
pathlib.Path
required
The file path to the JSON Lines file containing annotated documents. This should be a file previously saved using save_annotated_documents() or following the same format.
show_progress
bool
default: True
Whether to show a progress bar during the loading operation. The progress bar tracks bytes read and provides an estimate of completion time.

Returns

documents
Iterator[AnnotatedDocument]
An iterator that yields AnnotatedDocument objects. Each document contains the original text, document ID, and extracted annotations.
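Because the return value is an iterator, it can be consumed only once; a second pass over the same object yields nothing. A quick illustration with a plain generator:

```python
def numbers():
    # A generator, like the documents iterator, is single-pass.
    yield from (1, 2, 3)

it = numbers()
first_pass = list(it)   # consumes the iterator: [1, 2, 3]
second_pass = list(it)  # already exhausted: []
```

Call load_annotated_documents_jsonl() again if you need to iterate over the file a second time.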

Exceptions

  • IOError: If the file does not exist or cannot be read.
  • json.JSONDecodeError: If a line in the file contains invalid JSON.
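Note that because loading is lazy, a json.JSONDecodeError surfaces while iterating, not at the call site. The sketch below shows tolerant line-by-line parsing with the standard library alone (independent of langextract), counting bad lines instead of aborting:

```python
import json
import pathlib
import tempfile

# A file with one valid line and one malformed line.
path = pathlib.Path(tempfile.mkdtemp()) / "broken.jsonl"
path.write_text('{"document_id": "doc-1"}\nnot valid json\n', encoding="utf-8")

docs, errors = [], 0
with path.open("r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip empty lines
        try:
            docs.append(json.loads(line))
        except json.JSONDecodeError:
            errors += 1  # log and skip, or re-raise as appropriate
```

Wrap the for-loop over the documents iterator in a try/except the same way if you want partial results from a corrupted file.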

Usage Examples

from langextract import io
from pathlib import Path

# Load documents from a JSONL file
jsonl_file = Path("test_output/data.jsonl")
documents = io.load_annotated_documents_jsonl(jsonl_file)

# Process documents incrementally
for doc in documents:
    print(f"Document ID: {doc.document_id}")
    print(f"Text: {doc.text}")
    print(f"Annotations: {doc.annotations}")
    print("-" * 50)

# Load without progress bar
documents = io.load_annotated_documents_jsonl(
    Path("results/annotations.jsonl"),
    show_progress=False
)

# Convert to list (loads all into memory)
all_docs = list(io.load_annotated_documents_jsonl(jsonl_file))
print(f"Loaded {len(all_docs)} documents")

# Process in batches
def process_batch(docs, batch_size=100):
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

for batch in process_batch(io.load_annotated_documents_jsonl(jsonl_file)):
    # Process a batch of up to 100 documents (the final batch may be smaller)
    print(f"Processing batch of {len(batch)} documents")

Notes

  • The function reads files with UTF-8 encoding.
  • Empty lines in the file are automatically skipped.
  • Progress tracking is based on file size (bytes read), providing accurate progress estimates.
  • The function is memory-efficient as it yields documents one at a time rather than loading all into memory.
  • Progress information includes the total number of documents loaded and the file path.
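The memory-efficiency note above can be illustrated with a minimal generator-based JSONL reader. This is a sketch of the general pattern using only the standard library, not langextract's actual implementation:

```python
import json
import pathlib
import tempfile
from typing import Iterator

def iter_jsonl(path: pathlib.Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, one record in memory at a time."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():          # skip empty lines
                yield json.loads(line)

# Demo on a temporary file with an empty line in the middle.
p = pathlib.Path(tempfile.mkdtemp()) / "demo.jsonl"
p.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
ids = [obj["id"] for obj in iter_jsonl(p)]
```

Since the function body only holds the current line, peak memory stays flat regardless of file size.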