Overview
This example shows how to:

- Process large documents (147,843 characters / ~44,000 tokens)
- Use multiple extraction passes for improved recall
- Implement parallel processing for speed
- Optimize chunk sizes for better accuracy
- Generate interactive visualizations at scale
Implementation
Step 1: Define Comprehensive Prompt and Examples
For large, complex inputs, more detailed examples are recommended to make extraction more robust.

Step 2: Process Document with Optimized Settings
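As a rough sketch of what a detailed prompt and few-shot examples can look like, the snippet below uses plain Python dicts (the actual library represents examples with its own example and extraction classes, so the field names here are illustrative). In a real run these would be passed to the library's extract call along with the `extraction_passes` setting described later in this page.

```python
# Hypothetical sketch: a detailed prompt plus a few-shot example for a long
# literary text. Field names (extraction_class, extraction_text, attributes)
# are illustrative stand-ins for the library's example classes.

prompt = (
    "Extract characters, emotions, and relationships from the text. "
    "Use exact words from the text for extraction_text. "
    "Provide meaningful attributes for every entity."
)

examples = [
    {
        "text": "ROMEO. But soft! What light through yonder window breaks?",
        "extractions": [
            {"extraction_class": "character",
             "extraction_text": "ROMEO",
             "attributes": {"emotional_state": "wonder"}},
            {"extraction_class": "emotion",
             "extraction_text": "But soft!",
             "attributes": {"feeling": "gentle awe"}},
        ],
    },
]

print(len(examples[0]["extractions"]))
```

For large inputs, adding several such examples, each with rich attributes, gives the model more signal about the desired output schema than a prompt alone.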
Step 3: Save and Visualize Results
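Saving results as JSONL, one JSON object per line, keeps the output portable and streamable at scale. A minimal sketch (the file name and field names are illustrative, not the library's exact output schema):

```python
import json
import pathlib
import tempfile

# Sketch: write extraction results as JSONL (one JSON object per line).
results = [
    {"extraction_class": "character", "extraction_text": "ROMEO"},
    {"extraction_class": "emotion", "extraction_text": "wonder"},
]

path = pathlib.Path(tempfile.mkdtemp()) / "extraction_results.jsonl"
with path.open("w") as f:
    for record in results:
        f.write(json.dumps(record) + "\n")

# Each line parses independently, so large files can be read as a stream.
loaded = [json.loads(line) for line in path.open()]
print(loaded == results)  # → True
```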
Step 4: Analyze Extraction Results
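A simple way to analyze the results is to tally extractions by class, which helps sanity-check recall across the document. A sketch, again using illustrative field names:

```python
from collections import Counter

# Hypothetical analysis step: count extractions per class.
extractions = [
    {"extraction_class": "character", "extraction_text": "ROMEO"},
    {"extraction_class": "character", "extraction_text": "JULIET"},
    {"extraction_class": "emotion", "extraction_text": "wonder"},
]

counts = Counter(e["extraction_class"] for e in extractions)
for cls, n in counts.most_common():
    print(f"{cls}: {n}")
# → character: 2
#   emotion: 1
```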
Sample Output
Key Benefits for Long Documents
- Sequential Extraction Passes
- Portable JSONL Format
- Smart Chunking
- Enhanced Accuracy
- Interactive Visualization
- Schema-Guided Extraction
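To illustrate the smart-chunking idea above, here is a simplified stand-in for a chunker (not the library's actual implementation): split a long document into chunks under a character budget while preferring paragraph boundaries, so entities are less likely to be cut mid-sentence.

```python
# Simplified chunking sketch: pack paragraphs into chunks under max_chars,
# breaking only at paragraph boundaries ("\n\n").
def chunk_text(text, max_chars):
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Ten ~52-character paragraphs packed under a 120-character budget.
doc = "\n\n".join(f"Paragraph {i} " + "x" * 40 for i in range(10))
chunks = chunk_text(doc, max_chars=120)
print(len(chunks), max(len(c) for c in chunks))  # → 5 106
```

Smaller chunk budgets trade throughput for accuracy: each chunk fits comfortably in the model's effective context, at the cost of more calls (which parallel processing then recovers).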
Multiple extraction passes improve recall by performing independent extractions and merging non-overlapping results. Each pass uses identical parameters and processing; the passes are simply independent runs of the same extraction task.

How it works: each pass processes the full text independently using the same prompt and examples. Results are then merged using a "first-pass wins" strategy for overlapping entities, while unique, non-overlapping entities from later passes are added. This captures entities that any single run might miss due to the stochastic nature of language model generation. The number of passes is controlled by the `extraction_passes` parameter (e.g., `extraction_passes=3`).

Models like Gemini 1.5 Pro show strong performance on many benchmarks, but needle-in-a-haystack tests across million-token contexts indicate that performance can vary in multi-fact retrieval scenarios. This motivates LangExtract's smaller-context-window approach, which maintains consistently high quality across entire documents by avoiding the complexity and potential degradation of massive single-context processing.
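The "first-pass wins" merge described above can be sketched in a few lines. This is a hypothetical helper, not the library's actual implementation: extractions are modeled as `(start, end, text)` character spans, and two spans overlap when their intervals intersect.

```python
# Sketch of "first-pass wins" merging across extraction passes.
def merge_passes(passes):
    """Keep every span from the earliest pass it appears in; add spans from
    later passes only when they don't overlap anything already merged."""
    merged = []
    for extractions in passes:
        for start, end, text in extractions:
            overlaps = any(s < end and start < e for s, e, _ in merged)
            if not overlaps:
                merged.append((start, end, text))
    return sorted(merged)

pass_1 = [(0, 5, "Romeo"), (10, 16, "Juliet")]
pass_2 = [(0, 5, "Romeo"), (24, 30, "Verona")]  # finds one new entity
pass_3 = [(2, 7, "meo an")]                     # overlaps pass 1: dropped

print(merge_passes([pass_1, pass_2, pass_3]))
# → [(0, 5, 'Romeo'), (10, 16, 'Juliet'), (24, 30, 'Verona')]
```

Because each pass is independent, recall grows with the number of passes while the first-pass-wins rule keeps overlapping duplicates out of the merged result.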