What is LangExtract?
LangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.
Why LangExtract?
Precise Source Grounding
Maps every extraction to its exact location in the source text, enabling visual highlighting for easy traceability and verification.
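To make the idea of source grounding concrete, here is a minimal sketch that attaches character offsets to an extracted string. It is not LangExtract's alignment logic: `GroundedExtraction` and `ground` are hypothetical names, and the sketch assumes a plain exact-match search, which is far simpler than what a real aligner needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """Hypothetical record: an extracted string plus its span in the source."""
    extraction_class: str
    extraction_text: str
    start: Optional[int]  # inclusive character offset, None if not found
    end: Optional[int]    # exclusive character offset, None if not found


def ground(source: str, extraction_class: str, extraction_text: str) -> GroundedExtraction:
    """Locate extraction_text in source using an exact, case-sensitive match."""
    pos = source.find(extraction_text)
    if pos == -1:
        # The text does not appear verbatim, so it cannot be grounded.
        return GroundedExtraction(extraction_class, extraction_text, None, None)
    return GroundedExtraction(
        extraction_class, extraction_text, pos, pos + len(extraction_text)
    )


note = "Patient was given 250 mg of amoxicillin twice daily."
hit = ground(note, "medication", "amoxicillin")
miss = ground(note, "medication", "ibuprofen")

# A grounded span can be sliced straight back out of the source for review.
print(note[hit.start:hit.end])
print(miss.start is None)
```

An ungrounded result (`start` is `None`) is a useful signal in its own right: the value did not come verbatim from the document and deserves a closer look.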
Reliable Structured Outputs
Enforces a consistent output schema based on your few-shot examples, leveraging controlled generation in supported models like Gemini.
Optimized for Long Documents
Overcomes the “needle-in-a-haystack” challenge through text chunking, parallel processing, and multiple passes for higher recall.
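The combination of chunking, parallel processing, and multiple passes can be sketched in plain Python. This is an illustration under stated assumptions, not LangExtract's implementation: `chunk_text`, `extract_long`, and `find_needles` are hypothetical names, the overlapping fixed-size windows are a simplification chosen for this sketch, and a regular expression stands in for the LLM call.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, text)


def chunk_text(text: str, size: int, overlap: int) -> List[Tuple[int, str]]:
    """Split text into overlapping windows, keeping each window's offset."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [(i, text[i:i + size]) for i in range(0, max(len(text), 1), step)]


def extract_long(
    text: str,
    extractor: Callable[[str], List[Span]],
    size: int = 40,
    overlap: int = 10,
    passes: int = 2,
    workers: int = 4,
) -> List[Span]:
    """Run extractor over every chunk in parallel and merge all passes."""
    chunks = chunk_text(text, size, overlap)
    found: Set[Span] = set()  # the set removes duplicates across chunks and passes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _pass in range(passes):
            per_chunk = pool.map(lambda c: extractor(c[1]), chunks)
            for (offset, _chunk), spans in zip(chunks, per_chunk):
                for start, end, value in spans:
                    # Shift chunk-local offsets back to whole-document offsets.
                    found.add((offset + start, offset + end, value))
    return sorted(found)


def find_needles(chunk: str) -> List[Span]:
    """Stand-in for an LLM call: returns chunk-local spans of 'needle'."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"needle", chunk)]


document = "hay " * 30 + "needle " + "hay " * 30
results = extract_long(document, find_needles)
print(results)
```

With a deterministic stand-in, extra passes find nothing new. The merge step matters when the extractor is a sampling LLM whose output can differ from run to run, so that the union of passes recovers items a single pass missed.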
Interactive Visualization
Instantly generates a self-contained, interactive HTML file to visualize and review thousands of extracted entities in context.
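As a rough illustration of what a self-contained highlight file involves, the sketch below wraps grounded spans in `<mark>` tags and returns a single HTML string with no external assets. It is a static, simplified stand-in for the interactive output described above; `render_html` is a hypothetical helper, not part of LangExtract.

```python
import html
from typing import List, Tuple


def render_html(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span in <mark>; spans must not overlap."""
    parts: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        # Escape the untouched text so characters like '&' render literally.
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">{html.escape(text[start:end])}</mark>'
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "<!DOCTYPE html><html><body><p>" + "".join(parts) + "</p></body></html>"


page = render_html("Take aspirin & rest.", [(5, 12, "medication")])
print(page)
```

Writing `page` to a `.html` file gives a document that opens in any browser, with the extraction label shown as a tooltip on the highlighted text.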
Flexible LLM Support
Supports cloud-based LLMs such as the Google Gemini family, as well as local open-source models via the built-in Ollama interface.
Adaptable to Any Domain
Define extraction tasks for any domain using just a few examples. No model fine-tuning required.
Getting Started
Installation
Install LangExtract via pip, from source, or with Docker
Quick Start
Extract your first structured data in minutes
API Reference
Explore the complete API documentation
Examples
View real-world extraction examples
Key Capabilities
Leverages LLM World Knowledge
Use precise prompt wording and few-shot examples to control how much the extraction task draws on the LLM's own knowledge. The accuracy of any inferred information, and how closely it follows the task specification, depends on the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.
Custom Model Providers
LangExtract supports custom LLM providers via a lightweight plugin system:
- Add new model support independently of the core library
- Distribute your provider as a separate Python package
- Keep custom dependencies isolated
- Override or extend built-in providers via priority-based resolution
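Priority-based resolution, as described in the list above, can be pictured with a small registry. The sketch below is a generic illustration and does not reproduce LangExtract's plugin API: `register`, `resolve`, and both provider classes are hypothetical, and it assumes that a higher priority number wins.

```python
import re
from typing import List, Pattern, Tuple

# Each entry: (priority, compiled model-ID pattern, provider class).
_REGISTRY: List[Tuple[int, Pattern[str], type]] = []


def register(pattern: str, priority: int = 0):
    """Class decorator: claim model IDs matching pattern at the given priority."""
    def decorator(cls: type) -> type:
        _REGISTRY.append((priority, re.compile(pattern), cls))
        return cls
    return decorator


def resolve(model_id: str) -> type:
    """Return the highest-priority provider whose pattern matches model_id."""
    matches = [(prio, cls) for prio, pat, cls in _REGISTRY if pat.search(model_id)]
    if not matches:
        raise LookupError(f"no provider registered for {model_id!r}")
    return max(matches, key=lambda m: m[0])[1]


@register(r"^gemini", priority=0)
class BuiltinGeminiProvider:
    """Stand-in for a provider shipped with the core library."""


@register(r"^gemini-custom", priority=10)
class MyGeminiProvider:
    """Stand-in for a provider distributed as a separate package."""


print(resolve("gemini-2.5-flash").__name__)
print(resolve("gemini-custom-1").__name__)
```

Both patterns match `gemini-custom-1`, so the higher priority decides the outcome. That is how a separately distributed plugin can override a built-in provider without modifying the core library.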
Example Use Cases
- Healthcare: Extract medications, diagnoses, and treatment plans from clinical notes
- Legal: Identify key clauses, entities, and obligations from contracts
- Research: Structure information from academic papers and reports
- Literature: Analyze characters, emotions, and relationships in texts
- Business: Extract structured data from unstructured documents and emails
Next Steps
Ready to start extracting? Head to the Installation guide to set up LangExtract, or jump straight to the Quick Start to see it in action.
This is not an officially supported Google product. LangExtract is licensed under Apache 2.0.