
What is LangExtract?

LangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.

Why LangExtract?

Precise Source Grounding

Maps every extraction to its exact location in the source text, enabling visual highlighting for easy traceability and verification.
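
As a rough illustration of what source grounding means, the sketch below locates an extracted string in the source text and records its character offsets. It is a minimal stand-in written for this page, not LangExtract's own alignment logic: `GroundedSpan` and `ground_extraction` are hypothetical names, and only exact substring matches are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """Hypothetical record tying an extraction to its source offsets."""
    text: str
    start: int  # inclusive character offset in the source
    end: int    # exclusive character offset in the source


def ground_extraction(source: str, extraction_text: str,
                      search_from: int = 0) -> Optional[GroundedSpan]:
    """Find extraction_text in source by exact substring match."""
    start = source.find(extraction_text, search_from)
    if start == -1:
        # Not present verbatim, so this simple method cannot ground it.
        return None
    end = start + len(extraction_text)
    return GroundedSpan(text=source[start:end], start=start, end=end)


note = "Patient was given 250 mg of amoxicillin twice daily."
span = ground_extraction(note, "amoxicillin")
if span is not None:
    # Slicing the source with the stored offsets recovers the extraction.
    print(span.start, span.end, note[span.start:span.end])
```

Storing offsets rather than only the extracted string is what makes highlighting possible: a viewer can wrap `note[span.start:span.end]` in a marker without searching the text again.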

Reliable Structured Outputs

Enforces a consistent output schema based on your few-shot examples, leveraging controlled generation in supported models like Gemini.
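
To make the idea of an example-driven schema concrete, here is a small sketch that derives the allowed extraction classes and attribute keys from a few examples and then checks a candidate extraction against them. This is an assumption-laden illustration: the dictionary layout and the `schema_from_examples` and `conforms` helpers are hypothetical, and the check runs after generation, whereas controlled generation constrains the model's output while it is being produced.

```python
from typing import Dict, List, Set

# Hypothetical example layout: {"class": str, "text": str, "attributes": dict}
Example = Dict[str, object]


def schema_from_examples(examples: List[Example]) -> Dict[str, Set[str]]:
    """Collect, per extraction class, the attribute keys seen in the examples."""
    schema: Dict[str, Set[str]] = {}
    for example in examples:
        keys = schema.setdefault(str(example["class"]), set())
        keys.update(example["attributes"])  # adds the attribute names
    return schema


def conforms(extraction: Example, schema: Dict[str, Set[str]]) -> bool:
    """True if the class is known and it uses only attribute keys seen before."""
    allowed = schema.get(str(extraction["class"]))
    return allowed is not None and set(extraction["attributes"]) <= allowed


few_shot = [
    {"class": "medication", "text": "amoxicillin",
     "attributes": {"dose": "250 mg", "frequency": "twice daily"}},
    {"class": "diagnosis", "text": "sinusitis", "attributes": {"severity": "mild"}},
]
schema = schema_from_examples(few_shot)

ok = conforms({"class": "medication", "text": "aspirin",
               "attributes": {"dose": "81 mg"}}, schema)
unknown = conforms({"class": "hobby", "text": "chess", "attributes": {}}, schema)
```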

Optimized for Long Documents

Overcomes the “needle-in-a-haystack” challenge through text chunking, parallel processing, and multiple passes for higher recall.
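
The three techniques named above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangExtract's implementation: `chunk_words`, `extract_long`, and the regex-based `find_doses` are hypothetical names, and `find_doses` stands in for an LLM call. Chunks overlap by one word so that a two-word match is never lost at a chunk boundary, and the results of all passes are merged into a set keyed by source offsets.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

Hit = Tuple[int, int, str]  # (start offset, end offset, matched text)


def chunk_words(text: str, max_words: int, overlap_words: int) -> List[Tuple[int, str]]:
    """Split text into overlapping (offset, chunk) windows of whole words."""
    if max_words <= overlap_words:
        raise ValueError("max_words must be larger than overlap_words")
    words = list(re.finditer(r"\S+", text))
    chunks = []
    for i in range(0, len(words), max_words - overlap_words):
        window = words[i:i + max_words]
        start, end = window[0].start(), window[-1].end()
        chunks.append((start, text[start:end]))
        if i + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def find_doses(chunk: str) -> List[Hit]:
    """Stand-in extractor: a regex instead of an LLM call."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\d+ mg", chunk)]


def extract_long(text: str, extractor: Callable[[str], List[Hit]],
                 max_words: int = 4, overlap_words: int = 1,
                 passes: int = 2, max_workers: int = 4) -> List[Hit]:
    chunks = chunk_words(text, max_words, overlap_words)
    found: Set[Hit] = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _pass in range(passes):
            # Every pass processes all chunks in parallel.
            results = pool.map(extractor, [chunk for _offset, chunk in chunks])
            for (offset, _chunk), hits in zip(chunks, results):
                for start, end, value in hits:
                    # Shift chunk-local offsets back to document offsets.
                    found.add((offset + start, offset + end, value))
    return sorted(found)


order = ("Start 250 mg amoxicillin daily, then 5 mg prednisone, "
         "and 81 mg aspirin at night.")
doses = extract_long(order, find_doses)
```

With a deterministic extractor like this one, a second pass finds nothing new. With a sampling LLM, passes can return different extractions, and merging them is what raises recall.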

Interactive Visualization

Instantly generates a self-contained, interactive HTML file to visualize and review thousands of extracted entities in context.

Flexible LLM Support

Supports cloud-based LLMs such as the Google Gemini family, as well as local open-source models via the built-in Ollama interface.

Adaptable to Any Domain

Define extraction tasks for any domain using just a few examples. No model fine-tuning required.

Getting Started

Installation

Install LangExtract via pip, from source, or with Docker

Quick Start

Extract your first structured data in minutes

API Reference

Explore the complete API documentation

Examples

View real-world extraction examples

Key Capabilities

Leverages LLM World Knowledge

Use precise prompt wording and few-shot examples to control how much the extraction task draws on the LLM's own knowledge. The accuracy of any inferred information, and how closely it follows the task specification, depends on the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.

Custom Model Providers

LangExtract supports custom LLM providers via a lightweight plugin system:
  • Add new model support independently of the core library
  • Distribute your provider as a separate Python package
  • Keep custom dependencies isolated
  • Override or extend built-in providers via priority-based resolution
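
Priority-based resolution can be pictured with a small registry like the one below. This is a generic sketch of the pattern, not LangExtract's plugin API: the `register` decorator, the `resolve` function, the provider classes, and the tie-breaking rule are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical registry: (compiled model-ID pattern, priority, provider class)
_REGISTRY: List[Tuple[re.Pattern, int, type]] = []


def register(pattern: str, priority: int = 0):
    """Class decorator that registers a provider for matching model IDs."""
    def decorator(cls: type) -> type:
        _REGISTRY.append((re.compile(pattern), priority, cls))
        return cls
    return decorator


def resolve(model_id: str) -> type:
    """Return the matching provider class with the highest priority."""
    matches = [
        (priority, index, cls)
        for index, (pattern, priority, cls) in enumerate(_REGISTRY)
        if pattern.search(model_id)
    ]
    if not matches:
        raise LookupError(f"no provider registered for {model_id!r}")
    # Assumed tie-break: among equal priorities, the latest registration wins.
    return max(matches, key=lambda m: (m[0], m[1]))[2]


@register(r"^gemini", priority=0)
class BuiltinGeminiProvider:
    """Stands in for a provider shipped with the core library."""


@register(r"^llama", priority=0)
class BuiltinLocalProvider:
    """Stands in for a built-in local-model provider."""


@register(r"^gemini", priority=10)
class CustomGeminiProvider:
    """Stands in for a provider from a separately installed package."""


chosen = resolve("gemini-2.5-flash")
```

Because the custom provider registers the same pattern with a higher priority, it overrides the built-in one without any change to the core registry, which is the property the last bullet describes.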

Example Use Cases

  • Healthcare: Extract medications, diagnoses, and treatment plans from clinical notes
  • Legal: Identify key clauses, entities, and obligations from contracts
  • Research: Structure information from academic papers and reports
  • Literature: Analyze characters, emotions, and relationships in texts
  • Business: Extract structured data from unstructured documents and emails

Next Steps

Ready to start extracting? Head to the Installation guide to set up LangExtract, or jump straight to the Quick Start to see it in action.
This is not an officially supported Google product. LangExtract is licensed under Apache 2.0.
