Overview

LangExtract extracts structured information from unstructured text using LLMs. This guide walks you through creating your first extraction task.
Using cloud-hosted models like Gemini requires an API key. See the API Keys guide for setup instructions.
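A minimal sketch of that setup, assuming the `LANGEXTRACT_API_KEY` environment variable described in the project README (see the API Keys guide for `.env` and other options):

```shell
# Export the key for the current shell session
export LANGEXTRACT_API_KEY="your-api-key"
```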

Step-by-Step Guide

1. Define Your Extraction Task

Create a prompt that clearly describes what you want to extract, then provide a high-quality example to guide the model.
import langextract as lx
import textwrap

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]
Examples drive model behavior. Each extraction_text should ideally be verbatim from the example's text (no paraphrasing), listed in order of appearance. LangExtract raises prompt alignment warnings by default if examples don't follow this pattern; resolve these warnings for best results.
2. Run the Extraction

Provide your input text and the prompt materials to the lx.extract function.
# The input text to be processed
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
Model Selection: gemini-2.5-flash is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, gemini-2.5-pro may provide superior results.
Model Lifecycle: Gemini models have a lifecycle with defined retirement dates. Consult the official model version documentation to stay informed about the latest stable and legacy versions.
3. Visualize the Results

Save the extractions to a .jsonl file and generate an interactive HTML visualization to review the entities in context.
# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# Generate the visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)
This creates an animated and interactive HTML file where you can explore the extracted entities highlighted in their original context.

Understanding the Results

The result object is an AnnotatedDocument containing:
  • The original text
  • A list of Extraction objects with:
    • extraction_class: The category of the extracted entity
    • extraction_text: The verbatim text from the source
    • attributes: Additional context as key-value pairs
    • char_interval: The exact position in the source text
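The shape of the result can be sketched with stand-in dataclasses. These are illustrative only, not the real lx.data classes (for example, langextract's char_interval is its own object rather than a plain tuple), but they show how the fields above fit together and how a span maps back into the source text:

```python
from dataclasses import dataclass

# Illustrative stand-ins for langextract's result classes (field names
# mirror the real ones; types are simplified for this sketch).
@dataclass
class Extraction:
    extraction_class: str
    extraction_text: str
    attributes: dict
    char_interval: tuple  # (start, end) offsets into the source text

@dataclass
class AnnotatedDocument:
    text: str
    extractions: list

doc = AnnotatedDocument(
    text="Lady Juliet gazed longingly at the stars, her heart aching for Romeo",
    extractions=[
        Extraction("character", "Lady Juliet", {"emotional_state": "longing"}, (0, 11)),
        Extraction("emotion", "longingly", {"feeling": "yearning"}, (18, 27)),
    ],
)

# Walk the extractions and confirm each span matches the source verbatim
for e in doc.extractions:
    start, end = e.char_interval
    assert doc.text[start:end] == e.extraction_text
    print(f"{e.extraction_class}: {e.extraction_text!r} {e.attributes}")
```

Because extraction_text is verbatim, the char_interval slice always reproduces it exactly, which is what makes the HTML visualization's in-context highlighting possible.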

LLM Knowledge Utilization

This example demonstrates extractions that stay close to the text evidence. The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding "identity": "Capulet family daughter" or "literary_context": "tragic heroine"). The balance between text evidence and knowledge inference is controlled by your prompt instructions and example attributes.
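As a sketch of that distinction, compare two hypothetical attribute dicts for the same Juliet extraction (plain dicts for illustration, not actual library output):

```python
# Attributes grounded strictly in the text evidence
text_evidence = {
    "emotional_state": "longing",  # supported by "gazed longingly"
}

# Attributes that also draw on the LLM's world knowledge of the play
knowledge_inference = {
    "emotional_state": "longing",
    "identity": "Capulet family daughter",
    "literary_context": "tragic heroine",
}

# The knowledge-driven attributes extend, rather than replace, the grounded ones
extra_keys = knowledge_inference.keys() - text_evidence.keys()
print(sorted(extra_keys))  # ['identity', 'literary_context']
```

Steering toward either style is done in the prompt ("use only information stated in the text" vs. "enrich with background knowledge") and in the attributes you put on your examples.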
