Overview

LangExtract extracts structured information from unstructured text using LLMs. This guide walks you through creating your first extraction task.
Using cloud-hosted models like Gemini requires an API key. See the API Keys guide for setup instructions.
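A minimal sketch of that setup, assuming the `LANGEXTRACT_API_KEY` environment variable described in the project README (see the API Keys guide for `.env` and other options):

```shell
# Export the key for the current shell session
export LANGEXTRACT_API_KEY="your-api-key"
```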

Step-by-Step Guide

1. Define Your Extraction Task

Create a prompt that clearly describes what you want to extract, then provide a high-quality example to guide the model.
import langextract as lx
import textwrap

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]
Examples drive model behavior. Each extraction_text should ideally be verbatim from the example's text (no paraphrasing), listed in order of appearance. LangExtract raises prompt alignment warnings by default if examples don't follow this pattern; resolve these warnings for best results.
2. Run the Extraction

Provide your input text and the prompt materials to the lx.extract function.
# The input text to be processed
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
Model Selection: gemini-2.5-flash is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, gemini-2.5-pro may provide superior results.
Model Lifecycle: Gemini models have a lifecycle with defined retirement dates. Consult the official model version documentation to stay informed about the latest stable and legacy versions.
3. Visualize the Results

Save the extractions to a .jsonl file and generate an interactive HTML visualization to review the entities in context.
# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# Generate the visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)
This creates an animated and interactive HTML file where you can explore the extracted entities highlighted in their original context.

Understanding the Results

The result object is an AnnotatedDocument containing:
  • The original text
  • A list of Extraction objects with:
    • extraction_class: The category of the extracted entity
    • extraction_text: The verbatim text from the source
    • attributes: Additional context as key-value pairs
    • char_interval: The exact position in the source text
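The shape of the result can be sketched with stand-in dataclasses. These are illustrative only, not the real lx.data classes (for example, langextract's char_interval is its own object rather than a plain tuple), but they show how the fields above fit together and how a span maps back into the source text:

```python
from dataclasses import dataclass

# Illustrative stand-ins for langextract's result classes (field names
# mirror the real ones; types are simplified for this sketch).
@dataclass
class Extraction:
    extraction_class: str
    extraction_text: str
    attributes: dict
    char_interval: tuple  # (start, end) offsets into the source text

@dataclass
class AnnotatedDocument:
    text: str
    extractions: list

doc = AnnotatedDocument(
    text="Lady Juliet gazed longingly at the stars, her heart aching for Romeo",
    extractions=[
        Extraction("character", "Lady Juliet", {"emotional_state": "longing"}, (0, 11)),
        Extraction("emotion", "longingly", {"feeling": "yearning"}, (18, 27)),
    ],
)

# Walk the extractions and confirm each span matches the source verbatim
for e in doc.extractions:
    start, end = e.char_interval
    assert doc.text[start:end] == e.extraction_text
    print(f"{e.extraction_class}: {e.extraction_text!r} {e.attributes}")
```

Because extraction_text is verbatim, the char_interval slice always reproduces it exactly, which is what makes the HTML visualization's in-context highlighting possible.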

LLM Knowledge Utilization

This example demonstrates extractions that stay close to the text evidence. The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding "identity": "Capulet family daughter" or "literary_context": "tragic heroine"). The balance between text evidence and knowledge inference is controlled by your prompt instructions and example attributes.
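As a sketch of that distinction, compare two hypothetical attribute dicts for the same Juliet extraction (plain dicts for illustration, not actual library output):

```python
# Attributes grounded strictly in the text evidence
text_evidence = {
    "emotional_state": "longing",  # supported by "gazed longingly"
}

# Attributes that also draw on the LLM's world knowledge of the play
knowledge_inference = {
    "emotional_state": "longing",
    "identity": "Capulet family daughter",
    "literary_context": "tragic heroine",
}

# The knowledge-driven attributes extend, rather than replace, the grounded ones
extra_keys = knowledge_inference.keys() - text_evidence.keys()
print(sorted(extra_keys))  # ['identity', 'literary_context']
```

Steering toward either style is done in the prompt ("use only information stated in the text" vs. "enrich with background knowledge") and in the attributes you put on your examples.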
