What is an Extraction Task?

An extraction task in LangExtract defines what you want to extract from unstructured text and how the extraction should be performed. Tasks are specified through two key components:
  1. Prompt Description: Natural language instructions explaining the extraction goals
  2. Few-Shot Examples: High-quality demonstrations showing the expected extraction format and behavior
Extraction tasks are domain-agnostic. You can define tasks for any domain—from literary analysis to medical records—without requiring model fine-tuning.

Defining an Extraction Task

Here’s a complete extraction task for identifying characters and emotions in literary text:
import langextract as lx
import textwrap

# 1. Write clear instructions
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide high-quality examples
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
        ]
    )
]

Few-Shot Learning

Few-shot learning allows you to guide the LLM’s extraction behavior using examples rather than extensive training data. The examples you provide serve as a template for the model to follow.

Why Few-Shot Examples Matter

Examples directly influence:
  • Extraction format: JSON structure, attribute naming conventions
  • Granularity: What level of detail to extract
  • Boundary detection: Where extractions start and end
  • Classification: How to categorize entities
  • Attribute richness: What contextual information to include
LangExtract emits prompt-alignment warnings by default when examples don't follow these best practices. Address these warnings for optimal results.

Best Practices for Examples

1. Use Verbatim Text

Each extraction_text should match the source text exactly—no paraphrasing:
lx.data.Extraction(
    extraction_class="medication",
    extraction_text="aspirin 81mg",  # Exact match from source
    attributes={"dosage": "81mg"}
)

2. Maintain Order of Appearance

List extractions in the order they appear in the source text:
examples = [
    lx.data.ExampleData(
        text="Juliet gazed at Romeo. Romeo smiled back.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="Juliet"),
            lx.data.Extraction(extraction_class="character", extraction_text="Romeo"),
            # Romeo appears twice, first occurrence comes first
        ]
    )
]

3. Avoid Overlapping Entities

Don’t extract text spans that overlap with each other:
extractions=[
    lx.data.Extraction(extraction_class="symptom", extraction_text="severe headache"),
    lx.data.Extraction(extraction_class="symptom", extraction_text="nausea"),
]
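Overlap is easy to detect once spans are resolved to character offsets. A minimal sketch (again a hypothetical helper, not part of LangExtract):

```python
def has_overlap(spans: list[tuple[int, int]]) -> bool:
    """Return True if any (start, end) character spans overlap."""
    ordered = sorted(spans)
    return any(ordered[i][1] > ordered[i + 1][0] for i in range(len(ordered) - 1))

# "severe headache" at 0-15 and "nausea" at 20-26 are disjoint
assert not has_overlap([(0, 15), (20, 26)])
# "severe headache" (0-15) overlaps "headache" (7-15)
assert has_overlap([(0, 15), (7, 15)])
```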

4. Provide Meaningful Attributes

Add attributes that give context and demonstrate the level of detail expected:
lx.data.Extraction(
    extraction_class="relationship",
    extraction_text="Juliet is the sun",
    attributes={
        "type": "metaphor",
        "subject": "Romeo",
        "object": "Juliet"
    }
)

Balancing Text Evidence and LLM Knowledge

Your examples define how much the extraction should rely on:
  • Text evidence: Exact quotes and direct mentions
  • LLM world knowledge: Inferred information and background context

Text-Grounded Extraction

For high precision, keep attributes close to the text:
lx.data.Extraction(
    extraction_class="character",
    extraction_text="Lady Juliet",
    attributes={"emotional_state": "longing"}  # Derived from "gazed longingly"
)

Knowledge-Enhanced Extraction

For richer context, allow the LLM to use its knowledge:
lx.data.Extraction(
    extraction_class="character",
    extraction_text="Lady Juliet",
    attributes={
        "identity": "Capulet family daughter",  # From LLM knowledge
        "literary_context": "tragic heroine"    # From LLM knowledge
    }
)
The balance between text evidence and knowledge-based inference is controlled by your prompt instructions and example attributes. Be explicit about the level of inference you expect.

Domain Adaptation

LangExtract adapts to any domain using the same approach:

Medical Records

prompt = "Extract medications with dosages, routes, and frequencies."
examples = [
    lx.data.ExampleData(
        text="Patient takes aspirin 81mg PO daily",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="aspirin 81mg PO daily",
                attributes={
                    "drug": "aspirin",
                    "dosage": "81mg",
                    "route": "PO",
                    "frequency": "daily"
                }
            )
        ]
    )
]

Legal Contracts

prompt = "Extract legal entities, obligations, and dates."
examples = [
    lx.data.ExampleData(
        text="The Buyer shall pay $100,000 by December 31, 2024.",
        extractions=[
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="The Buyer",
                attributes={"role": "obligated_party"}
            ),
            lx.data.Extraction(
                extraction_class="obligation",
                extraction_text="shall pay $100,000",
                attributes={"amount": "100000", "currency": "USD"}
            ),
        ]
    )
]

Running an Extraction Task

Once defined, run your task with lx.extract():
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

print(f"Extracted {len(result.extractions)} entities")
See extraction.py:36-65 for all available parameters.
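Beyond the raw count, a per-class summary of the results is often useful. This sketch relies only on the extraction_class field shown in the examples above and accepts either Extraction objects or plain dicts:

```python
from collections import Counter

def count_by_class(extractions) -> Counter:
    """Tally extractions per extraction_class; accepts objects or dicts."""
    return Counter(
        e.extraction_class if hasattr(e, "extraction_class") else e["extraction_class"]
        for e in extractions
    )

sample = [
    {"extraction_class": "character", "extraction_text": "ROMEO"},
    {"extraction_class": "character", "extraction_text": "Juliet"},
    {"extraction_class": "emotion", "extraction_text": "But soft!"},
]
assert count_by_class(sample) == {"character": 2, "emotion": 1}
```

After a real run, `count_by_class(result.extractions)` produces the same kind of tally.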

Next Steps

Prompt Engineering

Learn how to write effective prompts and examples

Source Grounding

Understand character-level mapping and alignment