What is an Extraction Task?

An extraction task in LangExtract defines what you want to extract from unstructured text and how the extraction should be performed. Tasks are specified through two key components:
  1. Prompt Description: Natural language instructions explaining the extraction goals
  2. Few-Shot Examples: High-quality demonstrations showing the expected extraction format and behavior
Extraction tasks are domain-agnostic. You can define tasks for any domain—from literary analysis to medical records—without requiring model fine-tuning.

Defining an Extraction Task

Here’s a complete extraction task for identifying characters and emotions in literary text:
import langextract as lx
import textwrap

# 1. Write clear instructions
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide high-quality examples
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
        ]
    )
]

Few-Shot Learning

Few-shot learning allows you to guide the LLM’s extraction behavior using examples rather than extensive training data. The examples you provide serve as a template for the model to follow.

Why Few-Shot Examples Matter

Examples directly influence:
  • Extraction format: JSON structure, attribute naming conventions
  • Granularity: What level of detail to extract
  • Boundary detection: Where extractions start and end
  • Classification: How to categorize entities
  • Attribute richness: What contextual information to include
LangExtract emits prompt-alignment warnings by default when examples don't follow these best practices. Address these warnings for optimal results.

Best Practices for Examples

1. Use Verbatim Text

Each extraction_text should match the source text exactly—no paraphrasing:
lx.data.Extraction(
    extraction_class="medication",
    extraction_text="aspirin 81mg",  # Exact match from source
    attributes={"dosage": "81mg"}
)

2. Maintain Order of Appearance

List extractions in the order they appear in the source text:
examples = [
    lx.data.ExampleData(
        text="Juliet gazed at Romeo. Romeo smiled back.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="Juliet"),
            lx.data.Extraction(extraction_class="character", extraction_text="Romeo"),
            # Romeo appears twice, first occurrence comes first
        ]
    )
]

3. Avoid Overlapping Entities

Don’t extract text spans that overlap with each other:
extractions=[
    lx.data.Extraction(extraction_class="symptom", extraction_text="severe headache"),
    lx.data.Extraction(extraction_class="symptom", extraction_text="nausea"),
]
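Overlap is easy to detect once spans are resolved to character offsets. A minimal sketch (again a hypothetical helper, not part of LangExtract):

```python
def has_overlap(spans: list[tuple[int, int]]) -> bool:
    """Return True if any (start, end) character spans overlap."""
    ordered = sorted(spans)
    return any(ordered[i][1] > ordered[i + 1][0] for i in range(len(ordered) - 1))

# "severe headache" at 0-15 and "nausea" at 20-26 are disjoint
assert not has_overlap([(0, 15), (20, 26)])
# "severe headache" (0-15) overlaps "headache" (7-15)
assert has_overlap([(0, 15), (7, 15)])
```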

4. Provide Meaningful Attributes

Add attributes that give context and demonstrate the level of detail expected:
lx.data.Extraction(
    extraction_class="relationship",
    extraction_text="Juliet is the sun",
    attributes={
        "type": "metaphor",
        "subject": "Romeo",
        "object": "Juliet"
    }
)

Balancing Text Evidence and LLM Knowledge

Your examples define how much the extraction should rely on:
  • Text evidence: Exact quotes and direct mentions
  • LLM world knowledge: Inferred information and background context

Text-Grounded Extraction

For high precision, keep attributes close to the text:
lx.data.Extraction(
    extraction_class="character",
    extraction_text="Lady Juliet",
    attributes={"emotional_state": "longing"}  # Derived from "gazed longingly"
)

Knowledge-Enhanced Extraction

For richer context, allow the LLM to use its knowledge:
lx.data.Extraction(
    extraction_class="character",
    extraction_text="Lady Juliet",
    attributes={
        "identity": "Capulet family daughter",  # From LLM knowledge
        "literary_context": "tragic heroine"    # From LLM knowledge
    }
)
The balance between text evidence and knowledge-based inference is controlled by your prompt instructions and example attributes. Be explicit about the level of inference you expect.

Domain Adaptation

LangExtract adapts to any domain using the same approach:

Medical Records

prompt = "Extract medications with dosages, routes, and frequencies."
examples = [
    lx.data.ExampleData(
        text="Patient takes aspirin 81mg PO daily",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="aspirin 81mg PO daily",
                attributes={
                    "drug": "aspirin",
                    "dosage": "81mg",
                    "route": "PO",
                    "frequency": "daily"
                }
            )
        ]
    )
]

Legal Contracts

prompt = "Extract legal entities, obligations, and dates."
examples = [
    lx.data.ExampleData(
        text="The Buyer shall pay $100,000 by December 31, 2024.",
        extractions=[
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="The Buyer",
                attributes={"role": "obligated_party"}
            ),
            lx.data.Extraction(
                extraction_class="obligation",
                extraction_text="shall pay $100,000",
                attributes={"amount": "100000", "currency": "USD"}
            ),
        ]
    )
]

Running an Extraction Task

Once defined, run your task with lx.extract():
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

print(f"Extracted {len(result.extractions)} entities")
See extraction.py:36-65 for all available parameters.
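Beyond the raw count, a per-class summary of the results is often useful. This sketch relies only on the extraction_class field shown in the examples above and accepts either Extraction objects or plain dicts:

```python
from collections import Counter

def count_by_class(extractions) -> Counter:
    """Tally extractions per extraction_class; accepts objects or dicts."""
    return Counter(
        e.extraction_class if hasattr(e, "extraction_class") else e["extraction_class"]
        for e in extractions
    )

sample = [
    {"extraction_class": "character", "extraction_text": "ROMEO"},
    {"extraction_class": "character", "extraction_text": "Juliet"},
    {"extraction_class": "emotion", "extraction_text": "But soft!"},
]
assert count_by_class(sample) == {"character": 2, "emotion": 1}
```

After a real run, `count_by_class(result.extractions)` produces the same kind of tally.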

Next Steps

Prompt Engineering

Learn how to write effective prompts and examples

Source Grounding

Understand character-level mapping and alignment