Overview

Prompt engineering in LangExtract involves crafting clear instructions and high-quality examples that guide the LLM’s extraction behavior. Unlike traditional NLP systems, LangExtract doesn’t require training data or model fine-tuning—just well-designed prompts.
Your prompt description and examples are the only inputs that control extraction behavior. Small changes can significantly impact results.

Components of an Effective Prompt

A complete extraction prompt has two parts:

1. Prompt Description

Natural language instructions that explain:
  • What to extract: Entity types, relationships, attributes
  • How to extract: Rules, constraints, edge cases
  • Output requirements: Format, ordering, completeness

2. Few-Shot Examples

Concrete demonstrations showing:
  • Expected extraction structure
  • Attribute naming conventions
  • Boundary detection behavior
  • Classification logic

Writing Effective Instructions

Be Specific and Clear

import textwrap

prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

Define Extraction Boundaries

Specify whether to extract:
  • Minimal spans vs. maximal spans
  • Individual words vs. complete phrases
  • Single entities vs. nested entities
prompt = textwrap.dedent("""\
    Extract medication mentions as complete phrases including:
    - Drug name
    - Dosage (if present)
    - Route of administration (if present)
    - Frequency (if present)
    
    Example: "aspirin 81mg PO daily" not just "aspirin""")

Specify Ordering Requirements

LangExtract expects extractions in order of appearance by default:
prompt = textwrap.dedent("""\
    Extract entities in the order they appear in the text.
    If an entity appears multiple times, list each occurrence separately.""")
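The order-of-appearance rule can be sanity-checked with a short sketch (plain Python, not part of the LangExtract API): walk the extractions and confirm each one occurs at a non-decreasing position in the source text.

```python
def in_order_of_appearance(text: str, spans: list[str]) -> bool:
    """Check that each extraction span occurs at a non-decreasing position.

    Searches forward from the previous match, so repeated entities
    ("list each occurrence separately") are handled correctly.
    """
    pos = 0
    for span in spans:
        idx = text.find(span, pos)
        if idx == -1:
            return False  # span missing, or it only occurs earlier in the text
        pos = idx
    return True

print(in_order_of_appearance("Romeo loves Juliet", ["Romeo", "Juliet"]))  # True
print(in_order_of_appearance("Romeo loves Juliet", ["Juliet", "Romeo"]))  # False
```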

Handle Edge Cases

Explicitly address ambiguous situations:
prompt = textwrap.dedent("""\
    Extract character names:
    - Include titles if part of the name (e.g., "Lady Juliet")
    - Skip pronouns ("he", "she") unless they're the only reference
    - Extract formal names on first mention, nicknames on subsequent mentions""")

Crafting High-Quality Examples

Example Quality Matters Most

The model learns more from what you show than what you tell:
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

How Many Examples Do You Need?

Typically 1-3 examples are sufficient:
  • 1 example: For simple, well-defined tasks
  • 2-3 examples: For complex tasks with edge cases
  • 3+ examples: For highly nuanced classification
More examples aren’t always better. Focus on quality over quantity. Each example should demonstrate a different aspect of the task.

Example Diversity

Cover different scenarios in your examples:
examples = [
    # Example 1: Standard case
    lx.data.ExampleData(
        text="Patient takes aspirin 81mg daily.",
        extractions=[...]
    ),
    
    # Example 2: Multiple medications
    lx.data.ExampleData(
        text="Patient takes aspirin 81mg and lisinopril 10mg.",
        extractions=[...]
    ),
    
    # Example 3: Complex dosing
    lx.data.ExampleData(
        text="Patient takes metformin 500mg twice daily with meals.",
        extractions=[...]
    ),
]

Prompt Validation

LangExtract validates your examples against best practices using the prompt validation system (see extraction.py:181-192).

Validation Levels

from langextract import prompt_validation as pv

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    prompt_validation_level=pv.PromptValidationLevel.WARNING,  # Default
)
Available levels:
  • OFF: Skip validation
  • WARNING: Log issues but continue (default)
  • ERROR: Raise exception on validation failure
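The semantics of the three levels can be sketched in a few lines. This is an illustrative stand-in, not LangExtract's actual validator or its `pv.PromptValidationLevel` class:

```python
from enum import Enum

class ValidationLevel(Enum):
    # Illustrative stand-in for pv.PromptValidationLevel.
    OFF = "off"
    WARNING = "warning"
    ERROR = "error"

def report(level: ValidationLevel, issues: list[str]) -> list[str]:
    """Apply the three levels' semantics to a list of validation issues."""
    if level is ValidationLevel.OFF or not issues:
        return []  # skip validation, or nothing to report
    if level is ValidationLevel.ERROR:
        raise ValueError("; ".join(issues))  # fail fast
    return [f"warning: {msg}" for msg in issues]  # log and continue

print(report(ValidationLevel.WARNING, ["extraction text not found in source"]))
# ['warning: extraction text not found in source']
```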

Common Validation Warnings

"Extraction text not found in source"

# Bad: Paraphrased extraction
lx.data.ExampleData(
    text="Patient has severe headache",
    extractions=[
        lx.data.Extraction(
            extraction_class="symptom",
            extraction_text="bad headache"  # Not verbatim!
        )
    ]
)

# Good: Verbatim extraction
lx.data.ExampleData(
    text="Patient has severe headache",
    extractions=[
        lx.data.Extraction(
            extraction_class="symptom",
            extraction_text="severe headache"  # Exact match
        )
    ]
)
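The verbatim rule is simply a substring check: the extraction text must appear exactly in the source. A minimal sketch of that check (not the library's implementation):

```python
def is_verbatim(text: str, extraction_text: str) -> bool:
    """The extraction must appear exactly, character for character, in the source."""
    return extraction_text in text

print(is_verbatim("Patient has severe headache", "bad headache"))     # False
print(is_verbatim("Patient has severe headache", "severe headache"))  # True
```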

"Extractions not in order"

# Bad: Out of order
lx.data.ExampleData(
    text="Romeo loves Juliet",
    extractions=[
        lx.data.Extraction(extraction_class="character", extraction_text="Juliet"),  # Appears second
        lx.data.Extraction(extraction_class="character", extraction_text="Romeo"),   # Appears first
    ]
)

# Good: Correct order
lx.data.ExampleData(
    text="Romeo loves Juliet",
    extractions=[
        lx.data.Extraction(extraction_class="character", extraction_text="Romeo"),
        lx.data.Extraction(extraction_class="character", extraction_text="Juliet"),
    ]
)

Controlling LLM Knowledge Usage

You can tune how much the LLM relies on its world knowledge vs. text evidence.

Conservative (Text-Grounded)

Stay close to explicit text:
prompt = textwrap.dedent("""\
    Extract only information explicitly stated in the text.
    Do not infer or add information from external knowledge.""")

examples = [
    lx.data.ExampleData(
        text="Lady Juliet gazed longingly at the stars",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="Lady Juliet",
                attributes={"emotional_state": "longing"}  # From "longingly"
            )
        ]
    )
]

Leveraging LLM Knowledge

Allow inference and enrichment:
prompt = textwrap.dedent("""\
    Extract characters with contextual information.
    Use your knowledge of the source material to enrich attributes.""")

examples = [
    lx.data.ExampleData(
        text="Lady Juliet gazed longingly at the stars",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="Lady Juliet",
                attributes={
                    "emotional_state": "longing",
                    "family": "Capulet",           # From LLM knowledge
                    "literary_role": "protagonist" # From LLM knowledge
                }
            )
        ]
    )
]
The accuracy of inferred information depends on:
  • Selected LLM capabilities
  • Task complexity
  • Clarity of prompt instructions
  • Quality of few-shot examples

Attribute Design

Attributes add structured context to extractions. Design them thoughtfully:

Consistent Naming

Use consistent attribute names across examples:
# Good: Consistent naming
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[
            lx.data.Extraction(attributes={"dosage": "81mg"}),
            lx.data.Extraction(attributes={"dosage": "10mg"}),
        ]
    )
]

# Bad: Inconsistent naming
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[
            lx.data.Extraction(attributes={"dosage": "81mg"}),
            lx.data.Extraction(attributes={"dose": "10mg"}),  # Different key!
        ]
    )
]
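Inconsistent keys like `dosage` vs. `dose` are easy to lint for before running an extraction. A small sketch using plain `(class, attributes)` tuples as stand-ins for `lx.data.Extraction` objects (this helper is hypothetical, not part of LangExtract):

```python
from collections import defaultdict

def inconsistent_attribute_keys(extractions):
    """Return extraction classes whose examples use differing attribute key sets.

    `extractions` is a list of (extraction_class, attributes_dict) pairs,
    a plain-Python stand-in for lx.data.Extraction objects.
    """
    keys_by_class = defaultdict(set)
    for cls, attrs in extractions:
        keys_by_class[cls].add(frozenset(attrs))  # frozenset of the dict's keys
    return {cls for cls, keysets in keys_by_class.items() if len(keysets) > 1}

print(inconsistent_attribute_keys([
    ("medication", {"dosage": "81mg"}),
    ("medication", {"dose": "10mg"}),  # different key -> flagged
]))
# {'medication'}
```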

Meaningful Values

Provide descriptive attribute values:
# Good: Descriptive
lx.data.Extraction(
    extraction_class="emotion",
    extraction_text="But soft!",
    attributes={"feeling": "gentle awe", "intensity": "moderate"}
)

# Less useful: Vague
lx.data.Extraction(
    extraction_class="emotion",
    extraction_text="But soft!",
    attributes={"type": "emotion"}  # Too generic
)

Iterative Refinement

Prompt engineering is iterative:
  1. Start simple: Basic prompt + 1 example
  2. Run extraction: Test on sample text
  3. Review results: Identify failure patterns
  4. Refine prompt: Add instructions for edge cases
  5. Add examples: Demonstrate correct behavior
  6. Repeat: Continue until results meet requirements

Example Iteration

# Iteration 1: Basic prompt
prompt_v1 = "Extract character names."
# Problem: Extracted pronouns

# Iteration 2: Add constraint
prompt_v2 = "Extract character names. Do not extract pronouns."
# Problem: Missed titles like "Lady Juliet"

# Iteration 3: Clarify boundaries
prompt_v3 = textwrap.dedent("""\
    Extract character names including titles.
    Do not extract pronouns.""")
# Success!

Best Practices Summary

  • ✅ Write specific, detailed instructions
  • ✅ Use 1-3 high-quality examples
  • ✅ Ensure examples use verbatim text
  • ✅ Maintain order of appearance
  • ✅ Avoid overlapping extractions
  • ✅ Use consistent attribute naming
  • ✅ Address edge cases explicitly
  • ✅ Validate prompts before production use
  • ✅ Iterate based on results

Next Steps

Extraction Tasks

Learn more about defining extraction tasks

Chunking Strategy

Optimize extraction for long documents
