Overview

Prompt engineering in LangExtract involves crafting clear instructions and high-quality examples that guide the LLM’s extraction behavior. Unlike traditional NLP systems, LangExtract doesn’t require training data or model fine-tuning—just well-designed prompts.
Your prompt description and examples are the only inputs that control extraction behavior. Small changes can significantly impact results.

Components of an Effective Prompt

A complete extraction prompt has two parts:

1. Prompt Description

Natural language instructions that explain:
  • What to extract: Entity types, relationships, attributes
  • How to extract: Rules, constraints, edge cases
  • Output requirements: Format, ordering, completeness

2. Few-Shot Examples

Concrete demonstrations showing:
  • Expected extraction structure
  • Attribute naming conventions
  • Boundary detection behavior
  • Classification logic

Writing Effective Instructions

Be Specific and Clear

import textwrap

prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

Define Extraction Boundaries

Specify whether to extract:
  • Minimal spans vs. maximal spans
  • Individual words vs. complete phrases
  • Single entities vs. nested entities
prompt = textwrap.dedent("""\
    Extract medication mentions as complete phrases including:
    - Drug name
    - Dosage (if present)
    - Route of administration (if present)
    - Frequency (if present)
    
    Example: "aspirin 81mg PO daily" not just "aspirin""")

Specify Ordering Requirements

LangExtract expects extractions in order of appearance by default:
prompt = textwrap.dedent("""\
    Extract entities in the order they appear in the text.
    If an entity appears multiple times, list each occurrence separately.""")
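The order-of-appearance rule can be sanity-checked with a short sketch (plain Python, not part of the LangExtract API): walk the extractions and confirm each one occurs at a non-decreasing position in the source text.

```python
def in_order_of_appearance(text: str, spans: list[str]) -> bool:
    """Check that each extraction span occurs at a non-decreasing position.

    Searches forward from the previous match, so repeated entities
    ("list each occurrence separately") are handled correctly.
    """
    pos = 0
    for span in spans:
        idx = text.find(span, pos)
        if idx == -1:
            return False  # span missing, or it only occurs earlier in the text
        pos = idx
    return True

print(in_order_of_appearance("Romeo loves Juliet", ["Romeo", "Juliet"]))  # True
print(in_order_of_appearance("Romeo loves Juliet", ["Juliet", "Romeo"]))  # False
```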

Handle Edge Cases

Explicitly address ambiguous situations:
prompt = textwrap.dedent("""\
    Extract character names:
    - Include titles if part of the name (e.g., "Lady Juliet")
    - Skip pronouns ("he", "she") unless they're the only reference
    - Extract formal names on first mention, nicknames on subsequent mentions""")

Crafting High-Quality Examples

Example Quality Matters Most

The model learns more from what you show than what you tell:
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

How Many Examples Do You Need?

Typically 1-3 examples are sufficient:
  • 1 example: For simple, well-defined tasks
  • 2-3 examples: For complex tasks with edge cases
  • 3+ examples: For highly nuanced classification
More examples aren’t always better. Focus on quality over quantity. Each example should demonstrate a different aspect of the task.

Example Diversity

Cover different scenarios in your examples:
examples = [
    # Example 1: Standard case
    lx.data.ExampleData(
        text="Patient takes aspirin 81mg daily.",
        extractions=[...]
    ),
    
    # Example 2: Multiple medications
    lx.data.ExampleData(
        text="Patient takes aspirin 81mg and lisinopril 10mg.",
        extractions=[...]
    ),
    
    # Example 3: Complex dosing
    lx.data.ExampleData(
        text="Patient takes metformin 500mg twice daily with meals.",
        extractions=[...]
    ),
]

Prompt Validation

LangExtract validates your examples against best practices using the prompt validation system (see extraction.py:181-192).

Validation Levels

from langextract import prompt_validation as pv

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    prompt_validation_level=pv.PromptValidationLevel.WARNING,  # Default
)
Available levels:
  • OFF: Skip validation
  • WARNING: Log issues but continue (default)
  • ERROR: Raise exception on validation failure
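The semantics of the three levels can be sketched in a few lines. This is an illustrative stand-in, not LangExtract's actual validator or its `pv.PromptValidationLevel` class:

```python
from enum import Enum

class ValidationLevel(Enum):
    # Illustrative stand-in for pv.PromptValidationLevel.
    OFF = "off"
    WARNING = "warning"
    ERROR = "error"

def report(level: ValidationLevel, issues: list[str]) -> list[str]:
    """Apply the three levels' semantics to a list of validation issues."""
    if level is ValidationLevel.OFF or not issues:
        return []  # skip validation, or nothing to report
    if level is ValidationLevel.ERROR:
        raise ValueError("; ".join(issues))  # fail fast
    return [f"warning: {msg}" for msg in issues]  # log and continue

print(report(ValidationLevel.WARNING, ["extraction text not found in source"]))
# ['warning: extraction text not found in source']
```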

Common Validation Warnings

"Extraction text not found in source"

# Bad: Paraphrased extraction
lx.data.ExampleData(
    text="Patient has severe headache",
    extractions=[
        lx.data.Extraction(
            extraction_class="symptom",
            extraction_text="bad headache"  # Not verbatim!
        )
    ]
)

# Good: Verbatim extraction
lx.data.ExampleData(
    text="Patient has severe headache",
    extractions=[
        lx.data.Extraction(
            extraction_class="symptom",
            extraction_text="severe headache"  # Exact match
        )
    ]
)
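The verbatim rule is simply a substring check: the extraction text must appear exactly in the source. A minimal sketch of that check (not the library's implementation):

```python
def is_verbatim(text: str, extraction_text: str) -> bool:
    """The extraction must appear exactly, character for character, in the source."""
    return extraction_text in text

print(is_verbatim("Patient has severe headache", "bad headache"))     # False
print(is_verbatim("Patient has severe headache", "severe headache"))  # True
```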

"Extractions not in order"

# Bad: Out of order
lx.data.ExampleData(
    text="Romeo loves Juliet",
    extractions=[
        lx.data.Extraction(extraction_class="character", extraction_text="Juliet"),  # Appears second
        lx.data.Extraction(extraction_class="character", extraction_text="Romeo"),   # Appears first
    ]
)

# Good: Correct order
lx.data.ExampleData(
    text="Romeo loves Juliet",
    extractions=[
        lx.data.Extraction(extraction_class="character", extraction_text="Romeo"),
        lx.data.Extraction(extraction_class="character", extraction_text="Juliet"),
    ]
)

Controlling LLM Knowledge Usage

You can tune how much the LLM relies on its world knowledge vs. text evidence.

Conservative (Text-Grounded)

Stay close to explicit text:
prompt = textwrap.dedent("""\
    Extract only information explicitly stated in the text.
    Do not infer or add information from external knowledge.""")

examples = [
    lx.data.ExampleData(
        text="Lady Juliet gazed longingly at the stars",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="Lady Juliet",
                attributes={"emotional_state": "longing"}  # From "longingly"
            )
        ]
    )
]

Leveraging LLM Knowledge

Allow inference and enrichment:
prompt = textwrap.dedent("""\
    Extract characters with contextual information.
    Use your knowledge of the source material to enrich attributes.""")

examples = [
    lx.data.ExampleData(
        text="Lady Juliet gazed longingly at the stars",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="Lady Juliet",
                attributes={
                    "emotional_state": "longing",
                    "family": "Capulet",           # From LLM knowledge
                    "literary_role": "protagonist" # From LLM knowledge
                }
            )
        ]
    )
]
The accuracy of inferred information depends on:
  • Selected LLM capabilities
  • Task complexity
  • Clarity of prompt instructions
  • Quality of few-shot examples

Attribute Design

Attributes add structured context to extractions. Design them thoughtfully:

Consistent Naming

Use consistent attribute names across examples:
# Good: Consistent naming
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[
            lx.data.Extraction(attributes={"dosage": "81mg"}),
            lx.data.Extraction(attributes={"dosage": "10mg"}),
        ]
    )
]

# Bad: Inconsistent naming
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[
            lx.data.Extraction(attributes={"dosage": "81mg"}),
            lx.data.Extraction(attributes={"dose": "10mg"}),  # Different key!
        ]
    )
]
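Inconsistent keys like `dosage` vs. `dose` are easy to lint for before running an extraction. A small sketch using plain `(class, attributes)` tuples as stand-ins for `lx.data.Extraction` objects (this helper is hypothetical, not part of LangExtract):

```python
from collections import defaultdict

def inconsistent_attribute_keys(extractions):
    """Return extraction classes whose examples use differing attribute key sets.

    `extractions` is a list of (extraction_class, attributes_dict) pairs,
    a plain-Python stand-in for lx.data.Extraction objects.
    """
    keys_by_class = defaultdict(set)
    for cls, attrs in extractions:
        keys_by_class[cls].add(frozenset(attrs))  # frozenset of the dict's keys
    return {cls for cls, keysets in keys_by_class.items() if len(keysets) > 1}

print(inconsistent_attribute_keys([
    ("medication", {"dosage": "81mg"}),
    ("medication", {"dose": "10mg"}),  # different key -> flagged
]))
# {'medication'}
```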

Meaningful Values

Provide descriptive attribute values:
# Good: Descriptive
lx.data.Extraction(
    extraction_class="emotion",
    extraction_text="But soft!",
    attributes={"feeling": "gentle awe", "intensity": "moderate"}
)

# Less useful: Vague
lx.data.Extraction(
    extraction_class="emotion",
    extraction_text="But soft!",
    attributes={"type": "emotion"}  # Too generic
)

Iterative Refinement

Prompt engineering is iterative:
  1. Start simple: Basic prompt + 1 example
  2. Run extraction: Test on sample text
  3. Review results: Identify failure patterns
  4. Refine prompt: Add instructions for edge cases
  5. Add examples: Demonstrate correct behavior
  6. Repeat: Continue until results meet requirements

Example Iteration

# Iteration 1: Basic prompt
prompt_v1 = "Extract character names."
# Problem: Extracted pronouns

# Iteration 2: Add constraint
prompt_v2 = "Extract character names. Do not extract pronouns."
# Problem: Missed titles like "Lady Juliet"

# Iteration 3: Clarify boundaries
prompt_v3 = textwrap.dedent("""\
    Extract character names including titles.
    Do not extract pronouns.""")
# Success!

Best Practices Summary

  • ✅ Write specific, detailed instructions
  • ✅ Use 1-3 high-quality examples
  • ✅ Ensure examples use verbatim text
  • ✅ Maintain order of appearance
  • ✅ Avoid overlapping extractions
  • ✅ Use consistent attribute naming
  • ✅ Address edge cases explicitly
  • ✅ Validate prompts before production use
  • ✅ Iterate based on results

Next Steps

Extraction Tasks

Learn more about defining extraction tasks

Chunking Strategy

Optimize extraction for long documents
