Overview
LangExtract extracts structured information from unstructured text using LLMs. This guide walks you through creating your first extraction task.

Note: Using cloud-hosted models like Gemini requires an API key. See the API Keys guide for setup instructions.
Step-by-Step Guide
Define Your Extraction Task
Create a prompt that clearly describes what you want to extract, then provide a high-quality example to guide the model.
Examples drive model behavior. Each extraction_text should ideally appear verbatim in the example's text (no paraphrasing), with extractions listed in order of appearance. By default, LangExtract raises prompt-alignment warnings when examples don't follow this pattern; resolve these warnings for best results.

Run the Extraction
Provide your input text and the prompt materials to the lx.extract function.

Model Selection: gemini-2.5-flash is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, gemini-2.5-pro may provide superior results.

Understanding the Results
The result object is an AnnotatedDocument containing:
- The original text
- A list of Extraction objects, each with:
  - extraction_class: The category of the extracted entity
  - extraction_text: The verbatim text from the source
  - attributes: Additional context as key-value pairs
  - char_interval: The exact position in the source text
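The loop below sketches how these fields can be inspected. The `result` here is a hypothetical stand-in shaped like an AnnotatedDocument so the snippet is self-contained; in practice it would be the object returned by lx.extract.

```python
from types import SimpleNamespace

# Hypothetical sample data shaped like an AnnotatedDocument, for illustration
# only; a real `result` comes from lx.extract.
result = SimpleNamespace(
    text="ROMEO. But soft!",
    extractions=[
        SimpleNamespace(
            extraction_class="character",
            extraction_text="ROMEO",
            attributes={"emotional_state": "wonder"},
            char_interval=SimpleNamespace(start_pos=0, end_pos=5),
        )
    ],
)

for extraction in result.extractions:
    span = extraction.char_interval
    print(f"{extraction.extraction_class}: {extraction.extraction_text!r}")
    print(f"  attributes: {extraction.attributes}")
    print(f"  position: chars [{span.start_pos}, {span.end_pos})")
```

Because char_interval points back into the source, slicing the original text with it recovers the extracted span exactly.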
LLM Knowledge Utilization
This example demonstrates extractions that stay close to the text evidence. The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding "identity": "Capulet family daughter" or "literary_context": "tragic heroine"). The balance between text evidence and knowledge inference is controlled by your prompt instructions and example attributes.

Next Steps
- Learn how to process long documents with parallel processing
- Explore visualization options for your extractions
- Configure different model providers
- Set up API keys for production use