Data extraction is the process of pulling structured information from unstructured text. BAML supports this with type-safe schemas: you declare the output type, and the model's response is parsed and validated against it automatically.

Basic Data Extraction

Extracting Names from Text

Let’s start with a simple example, extracting names from unstructured text:
extract_names.baml
function ExtractNames(input: string) -> string[] {
  client "openai/gpt-4o"
  prompt #"
    Extract the names from this INPUT:
  
    INPUT:
    ---
    {{ input }}
    ---

    {{ ctx.output_format }}

    Response:
  "#
}

test ExtractNamesTest {
  functions [ExtractNames]
  args {
    input #"
      John Smith and Sarah Johnson met with Dr. Michael Chen 
      to discuss the project. Emily Rodriguez will join next week.
    "#
  }
}

Call the generated function from Python:

from baml_client import b

text = "The meeting was attended by Alice Wang, Bob Miller, and Carol Davis."
names = b.ExtractNames(text)

print(f"Found {len(names)} names:")
for name in names:
    print(f"  - {name}")

# Output:
# Found 3 names:
#   - Alice Wang
#   - Bob Miller
#   - Carol Davis
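
Because the result is a plain list of strings, it can be post-processed with ordinary Python. The sketch below is one possible cleanup step, assuming you want to collapse repeated mentions of the same person; `dedupe_names` is a hypothetical helper, not part of BAML, and the input list stands in for a real `b.ExtractNames` result.

```python
from typing import List


def dedupe_names(names: List[str]) -> List[str]:
    """Collapse repeated names, ignoring case and extra whitespace.

    Hypothetical helper, not part of BAML; the first spelling seen is kept.
    """
    seen = set()
    unique = []
    for name in names:
        cleaned = " ".join(name.split())  # trim and collapse inner whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


# Stand-in for the list returned by b.ExtractNames(text)
names = ["Alice Wang", "Bob Miller", "alice wang", "  Carol  Davis "]
print(dedupe_names(names))
# ['Alice Wang', 'Bob Miller', 'Carol Davis']
```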

Receipt Information Extraction

A common real-world use case is extracting structured data from receipts:
receipt_extractor.baml
class ReceiptItem {
  name string
  description string?
  quantity int
  price float
}

class ReceiptInfo {
  items ReceiptItem[]
  total_cost float?
  venue "barista" | "restaurant" | "grocery" | "other"
  date string?
}

function ExtractReceiptInfo(email: string) -> ReceiptInfo {
  client "openai/gpt-4o"
  prompt #"
    Given the receipt below, extract all items with their details:

    ---
    {{ email }}
    ---

    {{ ctx.output_format }}
  "#
}

Test with Sample Receipt

receipt_extractor.baml
test CafeReceipt {
  functions [ExtractReceiptInfo]
  args {
    email #"
      Thanks for visiting Barista Coffee!
      
      Order #12345
      Date: 2024-03-15
      
      2x Latte - $5.50 each
      1x Croissant - $3.50
      1x Cappuccino - $5.00
      
      Subtotal: $19.50
      Tax: $1.56
      Total: $21.06
    "#
  }
}

Usage in Application

from baml_client import b
from baml_client.types import ReceiptInfo

def process_receipt(receipt_text: str) -> ReceiptInfo:
    receipt = b.ExtractReceiptInfo(receipt_text)
    
    print(f"Venue: {receipt.venue}")
    # total_cost is optional in the schema, so it may be None
    if receipt.total_cost is not None:
        print(f"Total: ${receipt.total_cost:.2f}")
    print("\nItems:")
    
    for item in receipt.items:
        item_total = item.price * item.quantity
        print(f"  {item.quantity}x {item.name}: ${item_total:.2f}")
    
    return receipt

# Example usage
receipt_email = """
Your order from Joe's Pizza

3x Margherita Pizza - $12.99 each
2x Caesar Salad - $8.50 each
1x Garlic Bread - $4.99

Total: $60.96
"""

result = process_receipt(receipt_email)
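
Since `total_cost` and the line items are extracted independently, it can be worth comparing them. The following is a minimal sketch of such a check, not something BAML does for you. The `Item` and `Receipt` dataclasses are stand-ins for the generated `ReceiptItem` and `ReceiptInfo` types so the example runs on its own, and `unexplained_amount` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import List, Optional


# Stand-ins for the generated ReceiptItem / ReceiptInfo types, so the sketch
# runs without baml_client. Only the fields used below are included.
@dataclass
class Item:
    name: str
    quantity: int
    price: float


@dataclass
class Receipt:
    items: List[Item]
    total_cost: Optional[float]


def unexplained_amount(receipt: Receipt) -> Optional[float]:
    """Return total_cost minus the sum of line items, rounded to cents.

    Returns None when no total was extracted. A non-zero result may simply be
    tax or a tip, so treat it as a prompt for review rather than an error.
    """
    if receipt.total_cost is None:
        return None
    items_sum = sum(item.price * item.quantity for item in receipt.items)
    return round(receipt.total_cost - items_sum, 2)


# Values taken from the cafe receipt in the test case
cafe = Receipt(
    items=[
        Item("Latte", 2, 5.50),
        Item("Croissant", 1, 3.50),
        Item("Cappuccino", 1, 5.00),
    ],
    total_cost=21.06,
)
print(unexplained_amount(cafe))  # 1.56, which matches the tax line
```

For production use, `decimal.Decimal` is usually a safer choice than `float` for currency; floats are kept here only because the schema declares `price` and `total_cost` as `float`.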

PII Data Extraction and Scrubbing

Extract personally identifiable information (PII) from documents for compliance and security:
pii_extractor.baml
class PIIData {
  index int
  dataType string
  value string
}

class PIIExtraction {
  privateData PIIData[]
  containsSensitivePII bool @description("E.g. SSN, credit card")
}

function ExtractPII(document: string) -> PIIExtraction {
  client "openai/gpt-4o-mini"
  prompt #"
    Extract all personally identifiable information (PII) from the given document.
    Look for:
    - Names
    - Email addresses
    - Phone numbers
    - Addresses
    - Social security numbers
    - Dates of birth
    - Any other personal data

    {{ ctx.output_format }}

    {{ _.role("user") }} 
    
    {{ document }}
  "#
}

Test PII Extraction

pii_extractor.baml
test BasicPIIExtraction {
  functions [ExtractPII]
  args {
    document #"
      John Doe was born on 01/02/1980. 
      His email is john.doe@email.com and phone is 555-123-4567.
      He lives at 123 Main St, Springfield, IL 62704.
    "#
  }
}

Scrubbing Implementation

from baml_client import b
from typing import Dict, Tuple

def scrub_document(text: str) -> Tuple[str, Dict[str, str]]:
    # Extract PII from the document
    result = b.ExtractPII(text)
    
    scrubbed_text = text
    pii_mapping = {}
    
    # Replace each PII item with a placeholder
    for pii_item in result.privateData:
        pii_type = pii_item.dataType.upper()
        placeholder = f"[{pii_type}_{pii_item.index}]"
        
        # Store the mapping
        pii_mapping[placeholder] = pii_item.value
        
        # Replace in text
        scrubbed_text = scrubbed_text.replace(pii_item.value, placeholder)
    
    return scrubbed_text, pii_mapping

def restore_document(scrubbed_text: str, pii_mapping: Dict[str, str]) -> str:
    """Restore original text using the PII mapping."""
    restored_text = scrubbed_text
    for placeholder, original_value in pii_mapping.items():
        restored_text = restored_text.replace(placeholder, original_value)
    return restored_text

# Example usage
document = """
John Smith works at Tech Corp.
You can reach him at john.smith@techcorp.com
or call 555-0123 during business hours.
"""

scrubbed, mapping = scrub_document(document)
print("Scrubbed:", scrubbed)
# Example output (the exact placeholders depend on the dataType and index
# values the model returns): [NAME_1] works at Tech Corp...

restored = restore_document(scrubbed, mapping)
print("Restored:", restored)
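
One edge case with value-based replacement is overlapping values: if the model returns both "John" and "John Smith", replacing the shorter one first leaves "[NAME_1] Smith" behind and the longer value never matches. A possible variation, sketched below under that assumption, is to replace the longest values first. The tuples stand in for `PIIData` objects and `scrub_longest_first` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# (dataType, index, value) triples standing in for the PIIData objects
# returned by b.ExtractPII; the values are made up for illustration.
PIIItem = Tuple[str, int, str]


def scrub_longest_first(text: str, items: List[PIIItem]) -> Tuple[str, Dict[str, str]]:
    """Replace PII values with placeholders, longest value first."""
    mapping: Dict[str, str] = {}
    ordered = sorted(items, key=lambda item: len(item[2]), reverse=True)
    for data_type, index, value in ordered:
        if not value:
            continue  # replacing an empty string would insert placeholders everywhere
        placeholder = f"[{data_type.upper()}_{index}]"
        mapping[placeholder] = value
        text = text.replace(value, placeholder)
    return text, mapping


items = [
    ("name", 1, "John"),
    ("name", 2, "John Smith"),
    ("email", 3, "john.smith@techcorp.com"),
]
scrubbed, mapping = scrub_longest_first(
    "John Smith wrote from john.smith@techcorp.com. John will call back.", items
)
print(scrubbed)
# [NAME_2] wrote from [EMAIL_3]. [NAME_1] will call back.
```

The mapping has the same shape as in `scrub_document`, so `restore_document` works on it unchanged.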

Extracting Action Items from Meeting Transcripts

Extract structured tasks from meeting notes:
action_items.baml
class Subtask {
  id int
  name string
}

enum Priority {
  HIGH
  MEDIUM
  LOW
}

class Ticket {
  id int
  name string 
  description string
  priority Priority
  assignees string[]
  subtasks Subtask[]
  dependencies int[] @description("IDs of tasks this depends on")
}

function ExtractTasks(transcript: string) -> Ticket[] {
  client "openai/gpt-4o"
  prompt #"
    You are an expert at analyzing meeting transcripts.
    Extract all action items, tasks, and subtasks.
    
    For each task:
    - Generate a unique ID
    - Identify assignees
    - Set appropriate priority
    - List subtasks if any
    - Note dependencies on other tasks

    {{ ctx.output_format }}

    {{ _.role("user") }}
    {{ transcript }}
  "#
}

Test with Meeting Transcript

action_items.baml
test ComplexMeeting {
  functions [ExtractTasks]
  args {
    transcript #"
      Alice: We need to improve the authentication system. High priority.
      Bob: I can lead that. We need front-end and back-end work.
      Carol: I'll handle the front-end part.
      Bob: I'll do the back-end optimization.
      Alice: After auth is done, we need to integrate with billing.
      Bob: I can do the billing system too, but after back-end auth.
      Alice: Finally, update the docs. Lower priority.
      Carol: I'll update docs after my front-end work is done.
    "#
  }
}

Call the generated function from Python:

from baml_client import b
from baml_client.types import Priority

def extract_action_items(transcript: str):
    tasks = b.ExtractTasks(transcript)
    
    # Organize by priority
    high_priority = [t for t in tasks if t.priority == Priority.HIGH]
    medium_priority = [t for t in tasks if t.priority == Priority.MEDIUM]
    low_priority = [t for t in tasks if t.priority == Priority.LOW]
    
    print(f"Found {len(tasks)} tasks")
    print(f"\nHigh Priority ({len(high_priority)}):")
    for task in high_priority:
        print(f"  - {task.name} (assigned to: {', '.join(task.assignees)})")
    
    return tasks

# Example usage
meeting_notes = """
Sarah: We need to launch the new feature by Friday.
Mike: I'll handle the API implementation.
Sarah: Great. We also need to update the UI.
Lisa: I can do the UI updates after Mike finishes the API.
"""

tasks = extract_action_items(meeting_notes)
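
The `dependencies` field makes it possible to work out an order in which tickets can be tackled. A minimal sketch using the standard library's `graphlib` follows. The IDs are hypothetical values a model might assign for the test transcript, and `execution_order` is an illustrative helper, not part of the generated client.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def execution_order(dependencies: Dict[int, List[int]]) -> List[int]:
    """Order ticket IDs so every ticket comes after the tickets it depends on.

    `dependencies` maps a ticket ID to the IDs it depends on, mirroring the
    Ticket.dependencies field. Raises graphlib.CycleError if the
    dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical IDs for the test transcript:
# 1 = back-end auth, 2 = front-end auth, 3 = billing, 4 = docs
deps = {1: [], 2: [], 3: [1], 4: [2]}
order = execution_order(deps)
print(order)  # 1 appears before 3, and 2 appears before 4
```

With real results, the input could be built as `{t.id: t.dependencies for t in tasks}`. The relative order of tickets that do not depend on each other is not guaranteed.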

Best Practices

1. Use Optional Fields for Incomplete Data

class Contact {
  name string
  email string?
  phone string?
  address string?
}
This allows the extraction to succeed even when some information is missing.
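
On the Python side, this means calling code has to handle missing values. The sketch below assumes optional BAML fields arrive as `None`; the `Contact` dataclass is a stand-in for the generated type, and `missing_fields` is a hypothetical helper for reporting what the model could not find.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


# Stand-in for the generated Contact type; optional BAML fields are assumed
# to arrive as None when the model finds no value.
@dataclass
class Contact:
    name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    address: Optional[str] = None


def missing_fields(contact: Contact) -> List[str]:
    """List the names of fields that came back as None."""
    return [f.name for f in fields(contact) if getattr(contact, f.name) is None]


contact = Contact(name="Jane Doe", email="jane@example.com")
print(missing_fields(contact))  # ['phone', 'address']
```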

2. Add Descriptions for Complex Fields

class Invoice {
  invoice_number string
  date string @description("Format: YYYY-MM-DD")
  due_date string @description("Format: YYYY-MM-DD")
  amount float @description("Total amount in USD")
}
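
A description guides the model but the field is still a plain `string`, so the format is worth checking after extraction. A minimal sketch, assuming the date fields are meant to follow the `YYYY-MM-DD` format given in the descriptions above; `parse_invoice_date` is a hypothetical helper.

```python
from datetime import date, datetime
from typing import Optional


def parse_invoice_date(value: str) -> Optional[date]:
    """Parse a YYYY-MM-DD string; return None if it does not parse."""
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None


print(parse_invoice_date("2024-03-15"))      # 2024-03-15
print(parse_invoice_date("March 15, 2024"))  # None
```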

3. Use Enums for Categorical Data

enum DocumentType {
  INVOICE
  RECEIPT
  CONTRACT
  OTHER
}

class Document {
  type DocumentType
  content string
}

4. Validate Extracted Data

from baml_client import b
import re

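def extract_and_validate_contact(text: str):
    # Assumes a BAML function ExtractContact(text: string) -> Contact
    # is defined in your project; it is not shown on this page.
    contact = b.ExtractContact(text)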
    
    # Validate email format
    if contact.email:
        email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        if not re.match(email_pattern, contact.email):
            print(f"Warning: Invalid email format: {contact.email}")
    
    # Validate phone format
    if contact.phone:
        phone_pattern = r'^\d{3}-\d{3}-\d{4}$'
        if not re.match(phone_pattern, contact.phone):
            print(f"Warning: Invalid phone format: {contact.phone}")
    
    return contact
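
The phone pattern above only accepts numbers already written as `NNN-NNN-NNNN`, so a value like "(555) 123-4567" would trigger a warning. One option, sketched here on the assumption that you are dealing with US numbers, is to normalize before validating; `normalize_us_phone` is a hypothetical helper.

```python
import re
from typing import Optional


def normalize_us_phone(raw: str) -> Optional[str]:
    """Return a 10-digit US number as NNN-NNN-NNNN, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(normalize_us_phone("(555) 123-4567"))   # 555-123-4567
print(normalize_us_phone("+1 555.123.4567"))  # 555-123-4567
print(normalize_us_phone("555-0123"))         # None
```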

Next Steps

  • Explore Classification for categorizing extracted data
  • Learn about Tool Calling to take actions with extracted data
  • Check out RAG for extraction with context from knowledge bases
