TypeAgent provides comprehensive email integration for ingesting and querying email conversations from multiple sources.

Email Workflow Overview

Working with emails follows a three-step workflow:
1. Download Emails: Fetch raw .eml files from your email provider
2. Ingest Emails: Parse and index emails into a TypeAgent database
3. Query Emails: Ask natural language questions about your email content

Email Message Format

TypeAgent represents emails using the EmailMessage class:
from typeagent.emails.email_message import EmailMessage, EmailMessageMeta

message = EmailMessage(
    text_chunks=["Subject: Project Update", "The project is on track..."],
    metadata=EmailMessageMeta(
        sender="alice@example.com",
        recipients=["bob@example.com", "carol@example.com"],
        cc=["dave@example.com"],
        subject="Project Update",
        id="<message-id@example.com>"
    ),
    timestamp="2024-01-15T10:30:00Z",
    src_url="/path/to/email.eml"
)

Metadata Structure

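An email carries the following fields. In the example above, timestamp and src_url are passed to EmailMessage itself, and the remaining fields to EmailMessageMeta: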
  • sender: From address
  • recipients: To addresses (list)
  • cc: CC addresses (list)
  • bcc: BCC addresses (list, if available)
  • subject: Email subject line
  • id: Message-ID header
  • timestamp: ISO 8601 timestamp
  • src_url: Source file path or identifier

Importing Email Files

TypeAgent provides utilities for importing .eml files:
from typeagent.emails.email_import import import_email_from_file

# Import single email
email = import_email_from_file("message.eml")

print(f"From: {email.metadata.sender}")
print(f"To: {', '.join(email.metadata.recipients)}")
print(f"Subject: {email.metadata.subject}")
print(f"Body chunks: {len(email.text_chunks)}")

Import from Directory

from typeagent.emails.email_import import import_emails_from_dir

# Import all .eml files from directory
for email in import_emails_from_dir("inbox_dump"):
    print(f"Imported: {email.metadata.subject}")

Import from String

from typeagent.emails.email_import import import_email_string

# Import from MIME string
with open("message.eml", "r") as f:
    mime_string = f.read()

email = import_email_string(mime_string)
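
For orientation, the sketch below shows what a MIME string contains, using only Python's standard-library email package. It is not TypeAgent's implementation of import_email_string; the sample message and the helper name summarize_mime are made up for illustration.

```python
from email import message_from_string, policy

# A tiny hand-written message (illustrative data only).
SAMPLE_MIME = (
    "From: Alice Smith <alice@example.com>\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Project Update\n"
    "Message-ID: <message-id@example.com>\n"
    "Date: Mon, 15 Jan 2024 10:30:00 +0000\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "The project is on track.\n"
)


def summarize_mime(mime_string: str) -> dict:
    """Parse a MIME string and return a few headers plus the plain-text body."""
    msg = message_from_string(mime_string, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    return {
        "sender": str(msg["From"]),
        "recipients": [addr.addr_spec for addr in msg["To"].addresses],
        "subject": str(msg["Subject"]),
        "body": body_part.get_content().strip() if body_part else "",
    }


summary = summarize_mime(SAMPLE_MIME)
print(summary["subject"], summary["recipients"])
```

The headers map naturally onto the metadata fields listed earlier (sender, recipients, subject), while the body is what ends up in the text chunks.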

Email Ingestion Tool

The ingest_email.py tool provides a complete email ingestion pipeline:
# Basic ingestion
python tools/ingest_email.py -d emails.db inbox_dump/

# Ingest specific files
python tools/ingest_email.py -d emails.db msg1.eml msg2.eml

# Verbose output
python tools/ingest_email.py -d emails.db inbox_dump/ --verbose

Date Filtering

Filter emails by date range:
# Ingest only January 2024 emails
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --start-date 2024-01-01 \
    --stop-date 2024-02-01

# Date range is [start, stop) - start inclusive, stop exclusive
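
The half-open range can be sketched as a small predicate. This is an illustration of the [start, stop) rule only, not the tool's actual code; the function name in_date_range is hypothetical, and the sketch compares the calendar date as written in the timestamp without converting time zones.

```python
from datetime import date, datetime
from typing import Optional


def in_date_range(timestamp: str, start: Optional[date], stop: Optional[date]) -> bool:
    """Return True when the timestamp's date falls in [start, stop)."""
    # Older Python versions reject a trailing "Z" in fromisoformat().
    day = datetime.fromisoformat(timestamp.replace("Z", "+00:00")).date()
    if start is not None and day < start:
        return False  # before the inclusive lower bound
    if stop is not None and day >= stop:
        return False  # on or after the exclusive upper bound
    return True


january = (date(2024, 1, 1), date(2024, 2, 1))
print(in_date_range("2024-01-31T23:59:59Z", *january))  # True
print(in_date_range("2024-02-01T00:00:00Z", *january))  # False
```

Because the stop date is exclusive, consecutive ranges such as January and February can share a boundary date without ingesting any email twice.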

Pagination

Process emails in batches:
# Ingest first 20 emails
python tools/ingest_email.py -d emails.db inbox_dump/ --limit 20

# Skip first 100, process next 50
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --offset 100 \
    --limit 50
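
To work through a large dump in fixed-size batches, the --offset and --limit pairs can be generated rather than typed by hand. A minimal sketch follows; the helper batch_windows is hypothetical, and the total of 120 files is an example value.

```python
from typing import Iterator, Tuple


def batch_windows(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that cover `total` items without overlap."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, total, batch_size):
        # The last window may be shorter than batch_size.
        yield offset, min(batch_size, total - offset)


for offset, limit in batch_windows(total=120, batch_size=50):
    print(
        "python tools/ingest_email.py -d emails.db inbox_dump/ "
        f"--offset {offset} --limit {limit}"
    )
```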

Filter Pipeline

The ingestion tool applies filters in this order:
1. Offset/Limit Slicing: Slice the input file list as files[offset:offset+limit]
2. Already-Ingested Check: Skip emails that were previously ingested
3. Date Range Filter: Filter by --start-date and --stop-date
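
The order of the three filters matters, and a small model of it makes the consequence visible. The sketch below is not the tool's source: select_for_ingestion, the dates mapping, and the already_ingested set are stand-ins invented for illustration.

```python
from datetime import date
from typing import Dict, List, Optional, Set


def select_for_ingestion(
    files: List[str],
    dates: Dict[str, date],
    already_ingested: Set[str],
    offset: int = 0,
    limit: Optional[int] = None,
    start: Optional[date] = None,
    stop: Optional[date] = None,
) -> List[str]:
    """Apply the three filters in the documented order."""
    # 1. Offset/limit slicing happens first, on the raw file list.
    end = None if limit is None else offset + limit
    candidates = files[offset:end]
    # 2. Drop files that were ingested on an earlier run.
    candidates = [f for f in candidates if f not in already_ingested]
    # 3. Keep only files whose date lies in [start, stop).
    selected = []
    for name in candidates:
        day = dates[name]
        if start is not None and day < start:
            continue
        if stop is not None and day >= stop:
            continue
        selected.append(name)
    return selected


files = ["a.eml", "b.eml", "c.eml", "d.eml"]
dates = {
    "a.eml": date(2024, 1, 5),
    "b.eml": date(2024, 1, 10),
    "c.eml": date(2024, 1, 20),
    "d.eml": date(2024, 2, 3),
}
picked = select_for_ingestion(
    files, dates, already_ingested={"b.eml"},
    offset=1, limit=3, start=date(2024, 1, 1), stop=date(2024, 2, 1),
)
print(picked)  # ['c.eml']
```

Since slicing comes first, a run with --limit 3 can ingest fewer than three emails: the later filters only remove items from the slice, they never pull in replacements.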

Programmatic Email Ingestion

Create a custom ingestion pipeline:
import asyncio
from pathlib import Path
from dotenv import load_dotenv

from typeagent.emails.email_import import import_email_from_file
from typeagent.emails.email_memory import EmailMemory
from typeagent.emails.email_message import EmailMessage
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.storage.utils import create_storage_provider

load_dotenv()

async def ingest_emails():
    # Create settings
    settings = ConversationSettings()
    
    # Create storage provider
    settings.storage_provider = await create_storage_provider(
        settings.message_text_index_settings,
        settings.related_term_index_settings,
        "emails.db",
        EmailMessage
    )
    
    # Create email memory
    email_memory = await EmailMemory.create(settings)
    
    # Process email files
    email_dir = Path("inbox_dump")
    for email_file in email_dir.glob("*.eml"):
        source_id = str(email_file)
        
        # Skip if already ingested
        if await settings.storage_provider.is_source_ingested(source_id):
            print(f"Skipping {email_file.name} (already ingested)")
            continue
        
        try:
            # Import and ingest
            email = import_email_from_file(str(email_file))
            await email_memory.add_messages_with_indexing(
                [email],
                source_ids=[source_id]
            )
            print(f"Ingested {email_file.name}")
        
        except Exception as e:
            print(f"Failed to ingest {email_file.name}: {e}")
            # Mark as failed
            async with settings.storage_provider:
                await settings.storage_provider.mark_source_ingested(
                    source_id,
                    status=e.__class__.__name__
                )

if __name__ == "__main__":
    asyncio.run(ingest_emails())

Downloading Emails

TypeAgent includes tools for downloading emails from various sources. To download emails using the Gmail API:
# Download 50 most recent emails (default)
cd tools/mail
python gmail_dump.py

# Download 200 emails
python gmail_dump.py --max-results 200

# Output to specific directory
python gmail_dump.py --output-dir ~/gmail_export

Gmail API Setup

1. Create Google Cloud App
  • Go to Google Cloud Console
  • Create a new project
  • Enable the Gmail API
2. Create OAuth Client
  • Navigate to “Credentials” in sidebar
  • Click “+ Create Credentials”
  • Select “OAuth client ID”
  • Choose “Desktop app”
  • Download JSON credentials
3. Configure Tool
  • Save credentials as tools/mail/client_secret.json
  • Run gmail_dump.py
  • Complete OAuth flow in browser
  • Token saved to tools/mail/token.json

The Gmail API token expires after about a week. Delete token.json to trigger re-authentication.

Email Features

Reply Detection

TypeAgent automatically detects and extracts only the latest response from email threads:
    from typeagent.emails.email_import import is_reply, get_last_response_in_thread
    
    # Check if email is a reply
    if is_reply(email_message):
        # Extract only the new content
        body = get_last_response_in_thread(body_text)
    

Forward Detection

    from typeagent.emails.email_import import is_forwarded, get_forwarded_email_parts
    
    # Check if email is forwarded
    if is_forwarded(email_message):
        # Split into parts
        parts = get_forwarded_email_parts(email_text)
    

Encoding Handling

TypeAgent decodes RFC 2047 encoded words in email headers:
    from typeagent.emails.email_import import decode_encoded_words
    
    # Decode encoded headers
    subject = decode_encoded_words("=?UTF-8?B?SGVsbG8gV29ybGQ=?=")
    print(subject)  # "Hello World"
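
An RFC 2047 encoded word has the shape =?charset?encoding?text?=, where the encoding is B (Base64) or Q (quoted-printable-like). The standard library can decode the same input, which is handy for checking results; this sketch uses email.header and is independent of TypeAgent's decode_encoded_words. The helper name decode_rfc2047 is made up.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value: str) -> str:
    """Decode RFC 2047 encoded words in a header value using the stdlib."""
    return str(make_header(decode_header(value)))


print(decode_rfc2047("=?UTF-8?B?SGVsbG8gV29ybGQ=?="))  # Hello World
print(decode_rfc2047("=?ISO-8859-1?Q?Caf=E9?="))       # Café
```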
    

Querying Emails

Once ingested, query emails using natural language:
    # Interactive query
    python tools/query.py --database emails.db
    
    # Single query
    python tools/query.py --database emails.db \
        --query "What emails did Alice send about the project?"
    

Email Query Examples

    import asyncio

    from typeagent import create_conversation
    from typeagent.emails.email_message import EmailMessage


    async def main():
        # `await` is only valid inside a coroutine, so the queries live in main()
        conversation = await create_conversation("emails.db", EmailMessage)

        # Who questions
        answer = await conversation.query("Who sent emails about the meeting?")
        answer = await conversation.query("Who did Alice email yesterday?")

        # What questions
        answer = await conversation.query("What was discussed in the project emails?")
        answer = await conversation.query("What action items were mentioned?")

        # When questions
        answer = await conversation.query("When was the deadline mentioned?")
        answer = await conversation.query("What emails were sent last week?")

        # Topic searches
        answer = await conversation.query("Find emails about budget approval")
        answer = await conversation.query("Show me emails related to deployment")


    asyncio.run(main())
    

Knowledge Extraction from Emails

Emails are automatically enriched with semantic knowledge:
    # EmailMessage.metadata.get_knowledge() extracts:
    # - Entities: People (sender, recipients), email addresses
    # - Actions: "sent email", "received email"  
    # - Topics: Subject line
    # - Relationships: sender -> recipient connections
    
    knowledge = email.metadata.get_knowledge()
    print(f"Entities: {len(knowledge.entities)}")
    print(f"Actions: {len(knowledge.actions)}")
    print(f"Topics: {knowledge.topics}")
    

Entity Extraction

Email addresses are parsed into entities:
    # "Alice Smith <alice@example.com>" becomes:
    # - Entity: "Alice Smith" (type: person)
    #   - Facet: email_address = alice@example.com
    # - Entity: "alice@example.com" (type: email_address, alias)
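
The mapping from one address string to two entities can be sketched with the standard library's parseaddr. The dictionary layout below is a simplified stand-in, not TypeAgent's entity classes, and address_to_entities is a hypothetical helper.

```python
from email.utils import parseaddr
from typing import Dict, List


def address_to_entities(raw_address: str) -> List[Dict[str, object]]:
    """Turn 'Name <addr>' into a person entity plus an address entity."""
    display_name, address = parseaddr(raw_address)
    entities: List[Dict[str, object]] = []
    if display_name:
        # The person entity carries the address as a facet.
        entities.append(
            {"name": display_name, "type": "person",
             "facets": {"email_address": address}}
        )
    if address:
        # The bare address is kept as its own entity (an alias).
        entities.append({"name": address, "type": "email_address", "facets": {}})
    return entities


for entity in address_to_entities("Alice Smith <alice@example.com>"):
    print(entity)
```

When a header contains only a bare address, this sketch yields just the address entity, since there is no display name to attach a person to.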
    

Action Extraction

Email actions capture communication:
    # For email from alice@example.com to bob@example.com:
    # - Action: "Alice Smith" sent email to "Bob Jones"
    # - Action: "alice@example.com" sent email to "bob@example.com"
    # - Action: "Bob Jones" received email from "Alice Smith"
    # - Action: "bob@example.com" received email from "alice@example.com"
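
The four actions listed above follow a simple pattern: each sender identity is paired with the matching recipient identity (name with name, address with address), once as "sent" and once as "received". A sketch of that pairing, using plain tuples rather than TypeAgent's action types (email_actions is a hypothetical name):

```python
from typing import List, Sequence, Tuple

Action = Tuple[str, str, str]  # (subject, verb, object)


def email_actions(sender: Sequence[str], recipient: Sequence[str]) -> List[Action]:
    """Build sent/received actions from (display name, address) identities."""
    pairs = list(zip(sender, recipient))
    sent = [(s, "sent email to", r) for s, r in pairs]
    received = [(r, "received email from", s) for s, r in pairs]
    return sent + received


actions = email_actions(
    sender=("Alice Smith", "alice@example.com"),
    recipient=("Bob Jones", "bob@example.com"),
)
for subject, verb, obj in actions:
    print(f'"{subject}" {verb} "{obj}"')
```

Recording both directions means a question phrased from either side ("what did Alice send" or "what did Bob receive") can match an action directly.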
    

Performance Tuning

Email ingestion can take 1-2 seconds per message due to LLM-based knowledge extraction.
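
A quick back-of-the-envelope estimate helps when planning a large import. The sketch assumes the 1-2 seconds per message quoted above and strictly sequential processing; actual throughput depends on the model, network, and concurrency settings. The function name is hypothetical.

```python
from typing import Tuple


def estimate_ingestion_minutes(
    message_count: int,
    seconds_per_message: Tuple[float, float] = (1.0, 2.0),
) -> Tuple[float, float]:
    """Return a (low, high) estimate in minutes for sequential ingestion."""
    low, high = seconds_per_message
    return message_count * low / 60, message_count * high / 60


low, high = estimate_ingestion_minutes(1000)
print(f"1000 emails: roughly {low:.0f}-{high:.0f} minutes")  # roughly 17-33 minutes
```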

Batch Size Configuration

    from typeagent.knowpro.convsettings import ConversationSettings
    
    settings = ConversationSettings()
    
    # Adjust concurrent extraction (default: 4)
    settings.semantic_ref_index_settings.batch_size = 4
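
To illustrate what a batch size of 4 means for concurrency, the sketch below runs a stand-in worker over items in groups of batch_size using asyncio.gather. It is a generic pattern, not TypeAgent's scheduler; process_in_batches and fake_extract are invented for the example, and the sleep stands in for an LLM call.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def process_in_batches(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    batch_size: int = 4,
) -> List[R]:
    """Run `worker` on items, at most `batch_size` at a time, keeping order."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Items within one batch run concurrently; batches run one after another.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_extract(text: str) -> int:
    await asyncio.sleep(0.01)  # placeholder for a slow extraction call
    return len(text)


print(asyncio.run(process_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_extract)))
```

A larger batch size shortens wall-clock time as long as the model endpoint accepts the extra parallel requests; rate limits are the usual reason to keep it small.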
    

Progress Monitoring

    import time
    
    start_time = time.time()
    success_count = 0
    batch_size = 4
    
    for i, email in enumerate(emails):
        await email_memory.add_messages_with_indexing([email])
        success_count += 1
        
        # Print progress periodically
        if (success_count % batch_size) == 0:
            elapsed = time.time() - start_time
            semref_count = await semref_collection.size()
            print(f"{success_count} imported | "
                  f"{semref_count} semrefs | "
                  f"{elapsed:.1f}s elapsed")
    

Error Handling

Handle common email ingestion errors:
    import traceback
    import openai
    
    success_count = 0
    failed_count = 0
    skipped_count = 0
    
    for source_id, email_file in email_files:
        try:
            email = import_email_from_file(str(email_file))
            
            # Apply date filter
            if not email_matches_date_filter(
                email.timestamp,
                start_date,
                stop_date
            ):
                skipped_count += 1
                continue
            
            await email_memory.add_messages_with_indexing(
                [email],
                source_ids=[source_id]
            )
            success_count += 1
        
        except openai.AuthenticationError as e:
            print(f"Authentication error: {e}")
            break  # Fatal error
        
        except Exception as e:
            failed_count += 1
            print(f"Error processing {source_id}: {e}")
            
            # Mark as failed
            async with storage_provider:
                await storage_provider.mark_source_ingested(
                    source_id,
                    status=e.__class__.__name__
                )
            
            if verbose:
                traceback.print_exc()
    
    print(f"\nSuccessfully imported {success_count} emails")
    print(f"Skipped {skipped_count} emails (date filter)")
    print(f"Failed to import {failed_count} emails")
    

Example: Complete Email Pipeline

Here’s a complete example from download to query:
    #!/bin/bash
    # complete_email_pipeline.sh
    
    set -e  # Exit on error
    
    # 1. Download emails from Gmail
    echo "Downloading emails..."
    cd tools/mail
    python gmail_dump.py --max-results 100 --output-dir ../../email_dump
    cd ../..
    
    # 2. Ingest emails into database
    echo "Ingesting emails..."
    python tools/ingest_email.py \
        -d emails.db \
        email_dump/ \
        --start-date 2024-01-01 \
        --verbose
    
    # 3. Query the database
    echo "Database ready for queries!"
    echo "Run: python tools/query.py --database emails.db"
    
