TypeAgent provides specialized support for ingesting and querying podcast transcripts, including WebVTT files with speaker annotations.

Podcast Workflow

Working with podcasts follows a simple two-step workflow:
1. Ingest Transcript: Parse and index the podcast transcript into TypeAgent.
2. Query Content: Ask questions about what was discussed in the podcast.

Podcast Message Format

Podcasts use conversation messages with speaker and recipient metadata:
from typeagent.knowpro.universal_message import ConversationMessage, ConversationMessageMeta

message = ConversationMessage(
    text_chunks=["Welcome to our podcast about AI and science fiction."],
    metadata=ConversationMessageMeta(
        speaker="host",
        recipients=["guest", "audience"]
    ),
    timestamp="1970-01-01T00:00:00Z"  # Relative timestamp
)
TypeAgent uses Unix epoch (1970-01-01) as the base timestamp for podcasts when the actual date is unknown, preserving relative timing.
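
To make the idea of relative timestamps concrete, here is a minimal standard-library sketch of how an offset into an undated recording could be expressed as an epoch-based timestamp. The helper name `relative_timestamp` is hypothetical and is not part of TypeAgent; the output format is assumed to match the ISO 8601 string shown in the message example above.

```python
from datetime import datetime, timedelta, timezone

# Base used when the real recording date is unknown.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def relative_timestamp(offset_seconds: float) -> str:
    """Return an ISO 8601 UTC string for a point offset_seconds after the epoch.

    Hypothetical helper for illustration; fractional seconds are dropped.
    """
    moment = EPOCH + timedelta(seconds=offset_seconds)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(relative_timestamp(0))     # 1970-01-01T00:00:00Z
print(relative_timestamp(90.0))  # 1970-01-01T00:01:30Z
```

Because every message is measured from the same base, the gap between two timestamps still reflects how far apart the turns were in the recording, even though the calendar date is a placeholder.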

Ingesting Plain Text Transcripts

For simple speaker-prefixed transcripts:
import asyncio
from dotenv import load_dotenv

from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.podcasts.podcast_ingest import ingest_podcast

load_dotenv()

async def main():
    settings = ConversationSettings()
    
    podcast = await ingest_podcast(
        transcript_file_path="transcript.txt",
        settings=settings,
        podcast_name="AI Discussion",
        length_minutes=60.0,  # Total podcast length
        dbname="podcast.db",
        verbose=True
    )
    
    print(f"Ingested {await podcast.messages.size()} messages")

if __name__ == "__main__":
    asyncio.run(main())

Transcript Format

Each turn in a plain text transcript should start on its own line, in the form SPEAKER: text:
HOST: Welcome to the AI podcast.
GUEST: Thanks for having me.
HOST: Let's talk about machine learning.
GUEST: Machine learning is fascinating because...
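
As a rough illustration of what this format implies for a parser, the sketch below splits a transcript into (speaker, text) pairs with a regular expression. It is not TypeAgent's parser: the function name `parse_turns` is hypothetical, and the handling of continuation lines (lines without a speaker prefix are appended to the previous turn) is an assumption.

```python
import re

# A turn is a speaker label, a colon, then the spoken text.
TURN_RE = re.compile(r"^\s*(?P<speaker>[^:\n]+?)\s*:\s*(?P<text>.+?)\s*$")


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a speaker-prefixed transcript into (speaker, text) pairs."""
    turns: list[tuple[str, str]] = []
    for line in transcript.splitlines():
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group("speaker"), match.group("text")))
        elif turns and line.strip():
            # Assumption: a line without a prefix continues the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line.strip()}")
    return turns


sample = "HOST: Welcome to the AI podcast.\nGUEST: Thanks for having me."
for speaker, text in parse_turns(sample):
    print(speaker, "->", text)
```

One limitation of this simple pattern is that a continuation line containing a colon would be read as a new turn, so a real parser likely needs stricter rules for what counts as a speaker label.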

Timestamp Assignment

When a transcript has no timing information, TypeAgent spreads timestamps across the podcast's duration in proportion to each message's text length:
# Timestamps are calculated based on:
# - Total podcast length (length_minutes)
# - Relative text length of each message
# - Base date (default: Unix epoch)

podcast = await ingest_podcast(
    transcript_file_path="transcript.txt",
    settings=settings,
    start_date=None,  # Uses Unix epoch
    length_minutes=60.0
)
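
The exact formula TypeAgent uses is not shown here, so the following is only one plausible reading of "proportional to text length": each message starts at the fraction of the total duration given by the number of characters that precede it. The function name `assign_timestamps` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def assign_timestamps(texts, length_minutes, start=None):
    """Return one start time per message, spaced by cumulative text length.

    Illustrative sketch only; not TypeAgent's implementation.
    """
    start = start or datetime(1970, 1, 1, tzinfo=timezone.utc)
    total_chars = sum(len(text) for text in texts)
    total_seconds = length_minutes * 60.0
    stamps = []
    elapsed_chars = 0
    for text in texts:
        # Fraction of the transcript spoken before this message begins.
        fraction = elapsed_chars / total_chars if total_chars else 0.0
        stamps.append(start + timedelta(seconds=fraction * total_seconds))
        elapsed_chars += len(text)
    return stamps


# 8 characters over 8 minutes: one character per minute.
for stamp in assign_timestamps(["aaaa", "bb", "cc"], length_minutes=8):
    print(stamp.isoformat())
```

With this scheme a long monologue pushes later messages further along the timeline, which approximates real pacing without any audio timing data.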

Ingesting WebVTT Transcripts

For WebVTT files with timing and speaker annotations:
# Basic VTT ingestion
python tools/ingest_vtt.py transcript.vtt -d podcast.db

# With custom name
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --name "Episode 53: Adrian Tchaikovsky"

# Merge consecutive speaker segments
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --merge

# With custom batch size
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --batchsize 10

# Verbose output
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --verbose

WebVTT Format

TypeAgent supports WebVTT files with voice tags:
WEBVTT

00:00:00.000 --> 00:00:05.000
<v Host>Welcome to Behind the Tech.

00:00:05.000 --> 00:00:12.000
<v Kevin>I'm Kevin Scott, CTO of Microsoft.

00:00:12.000 --> 00:00:18.000
<v Kevin>Today we're talking with Adrian Tchaikovsky.

00:00:18.000 --> 00:00:25.000
<v Adrian>Thanks for having me on the show.
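
For readers unfamiliar with the cue timing lines in the sample above, this small standard-library sketch converts one into start and end offsets in seconds. It only handles the hh:mm:ss.ttt form used in the sample, and the name `parse_cue_timing` is hypothetical rather than a TypeAgent function.

```python
import re

# Matches "hh:mm:ss.ttt --> hh:mm:ss.ttt" as used in the sample above.
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(hours: str, minutes: str, seconds: str, millis: str) -> float:
    """Convert timestamp components into seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cue_timing(line: str):
    """Return (start, end) in seconds, or None if the line is not a timing line."""
    match = CUE_RE.search(line)
    if match is None:
        return None
    parts = match.groups()
    return to_seconds(*parts[:4]), to_seconds(*parts[4:])


print(parse_cue_timing("00:00:05.000 --> 00:00:12.000"))  # (5.0, 12.0)
```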

Voice Tag Parsing

TypeAgent parses WebVTT voice annotations:
from typeagent.transcripts.transcript_ingest import parse_voice_tags

# Parse voice-tagged text
text = "<v Host>Welcome to the show<v Guest>Thanks for having me"
segments = parse_voice_tags(text)

# Returns: [
#     ("host", "Welcome to the show"),
#     ("guest", "Thanks for having me")
# ]
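
To show roughly what this kind of parsing involves, here is a regex-based sketch that produces the same shape of output for the example above. It is not the implementation of `parse_voice_tags`; the name `split_voice_tags` is hypothetical, and lowercasing speaker names is assumed from the return values shown.

```python
import re

# "<v Speaker>" optionally with classes ("<v.loud Speaker>"), then text up to the next tag.
VOICE_RE = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>([^<]*)")


def split_voice_tags(text: str) -> list[tuple[str, str]]:
    """Split voice-tagged cue text into (speaker, spoken_text) pairs."""
    segments = []
    for speaker, spoken in VOICE_RE.findall(text):
        spoken = spoken.strip()
        if spoken:
            segments.append((speaker.strip().lower(), spoken))
    return segments


print(split_voice_tags("<v Host>Welcome to the show<v Guest>Thanks for having me"))
```

The pattern stops at the next angle bracket, so a closing `</v>` is ignored, but other inline tags such as `<b>` would cut a segment short; a production parser would need to handle those.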

Multiple VTT Files

Ingest multiple VTT files as a continuous conversation:
# Ingest multiple files with time continuity
python tools/ingest_vtt.py \
    episode1.vtt episode2.vtt episode3.vtt \
    -d combined.db \
    --name "Complete Series"
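
How `ingest_vtt.py` maintains time continuity is not described here. One simple way to think about it, sketched below under that caveat, is to shift each file's cue times by the total duration of the files before it. The function `offset_cues` and the (start, end, text) tuple layout are hypothetical.

```python
def offset_cues(files):
    """Merge per-file cue lists into one timeline.

    files: a list with one entry per VTT file, each a list of
    (start_seconds, end_seconds, text) tuples relative to that file.
    """
    combined = []
    offset = 0.0
    for cues in files:
        for start, end, text in cues:
            combined.append((start + offset, end + offset, text))
        if cues:
            # The next file begins where this one ended.
            offset += max(end for _, end, _ in cues)
    return combined


episode1 = [(0.0, 5.0, "Welcome."), (5.0, 12.0, "Let's begin.")]
episode2 = [(0.0, 4.0, "Welcome back.")]
print(offset_cues([episode1, episode2])[-1])  # (12.0, 16.0, 'Welcome back.')
```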

Programmatic Podcast Ingestion

Create custom podcast ingestion pipelines:
import asyncio
from datetime import datetime, timezone
from dotenv import load_dotenv

from typeagent.knowpro.convsettings import ConversationSettings  
from typeagent.podcasts.podcast_ingest import ingest_podcast

load_dotenv()

async def ingest_podcast_series():
    settings = ConversationSettings()
    
    # Configure knowledge extraction
    settings.semantic_ref_index_settings.auto_extract_knowledge = True
    settings.semantic_ref_index_settings.batch_size = 4
    
    # Ingest podcast
    podcast = await ingest_podcast(
        transcript_file_path="episode_53.txt",
        settings=settings,
        podcast_name="Episode 53: Adrian Tchaikovsky",
        start_date=datetime(2024, 1, 15, tzinfo=timezone.utc),
        length_minutes=45.0,
        dbname="episode_53.db",
        batch_size=10,  # Override batch size
        verbose=True
    )
    
    print(f"Podcast '{podcast.name_tag}' ingested successfully")
    print(f"Messages: {await podcast.messages.size()}")
    print(f"Semantic refs: {await podcast.semantic_refs.size()}")
    
    return podcast

if __name__ == "__main__":
    asyncio.run(ingest_podcast_series())

Podcast-Specific Features

Participant Alias Resolution

TypeAgent automatically builds aliases for participants:
# "Kevin Scott" is aliased to "Kevin"
# "Adrian Tchaikovsky" is aliased to "Adrian"

# Queries work with either form:
await podcast.query("What did Kevin say?")
await podcast.query("What did Kevin Scott say?")  # Same results
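
The alias rule illustrated above (full name to first name) can be sketched in a few lines. This is an illustration of the idea, not TypeAgent's code: `build_aliases` is a hypothetical name, and skipping first names shared by several participants is an assumption made to avoid ambiguous aliases.

```python
def build_aliases(participants):
    """Map each unambiguous first name to its full participant name (lowercased)."""
    by_first_name = {}
    for full_name in participants:
        parts = full_name.split()
        if len(parts) > 1:
            by_first_name.setdefault(parts[0].lower(), []).append(full_name.lower())
    # Assumption: drop first names that could refer to more than one person.
    return {
        first: names[0]
        for first, names in by_first_name.items()
        if len(names) == 1
    }


print(build_aliases(["Kevin Scott", "Adrian Tchaikovsky"]))
# {'kevin': 'kevin scott', 'adrian': 'adrian tchaikovsky'}
```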

Synonym Expansion

Podcasts include verb synonyms from podcastVerbs.json:
[
  {
    "term": "discuss",
    "relatedTerms": ["talk about", "mention", "bring up", "cover"]
  },
  {
    "term": "explain",
    "relatedTerms": ["describe", "clarify", "elaborate"]
  }
]
This enables queries like:
# All of these find similar results:
await podcast.query("What did they discuss about AI?")
await podcast.query("What did they talk about regarding AI?")
await podcast.query("What did they mention about AI?")
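
How TypeAgent applies the synonym file at query time is not shown here. As a sketch of the underlying idea, the code below turns entries in the JSON shape above into a lookup from each related phrase to its canonical term. The function name `build_lookup` is hypothetical.

```python
import json

# Same structure as the podcastVerbs.json excerpt above.
VERBS_JSON = """
[
  {"term": "discuss", "relatedTerms": ["talk about", "mention", "bring up", "cover"]},
  {"term": "explain", "relatedTerms": ["describe", "clarify", "elaborate"]}
]
"""


def build_lookup(entries):
    """Map every term and related term to its canonical term."""
    lookup = {}
    for entry in entries:
        lookup[entry["term"]] = entry["term"]
        for related in entry["relatedTerms"]:
            lookup[related] = entry["term"]
    return lookup


lookup = build_lookup(json.loads(VERBS_JSON))
print(lookup["talk about"])  # discuss
print(lookup["clarify"])     # explain
```

Normalizing "talk about" and "mention" to "discuss" is what lets differently worded questions reach the same indexed actions.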

Querying Podcasts

Query ingested podcasts using natural language:
# Interactive query mode
python tools/query.py --database podcast.db

# Single query
python tools/query.py --database podcast.db \
    --query "What did Kevin say to Adrian about science fiction?"

Podcast Query Examples

from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings

settings = ConversationSettings()
podcast = await Podcast.read_from_file(
    "tests/testdata/Episode_53_AdrianTchaikovsky_index",
    settings
)

# Who questions
answer = await podcast.query("Who is Adrian Tchaikovsky?")
answer = await podcast.query("Who spoke about AI ethics?")

# What questions
answer = await podcast.query("What did Kevin say about science fiction?")
answer = await podcast.query("What books were mentioned?")

# How questions  
answer = await podcast.query("How was Asimov mentioned?")
answer = await podcast.query("How did they describe the challenges?")

# Topic searches
answer = await podcast.query("What was discussed about AI ethics?")
answer = await podcast.query("Tell me about the robotics discussion")

Podcast Serialization

Save and load podcast data efficiently:

Saving Podcasts

from typeagent.podcasts.podcast import Podcast

# Save to files
await podcast.write_to_file("podcast_index")

# Creates two files:
# - podcast_index_data.json (metadata, messages, indexes)
# - podcast_index_embeddings.bin (embedding vectors)

Loading Podcasts

from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings

settings = ConversationSettings()

# Load from files
podcast = await Podcast.read_from_file(
    "podcast_index",  # Filename prefix
    settings
)

print(f"Loaded {await podcast.messages.size()} messages")
Embedding files are binary and specific to the embedding model used during ingestion.
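
Since loading depends on both files being present under the same prefix, a quick pre-flight check can save a confusing failure. The sketch below derives the two filenames from the prefix using the naming shown in the saving example; the helper names `index_files` and `missing_index_files` are hypothetical.

```python
from pathlib import Path


def index_files(prefix: str) -> tuple[Path, Path]:
    """Return the data and embedding paths for a given filename prefix."""
    return Path(f"{prefix}_data.json"), Path(f"{prefix}_embeddings.bin")


def missing_index_files(prefix: str) -> list[Path]:
    """Return whichever of the two index files does not exist on disk."""
    return [path for path in index_files(prefix) if not path.exists()]


missing = missing_index_files("podcast_index")
if missing:
    print("Missing:", ", ".join(str(path) for path in missing))
```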

Advanced Podcast Features

Resuming Interrupted Ingestion

# Resume from message 100 if ingestion was interrupted
podcast = await ingest_podcast(
    transcript_file_path="large_transcript.txt",
    settings=settings,
    dbname="podcast.db",
    start_message=100,  # Resume from this message
    batch_size=50,
    verbose=True
)
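
To clarify how `start_message` and `batch_size` relate, here is a sketch of batching that skips already-processed messages. It illustrates the general pattern only and is not how `ingest_podcast` is implemented; the generator name `batches` is hypothetical.

```python
def batches(messages, start_message=0, batch_size=50):
    """Yield (start_index, batch) pairs, skipping messages before start_message."""
    for begin in range(start_message, len(messages), batch_size):
        yield begin, messages[begin:begin + batch_size]


messages = [f"message {i}" for i in range(230)]
for begin, batch in batches(messages, start_message=100, batch_size=50):
    print(f"Processing messages {begin}..{begin + len(batch) - 1}")
```

If a run fails partway through, the index of the first message in the failed batch is the natural value to pass as `start_message` on the next attempt.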

Custom Timestamp Base

from datetime import datetime, timezone

# Use actual podcast date
podcast = await ingest_podcast(
    transcript_file_path="transcript.txt",
    settings=settings,
    start_date=datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc),
    length_minutes=60.0
)

Extracting Metadata

from typeagent.transcripts.transcript_ingest import (
    get_transcript_duration,
    get_transcript_speakers
)

# Analyze VTT file before ingestion
duration = get_transcript_duration("podcast.vtt")
speakers = get_transcript_speakers("podcast.vtt")

print(f"Duration: {duration:.2f} seconds")
print(f"Speakers: {speakers}")

Podcast Knowledge Extraction

Podcasts are enriched with semantic knowledge:
# Knowledge extracted includes:
# - Entities: Speaker names, mentioned people/organizations
# - Actions: Discussions, explanations, questions
# - Topics: Subjects covered
# - Relationships: Speaker interactions

result = await podcast.add_messages_with_indexing(messages)
print(f"Extracted {result.semrefs_added} semantic references")

Complete Podcast Example

Here’s a complete example from ingestion to query:
import asyncio
from dotenv import load_dotenv

from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.podcasts.podcast_ingest import ingest_podcast

load_dotenv()

async def main():
    # 1. Configure settings
    settings = ConversationSettings()
    settings.semantic_ref_index_settings.batch_size = 4
    
    # 2. Ingest podcast transcript
    print("Ingesting podcast...")
    podcast = await ingest_podcast(
        transcript_file_path="podcast_transcript.txt",
        settings=settings,
        podcast_name="Tech Talk Episode 1",
        length_minutes=45.0,
        dbname="tech_talk.db",
        verbose=True
    )
    
    # 3. Check ingestion results
    msg_count = await podcast.messages.size()
    ref_count = await podcast.semantic_refs.size()
    print(f"\nIngested {msg_count} messages")
    print(f"Extracted {ref_count} semantic references")
    
    # 4. Query the podcast
    print("\nQuerying podcast...")
    
    questions = [
        "Who were the speakers?",
        "What topics were discussed?",
        "What was said about AI?"
    ]
    
    for question in questions:
        print(f"\nQ: {question}")
        answer = await podcast.query(question)
        print(f"A: {answer}")
    
    # 5. Interactive mode
    print("\n" + "="*50)
    print("Entering interactive mode (type 'q' to exit)")
    print("="*50)
    
    while True:
        try:
            question = input("\ntypeagent> ")
            if question.strip().lower() in ('q', 'quit', 'exit'):
                break
            if not question.strip():
                continue
            
            answer = await podcast.query(question)
            print(answer)
        
        except (EOFError, KeyboardInterrupt):
            break
    
    print("\nGoodbye!")

if __name__ == "__main__":
    asyncio.run(main())

Performance Tips

1. Optimize Batch Size

Adjust based on podcast length:
# Short podcasts (< 30 min): smaller batches
settings.semantic_ref_index_settings.batch_size = 4

# Long podcasts (> 60 min): larger batches
settings.semantic_ref_index_settings.batch_size = 10

2. Monitor Progress

Track ingestion with verbose mode:
podcast = await ingest_podcast(
    transcript_file_path="long_transcript.txt",
    settings=settings,
    verbose=True  # Shows progress updates
)

3. Resume Long Ingestions

For very long transcripts, ingest in stages:
# First run ("..." stands for the other arguments shown earlier)
podcast = await ingest_podcast(..., start_message=0, batch_size=100)

# Resume if interrupted (here, from message 100)
podcast = await ingest_podcast(..., start_message=100, batch_size=100)

Troubleshooting

VTT Parsing Errors

import webvtt

# Validate VTT file before ingestion
try:
    vtt = webvtt.read("podcast.vtt")
    print(f"Valid VTT with {len(vtt)} captions")
except Exception as e:
    print(f"Invalid VTT file: {e}")

Speaker Name Normalization

Speaker names are normalized to lowercase:
# In transcript: "KEVIN:", "Kevin:", "kevin:" all become "kevin"
# Queries work case-insensitively

Missing Timestamps

For transcripts without timing:
# TypeAgent assigns proportional timestamps
# based on text length and total podcast duration
podcast = await ingest_podcast(
    transcript_file_path="no_timestamps.txt",
    settings=settings,
    length_minutes=60.0  # Distribute across 60 minutes
)
