Generation Pipeline

Pipeline Overview

The video generation pipeline consists of 11 stages that transform a text prompt into a complete video presentation with synchronized audio and visuals.

Pipeline Stages

Stage 1: Initialization (0-10%)

Duration: ~1 second Location: backend/app.py:223-237 Tasks:

Receive POST request with topic, num_slides, language, tone
Sanitize topic for file naming
Create generation ID
Initialize progress tracking

Code:

topic_clean = topic[:30].replace(' ', '_').replace(':', '').replace('/', '_')
topic_clean = topic_clean.replace('"', '').replace("'", '').replace('?', '').replace('!', '')

generation_id = topic_clean
update_progress(generation_id, 0, "started", "🚀 Starting generation...")

Output: None (metadata only)

Stage 2: Content Generation (10-20%)

Duration: 10-30 seconds Location: backend/app.py:240-243 Generator: ContentGenerator (content_generator.py) Process:

Build Prompt

Create detailed prompt with requirements for slide structure, mutual exclusivity rules, and content guidelines

Call Gemini API

Send prompt to Gemini with response_mime_type="application/json"

Parse Response

Clean JSON markers (json, ) and parse response

Validate Structure

Check all required fields present
Enforce mutual exclusivity (animation XOR image)
Add missing fields with defaults

Save Content

Write to outputs/slides/{topic}_content.json

Output Structure:

{
  "topic": "Newton's Third Law of Motion",
  "total_slides": 5,
  "slides": [
    {
      "slide_number": 1,
      "title": "Newton's Third Law",
      "content_text": "For every action, there is an equal and opposite reaction",
      "needs_image": false,
      "image_keyword": "",
      "needs_animation": true,
      "animation_description": "Show rocket with force vectors",
      "duration": 6.0
    }
  ]
}

Error Handling:

try:
    content_data = content_gen.generate_content(topic, num_slides)
except Exception as e:
    print(f"Content generation error: {e}")
    traceback.print_exc()
    raise

Stage 3: Script Generation (20-30%)

Duration: 10-20 seconds Location: backend/app.py:246-249 Generator: ScriptGenerator (script_generator.py) Process:

Prepare Slide Info

Extract title, content, duration, and visual flags for each slide

Build Context Prompt

Include language, tone instructions, and special handling for animations

Generate Scripts

Call Gemini to create natural narration text for each slide

Estimate Timestamps

Calculate cumulative start/end times based on slide durations (will be corrected later)

Special Cases:

Animation Slides

Narration includes visual descriptions:

“As you can see on screen…”
“Watch as the rocket…”
“Notice how the force vectors…”

Image Slides

Natural image references:

“Looking at this image…”
“This diagram shows…”

Text Slides

Pure conceptual explanation without visual references

Output Structure:

{
  "topic": "Newton's Third Law of Motion",
  "total_duration": 30.0,
  "language": "english",
  "slide_scripts": [
    {
      "slide_number": 1,
      "start_time": 0.0,
      "end_time": 6.0,
      "narration_text": "Today we'll explore Newton's Third Law, which states that for every action, there is an equal and opposite reaction."
    }
  ]
}

Stage 4: Audio Generation (30-48%)

Duration: 30-60 seconds (depends on slide count) Location: backend/app.py:252-303 Generator: VoiceGenerator (voice_generator.py) Process:

Generate Per-Slide Audio

Loop through each slide script:

for idx, slide_script in enumerate(script_data['slide_scripts'], 1):
    audio_path = voice_gen.generate_voice_for_slide(
        slide_script['narration_text'],
        slide_num,
        topic,
        language
    )
    slide_audio_paths[slide_num] = audio_path

Progress: 30% + (idx/total * 15%)

Measure Actual Durations

from moviepy import AudioFileClip
audio_clip = AudioFileClip(audio_path)
actual_durations[slide_num] = audio_clip.duration
audio_clip.close()

Update Timestamps

Recalculate slide start/end times based on actual audio:

current_time = 0
for slide_script in script_data['slide_scripts']:
    actual_duration = actual_durations[slide_num]
    slide_script['start_time'] = current_time
    slide_script['end_time'] = current_time + actual_duration
    current_time += actual_duration

API Call (voice_generator.py):

response = requests.post(
    Config.SARVAM_TTS_URL,
    headers={"API-Subscription-Key": Config.SARVAM_API_KEY},
    json={
        "text": narration_text,
        "language_code": language_code,
        "model": Config.SARVAM_MODEL,
        "speaker": "meera"  # or other voices
    }
)
audio_data = response.json()["audios"][0]

Output:

Individual files: outputs/audio/{topic}_slide_1.mp3, etc.
Durations stored in memory for timestamp correction

Stage 4.5: Audio Combining (48-49%)

Duration: 2-5 seconds Location: backend/app.py:301-303 Task: Concatenate all slide audio files into single track

audio_path = voice_gen.combine_slide_audios(slide_audio_paths, topic)
# Output: outputs/audio/{topic}_combined.mp3

Implementation (voice_generator.py):

from moviepy import concatenate_audioclips, AudioFileClip

audio_clips = [AudioFileClip(path) for path in slide_audio_paths.values()]
combined = concatenate_audioclips(audio_clips)
combined.write_audiofile(output_path, codec='mp3')

Stage 5: Visual Generation (50-80%)

Duration: 1-3 minutes (varies by visual complexity) Location: backend/app.py:306-433 Generators: ManimGenerator, ImageFetcher, SlideRenderer Process Loop:

for idx, slide in enumerate(content_data['slides'], 1):
    visual_progress = 50 + int((idx / total_slides) * 30)
    
    has_animation = slide.get('needs_animation', False)
    has_image = slide.get('needs_image', False)
    
    # Mutual exclusivity enforcement
    if has_animation and has_image:
        print(f"⚠️ ERROR: Slide {slide_num} has BOTH flags!")
        has_image = False  # Animation takes priority

Branch 1: Animation Slides (50-80%, portion):

Generate Manim Code

animation_code = manim_gen.generate_animation_code(slide, duration)
# Returns: Python code string

Save Code

code_path = manim_gen.save_animation_code(
    animation_code, slide_num, topic
)
# Saves: outputs/manim_code/{topic}_slide_{num}.py

Render Animation

video_path = video_renderer.render_manim_animation(
    code_path,
    f"{topic}_slide_{slide_num}"
)
# Executes: manim -qh code_path.py SceneName
# Output: outputs/manim_output/{scene}.mp4

Create Base Slide

base_slide = slide_renderer.create_slide_with_animation_placeholder(
    slide['title'],
    slide['content_text'],
    slide_num,
    topic
)
# Output: PNG with text on left, dark area on right

Store Composite Data

slide_paths[slide_num] = {
    'type': 'animation_composite',
    'base_slide': base_slide,
    'animation': video_path
}

Branch 2: Image Slides (50-80%, portion):

Fetch Image

image_path = image_fetcher.fetch_image(
    slide['image_keyword'],
    slide_num,
    topic
)
# Calls Unsplash API, downloads to outputs/images/

Composite with Text

slide_with_img = slide_renderer.create_slide_with_image(
    slide['title'],
    slide['content_text'],
    image_path,
    slide_num,
    topic
)
# Output: PNG with text on left, image on right

Store Path

slide_paths[slide_num] = slide_with_img

Branch 3: Text-Only Slides (50-80%, portion):

text_slide = slide_renderer.create_text_slide(
    slide['title'],
    slide['content_text'],
    slide_num,
    topic
)
slide_paths[slide_num] = text_slide
# Output: PNG with centered title and content

Progress Breakdown (app.py:435-438):

print(f"\n📊 Final visual breakdown:")
print(f"   Animations: {len(animation_paths)}")
print(f"   Images: {len(image_paths)}")
print(f"   Text-only: {total_slides - len(animation_paths) - len(image_paths)}")

Stage 6: Video Composition (85-95%)

Duration: 30-90 seconds Location: backend/app.py:441-451 Composer: VideoComposer (video_composer.py) Process:

Load Slide Clips

for slide in content_data['slides']:
    duration = slide_script['end_time'] - slide_script['start_time']
    
    if isinstance(slide_data, dict) and slide_data['type'] == 'animation_composite':
        slide_clip = composer.composite_animation_on_slide(
            slide_data['base_slide'],
            slide_data['animation'],
            duration
        )
    else:
        slide_clip = composer.create_slide_video(slide_data, duration)

Animation Compositing

For animation slides:

# Load base slide (PNG) and animation (MP4)
slide_clip = ImageClip(slide_image_path, duration=duration)
animation_clip = VideoFileClip(animation_video_path)

# Adjust animation duration
if animation_clip.duration < duration:
    # Loop animation
    num_loops = int(duration / animation_clip.duration) + 1
    animation_adjusted = concatenate_videoclips([animation_clip] * num_loops)
    animation_adjusted = animation_adjusted.subclipped(0, duration)

# Resize and position
animation_final = animation_adjusted.resized(new_size=(850, 700))
animation_final = animation_final.with_position((1010, 250))

# Composite
composite = CompositeVideoClip(
    [slide_clip, animation_final],
    size=(1920, 1080)
)

Concatenate Slides

final_video = concatenate_videoclips(slide_clips, method="compose")

Add Audio

audio = AudioFileClip(audio_path)
final_video = final_video.with_audio(audio)

Render Final MP4

final_video.write_videofile(
    str(output_path),
    fps=30,
    codec='libx264',
    audio_codec='aac',
    preset='medium',
    bitrate='5000k',
    audio_bitrate='192k'
)
# Output: outputs/final/{topic}_final.mp4

Timing Validation (video_composer.py:289-290):

if abs(final_video.duration - audio.duration) > 0.5:
    print(f"⚠️ Warning: Video ({final_video.duration:.1f}s) doesn't match audio ({audio.duration:.1f}s)")

Stage 7: Completion (100%)

Duration: Instant Location: backend/app.py:453-470 Tasks:

Extract Filename

video_filename = Path(final_video_path).name
# e.g., "Newtons_Third_Law_final.mp4"

Update Progress

update_progress(generation_id, 100, "completed", "✅ Video generation complete!")

Return Response

return GenerateResponse(
    status="success",
    message="Presentation video generated successfully",
    content_data=content_data,
    script_data=script_data,
    video_path=final_video_path,
    video_filename=video_filename
)

Frontend Transition:

if (response.data.status === "success") {
  const generatedData = {
    content: response.data.content_data,
    script: response.data.script_data,
    videoPath: response.data.video_path,
    videoFilename: response.data.video_filename
  };
  onGenerationComplete(generatedData);
}

Error Handling & Fallbacks

Content Generation Errors

try:
    content_data = content_gen.generate_content(topic, num_slides)
except Exception as e:
    error_msg = f"Error: {str(e)}"
    print(f"Full error:\n{traceback.format_exc()}")
    update_progress(generation_id, 0, "error", f"❌ {error_msg}")
    raise HTTPException(status_code=500, detail=error_msg)

Common Issues:

Gemini API timeout → Retry with exponential backoff
Invalid JSON response → Clean and re-parse
Missing fields → Add defaults and continue

Audio Generation Errors

try:
    audio_path = voice_gen.generate_voice_for_slide(...)
except Exception as e:
    print(f"Error generating audio for slide {slide_num}: {e}")
    # Use estimated duration from content
    actual_durations[slide_num] = slide_script['end_time'] - slide_script['start_time']

Fallback: Continue without audio for that slide (silent)

Image Fetch Errors

try:
    image_path = image_fetcher.fetch_image(keyword, slide_num, topic)
    if not image_path:
        raise ValueError("Image fetch returned empty path")
except Exception as e:
    print(f"❌ Error fetching image for slide {slide_num}: {e}")
    # Fallback to text-only slide
    text_slide = slide_renderer.create_text_slide(...)
    slide_paths[slide_num] = text_slide

Animation Generation Errors

try:
    animation_code = manim_gen.generate_animation_code(slide, duration)
    video_path = video_renderer.render_manim_animation(code_path, scene_name)
except Exception as e:
    print(f"❌ Error generating animation for slide {slide_num}: {e}")
    traceback.print_exc()
    # Fallback to text-only slide
    text_slide = slide_renderer.create_text_slide(...)
    slide_paths[slide_num] = text_slide

Common Manim Errors:

Syntax error in generated code → Show error, fallback to text
Rendering timeout → Skip animation, use text slide
Missing dependencies → Warning in logs, text fallback

Progress Tracking

Progress Percentages

Stage	Start %	End %	Duration	Status ID
Initialization	0	10	1s	`started`
Content Generation	10	20	10-30s	`generating_content`
Script Generation	20	30	10-20s	`generating_scripts`
Audio Generation	30	48	30-60s	`generating_audio`
Audio Combining	48	49	2-5s	`combining_audio`
Visual Generation	50	80	60-180s	`generating_media`
- Animation	50-80	(portion)	varies	`generating_animation`
- Images	50-80	(portion)	varies	`fetching_image`
- Text Slides	50-80	(portion)	varies	`generating_slide`
Video Composition	85	95	30-90s	`composing_video`
Completion	95	100	instant	`completed`

Real-time Updates

Backend (app.py:52-61):

def update_progress(generation_id: str, progress: int, status: str, message: str):
    timestamp = datetime.now().strftime("%H:%M:%S")
    generation_status[generation_id] = {
        "status": status,
        "progress": progress,
        "message": message,
        "timestamp": timestamp
    }
    print(f"[{timestamp}] {message}")

Frontend (useSSEProgress.jsx):

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  setProgress(data.progress);
  setStatus(data.status);
  setMessage(data.message);
};

Performance Characteristics

Total Time by Slide Count

Times are approximate and vary based on:

Gemini API response time
Number of animations (slowest step)
Audio length
System performance

Slides	Text-Only	With Images	With Animations	Total
3	1.5 min	2 min	3.5 min	~2-3 min
5	2 min	2.5 min	4.5 min	~3-5 min
10	3 min	4 min	7 min	~5-8 min

Bottlenecks

Manim Rendering (30-60s per animation)
- Solution: Limit animations to 1-2 per presentation
- Alternative: Pre-render common animations
Gemini API Calls (10-30s per call)
- Solution: Use streaming responses (future)
- Cache: Store common content patterns
Video Composition (30-90s)
- Depends on: Total video length, number of clips
- Optimization: Use GPU acceleration if available

Next Steps

API Reference

Explore detailed API documentation

Troubleshooting

Common issues and solutions

Backend Architecture

Understand the backend structure

Frontend Architecture

Understand the frontend structure

Architecture

Contributing

Pipeline Overview

Pipeline Stages

Stage 1: Initialization (0-10%)

Stage 2: Content Generation (10-20%)

Stage 3: Script Generation (20-30%)

Stage 4: Audio Generation (30-48%)

Stage 4.5: Audio Combining (48-49%)

Stage 5: Visual Generation (50-80%)

Stage 6: Video Composition (85-95%)

Stage 7: Completion (100%)

Error Handling & Fallbacks

Content Generation Errors

Audio Generation Errors

Image Fetch Errors

Animation Generation Errors

Progress Tracking

Progress Percentages

Real-time Updates

Performance Characteristics

Total Time by Slide Count

Bottlenecks

Next Steps

API Reference

Troubleshooting

Backend Architecture

Frontend Architecture

Build docs developers (and LLMs) love

Architecture

Contributing

Documentation Index

​Pipeline Overview

​Pipeline Stages

​Stage 1: Initialization (0-10%)

​Stage 2: Content Generation (10-20%)

​Stage 3: Script Generation (20-30%)

​Stage 4: Audio Generation (30-48%)

​Stage 4.5: Audio Combining (48-49%)

​Stage 5: Visual Generation (50-80%)

​Stage 6: Video Composition (85-95%)

​Stage 7: Completion (100%)

​Error Handling & Fallbacks

​Content Generation Errors

​Audio Generation Errors

​Image Fetch Errors

​Animation Generation Errors

​Progress Tracking

​Progress Percentages

​Real-time Updates

​Performance Characteristics

​Total Time by Slide Count

​Bottlenecks

​Next Steps

API Reference

Troubleshooting

Backend Architecture

Frontend Architecture

Build docs developers (and LLMs) love

Pipeline Overview

Pipeline Stages

Stage 1: Initialization (0-10%)

Stage 2: Content Generation (10-20%)

Stage 3: Script Generation (20-30%)

Stage 4: Audio Generation (30-48%)

Stage 4.5: Audio Combining (48-49%)

Stage 5: Visual Generation (50-80%)

Stage 6: Video Composition (85-95%)

Stage 7: Completion (100%)

Error Handling & Fallbacks

Content Generation Errors

Audio Generation Errors

Image Fetch Errors

Animation Generation Errors

Progress Tracking

Progress Percentages

Real-time Updates

Performance Characteristics

Total Time by Slide Count

Bottlenecks

Next Steps