Pipeline Overview
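The video generation pipeline transforms a text prompt into a complete video presentation with synchronized audio and visuals. Progress is tracked across 11 steps: the eight stages described below, with the visual generation stage split into three branches (animation, image, and text-only slides).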
Pipeline Stages
Stage 1: Initialization (0-10%)
Duration : ~1 second
Location : backend/app.py:223-237
Tasks :
Receive POST request with topic, num_slides, language, tone
Sanitize topic for file naming
Create generation ID
Initialize progress tracking
Code :
topic_clean = topic[:30].replace(' ', '_').replace(':', '').replace('/', '_')
topic_clean = topic_clean.replace('"', '').replace("'", '').replace('?', '').replace('!', '')
generation_id = topic_clean
update_progress(generation_id, 0, "started", "🚀 Starting generation...")
Output : None (metadata only)
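The sanitization step above can be wrapped in a small helper for experimenting with different topics. This is a minimal sketch: the function name `sanitize_topic` and the `max_len` parameter are hypothetical and do not appear in backend/app.py; only the truncation length and the character rules come from the code above.

```python
def sanitize_topic(topic: str, max_len: int = 30) -> str:
    """Mirror the Stage 1 cleanup: truncate, then replace or drop unsafe characters."""
    cleaned = topic[:max_len]
    # Characters replaced with an underscore
    for ch in (" ", "/"):
        cleaned = cleaned.replace(ch, "_")
    # Characters removed outright
    for ch in (":", '"', "'", "?", "!"):
        cleaned = cleaned.replace(ch, "")
    return cleaned


print(sanitize_topic("Newton's Third Law of Motion"))  # Newtons_Third_Law_of_Motion
```

Because the generation ID is the sanitized topic itself, two topics that differ only in removed characters or beyond the first 30 characters map to the same ID, for example "What is AI?" and "What is AI!".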
Stage 2: Content Generation (10-20%)
Duration : 10-30 seconds
Location : backend/app.py:240-243
Generator : ContentGenerator (content_generator.py)
Process :
Build Prompt
Create detailed prompt with requirements for slide structure, mutual exclusivity rules, and content guidelines
Call Gemini API
Send prompt to Gemini with response_mime_type="application/json"
Parse Response
Strip Markdown code-fence markers (three backticks, optionally followed by json) and parse the response
Validate Structure
Check all required fields present
Enforce mutual exclusivity (animation XOR image)
Add missing fields with defaults
Save Content
Write to outputs/slides/{topic}_content.json
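The parse and validate steps above might look roughly like the following. This is a sketch under stated assumptions, not the code in content_generator.py: the helper names `parse_model_json` and `normalize_slide` and the default values in `SLIDE_DEFAULTS` are hypothetical. The field names and the rule that animation wins over image come from this page.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

# Hypothetical defaults; the real generator may use different values.
SLIDE_DEFAULTS = {
    "content_text": "",
    "needs_image": False,
    "image_keyword": "",
    "needs_animation": False,
    "animation_description": "",
    "duration": 6.0,
}


def parse_model_json(raw: str) -> dict:
    """Strip an optional surrounding code fence, then parse the JSON body."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (the fence plus an optional language tag)
        text = text.split("\n", 1)[1] if "\n" in text else ""
    text = text.rstrip()
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return json.loads(text)


def normalize_slide(slide: dict, number: int) -> dict:
    """Fill missing fields and enforce animation XOR image."""
    fixed = {**SLIDE_DEFAULTS, **slide}
    fixed.setdefault("slide_number", number)
    fixed.setdefault("title", f"Slide {number}")
    if fixed["needs_animation"] and fixed["needs_image"]:
        fixed["needs_image"] = False  # animation takes priority
        fixed["image_keyword"] = ""
    return fixed


raw = FENCE + 'json\n{"topic": "T", "slides": [{"title": "A", "needs_image": true, "needs_animation": true}]}\n' + FENCE
content = parse_model_json(raw)
slides = [normalize_slide(s, i) for i, s in enumerate(content["slides"], 1)]
```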
Output Structure :
{
"topic" : "Newton's Third Law of Motion" ,
"total_slides" : 5 ,
"slides" : [
{
"slide_number" : 1 ,
"title" : "Newton's Third Law" ,
"content_text" : "For every action, there is an equal and opposite reaction" ,
"needs_image" : false ,
"image_keyword" : "" ,
"needs_animation" : true ,
"animation_description" : "Show rocket with force vectors" ,
"duration" : 6.0
}
]
}
Error Handling :
try:
    content_data = content_gen.generate_content(topic, num_slides)
except Exception as e:
    print(f"Content generation error: {e}")
    traceback.print_exc()
    raise
Stage 3: Script Generation (20-30%)
Duration : 10-20 seconds
Location : backend/app.py:246-249
Generator : ScriptGenerator (script_generator.py)
Process :
Prepare Slide Info
Extract title, content, duration, and visual flags for each slide
Build Context Prompt
Include language, tone instructions, and special handling for animations
Generate Scripts
Call Gemini to create natural narration text for each slide
Estimate Timestamps
Calculate cumulative start/end times based on slide durations (will be corrected later)
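The timestamp estimate in the last step is a running sum over the planned slide durations. A minimal sketch, assuming each slide dict carries the `slide_number` and `duration` fields shown in the Stage 2 output; the function name `estimate_timestamps` is hypothetical:

```python
def estimate_timestamps(slides: list[dict]) -> list[dict]:
    """Assign cumulative start/end times from each slide's planned duration."""
    timeline = []
    current = 0.0
    for slide in slides:
        end = current + float(slide["duration"])
        timeline.append(
            {
                "slide_number": slide["slide_number"],
                "start_time": current,
                "end_time": end,
            }
        )
        current = end
    return timeline


timeline = estimate_timestamps(
    [
        {"slide_number": 1, "duration": 6.0},
        {"slide_number": 2, "duration": 4.5},
    ]
)
```

These values are placeholders only: Stage 4 replaces them with the measured audio durations.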
Special Cases :
Animation slides: narration includes visual descriptions:
“As you can see on screen…”
“Watch as the rocket…”
“Notice how the force vectors…”
Image slides: narration uses natural image references:
“Looking at this image…”
“This diagram shows…”
Text-only slides: pure conceptual explanation without visual references
Output Structure :
{
"topic" : "Newton's Third Law of Motion" ,
"total_duration" : 30.0 ,
"language" : "english" ,
"slide_scripts" : [
{
"slide_number" : 1 ,
"start_time" : 0.0 ,
"end_time" : 6.0 ,
"narration_text" : "Today we'll explore Newton's Third Law, which states that for every action, there is an equal and opposite reaction."
}
]
}
Stage 4: Audio Generation (30-48%)
Duration : 30-60 seconds (depends on slide count)
Location : backend/app.py:252-303
Generator : VoiceGenerator (voice_generator.py)
Process :
Generate Per-Slide Audio
Loop through each slide script:
for idx, slide_script in enumerate(script_data['slide_scripts'], 1):
    # Assumed lookup: the excerpt does not show where slide_num is set
    slide_num = slide_script['slide_number']
    audio_path = voice_gen.generate_voice_for_slide(
        slide_script['narration_text'],
        slide_num,
        topic,
        language
    )
    slide_audio_paths[slide_num] = audio_path
Progress: 30% + (idx/total * 15%)
Measure Actual Durations
from moviepy import AudioFileClip
audio_clip = AudioFileClip(audio_path)
actual_durations[slide_num] = audio_clip.duration
audio_clip.close()
Update Timestamps
Recalculate slide start/end times based on actual audio:
current_time = 0
for slide_script in script_data['slide_scripts']:
    # Assumed lookup: the excerpt does not show where slide_num is set
    slide_num = slide_script['slide_number']
    actual_duration = actual_durations[slide_num]
    slide_script['start_time'] = current_time
    slide_script['end_time'] = current_time + actual_duration
    current_time += actual_duration
API Call (voice_generator.py):
response = requests.post(
    Config.SARVAM_TTS_URL,
    headers={"API-Subscription-Key": Config.SARVAM_API_KEY},
    json={
        "text": narration_text,
        "language_code": language_code,
        "model": Config.SARVAM_MODEL,
        "speaker": "meera"  # or other voices
    }
)
audio_data = response.json()["audios"][0]
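What happens to `audio_data` after the call is not shown on this page. The sketch below assumes that each entry in `audios` is a base64-encoded audio string, which should be checked against the Sarvam API documentation. The helper name `save_first_audio` is hypothetical.

```python
import base64
from pathlib import Path


def save_first_audio(payload: dict, output_path: Path) -> Path:
    """Decode the first audio entry of a TTS response and write it to disk.

    Assumes the entry is base64-encoded audio (an assumption, not confirmed here).
    """
    audios = payload.get("audios") or []
    if not audios:
        raise ValueError("TTS response contained no audio")
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_bytes(base64.b64decode(audios[0]))
    return output_path
```

In the real request flow it may also be worth calling `response.raise_for_status()` before `response.json()`, so that HTTP errors surface before the payload is parsed.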
Output :
Individual files: outputs/audio/{topic}_slide_1.mp3, etc.
Durations stored in memory for timestamp correction
Stage 4.5: Audio Combining (48-49%)
Duration : 2-5 seconds
Location : backend/app.py:301-303
Task : Concatenate all slide audio files into a single track
audio_path = voice_gen.combine_slide_audios(slide_audio_paths, topic)
# Output: outputs/audio/{topic}_combined.mp3
Implementation (voice_generator.py):
from moviepy import concatenate_audioclips, AudioFileClip
audio_clips = [AudioFileClip(path) for path in slide_audio_paths.values()]
combined = concatenate_audioclips(audio_clips)
combined.write_audiofile(output_path, codec='mp3')
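The implementation above iterates `slide_audio_paths.values()`, so the clip order follows dictionary insertion order. A defensive variant, sketched here with a hypothetical helper `ordered_audio_paths`, sorts by slide number and skips slides whose audio is missing before the clips are loaded:

```python
def ordered_audio_paths(slide_audio_paths: dict) -> list:
    """Return audio paths sorted by slide number, skipping empty entries."""
    return [
        slide_audio_paths[num]
        for num in sorted(slide_audio_paths)
        if slide_audio_paths[num]
    ]


paths = ordered_audio_paths({2: "b.mp3", 1: "a.mp3", 3: None})
print(paths)  # ['a.mp3', 'b.mp3']
```

Skipping a slide shortens the combined track, so the timestamps for later slides would need to account for the gap.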
Stage 5: Visual Generation (50-80%)
Duration : 1-3 minutes (varies by visual complexity)
Location : backend/app.py:306-433
Generators : ManimGenerator, ImageFetcher, SlideRenderer
Process Loop :
for idx, slide in enumerate(content_data['slides'], 1):
    # Assumed lookup: the excerpt does not show where slide_num is set
    slide_num = slide['slide_number']
    visual_progress = 50 + int((idx / total_slides) * 30)
    has_animation = slide.get('needs_animation', False)
    has_image = slide.get('needs_image', False)
    # Mutual exclusivity enforcement
    if has_animation and has_image:
        print(f"⚠️ ERROR: Slide {slide_num} has BOTH flags!")
        has_image = False  # Animation takes priority
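The `visual_progress` expression maps the slide index onto the 50-80% band, and Stage 4 uses the same pattern with a start of 30 and a span of 15. A small sketch that generalizes the arithmetic; the function name `stage_progress` is hypothetical:

```python
def stage_progress(start: int, span: int, idx: int, total: int) -> int:
    """Interpolate progress within a stage: start + floor(idx / total * span)."""
    if total <= 0:
        return start
    fraction = min(max(idx / total, 0.0), 1.0)  # clamp to [0, 1]
    return start + int(fraction * span)


print(stage_progress(50, 30, 1, 2))  # 65
print(stage_progress(50, 30, 5, 5))  # 80
```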
Branch 1: Animation Slides (50-80%, portion):
Generate Manim Code
animation_code = manim_gen.generate_animation_code(slide, duration)
# Returns: Python code string
Save Code
code_path = manim_gen.save_animation_code(
    animation_code, slide_num, topic
)
# Saves: outputs/manim_code/{topic}_slide_{num}.py
Render Animation
video_path = video_renderer.render_manim_animation(
    code_path,
    f"{topic}_slide_{slide_num}"
)
# Executes: manim -qh code_path.py SceneName
# Output: outputs/manim_output/{scene}.mp4
Create Base Slide
base_slide = slide_renderer.create_slide_with_animation_placeholder(
    slide['title'],
    slide['content_text'],
    slide_num,
    topic
)
# Output: PNG with text on left, dark area on right
Store Composite Data
slide_paths[slide_num] = {
    'type': 'animation_composite',
    'base_slide': base_slide,
    'animation': video_path
}
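The render step above shells out to `manim -qh code_path.py SceneName`. How `render_manim_animation` does this is not shown here, so the following is only a sketch of one way to run that command with a timeout. The function names and the 300-second default are hypothetical.

```python
import subprocess
from pathlib import Path


def build_manim_command(code_path: Path, scene_name: str) -> list:
    """Build the CLI invocation quoted above: manim -qh <file> <Scene>."""
    return ["manim", "-qh", str(code_path), scene_name]


def render_with_timeout(code_path: Path, scene_name: str, timeout_s: float = 300.0) -> bool:
    """Run Manim and report success; False on timeout, missing binary, or non-zero exit."""
    try:
        result = subprocess.run(
            build_manim_command(code_path, scene_name),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0
```

Returning a boolean lets the caller fall back to a text-only slide when rendering fails, without having to catch exceptions itself.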
Branch 2: Image Slides (50-80%, portion):
Fetch Image
image_path = image_fetcher.fetch_image(
    slide['image_keyword'],
    slide_num,
    topic
)
# Calls Unsplash API, downloads to outputs/images/
Composite with Text
slide_with_img = slide_renderer.create_slide_with_image(
    slide['title'],
    slide['content_text'],
    image_path,
    slide_num,
    topic
)
# Output: PNG with text on left, image on right
Store Path
slide_paths[slide_num] = slide_with_img
Branch 3: Text-Only Slides (50-80%, portion):
text_slide = slide_renderer.create_text_slide(
    slide['title'],
    slide['content_text'],
    slide_num,
    topic
)
slide_paths[slide_num] = text_slide
# Output: PNG with centered title and content
Progress Breakdown (app.py:435-438):
print(f"\n📊 Final visual breakdown:")
print(f" Animations: {len(animation_paths)}")
print(f" Images: {len(image_paths)}")
print(f" Text-only: {total_slides - len(animation_paths) - len(image_paths)}")
Stage 6: Video Composition (85-95%)
Duration : 30-90 seconds
Location : backend/app.py:441-451
Composer : VideoComposer (video_composer.py)
Process :
Load Slide Clips
for slide in content_data['slides']:
    # slide_script and slide_data are the script entry and the Stage 5 result
    # for this slide; how they are looked up is not shown in this excerpt
    duration = slide_script['end_time'] - slide_script['start_time']
    if isinstance(slide_data, dict) and slide_data['type'] == 'animation_composite':
        slide_clip = composer.composite_animation_on_slide(
            slide_data['base_slide'],
            slide_data['animation'],
            duration
        )
    else:
        slide_clip = composer.create_slide_video(slide_data, duration)
Animation Compositing
For animation slides:
# Load base slide (PNG) and animation (MP4)
slide_clip = ImageClip(slide_image_path, duration=duration)
animation_clip = VideoFileClip(animation_video_path)
# Adjust animation duration
if animation_clip.duration < duration:
    # Loop animation
    num_loops = int(duration / animation_clip.duration) + 1
    animation_adjusted = concatenate_videoclips([animation_clip] * num_loops)
    animation_adjusted = animation_adjusted.subclipped(0, duration)
else:
    # Assumed handling (not shown in the original excerpt): trim to the slide duration
    animation_adjusted = animation_clip.subclipped(0, duration)
# Resize and position
animation_final = animation_adjusted.resized(new_size=(850, 700))
animation_final = animation_final.with_position((1010, 250))
# Composite
composite = CompositeVideoClip(
    [slide_clip, animation_final],
    size=(1920, 1080)
)
Concatenate Slides
final_video = concatenate_videoclips(slide_clips, method="compose")
Add Audio
audio = AudioFileClip(audio_path)
final_video = final_video.with_audio(audio)
Render Final MP4
final_video.write_videofile(
    str(output_path),
    fps=30,
    codec='libx264',
    audio_codec='aac',
    preset='medium',
    bitrate='5000k',
    audio_bitrate='192k'
)
# Output: outputs/final/{topic}_final.mp4
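The looping and positioning arithmetic in the animation compositing step can be checked in isolation. The helpers below are hypothetical and use only numbers from this page (an 850x700 animation at position (1010, 250) in a 1920x1080 frame). Note that `loops_needed` uses a ceiling, a slight variant of the `int(duration / clip_duration) + 1` formula above, which produces one extra loop when the slide duration is an exact multiple of the animation length.

```python
import math


def loops_needed(animation_duration: float, slide_duration: float) -> int:
    """Smallest number of repetitions that covers the slide duration."""
    if animation_duration <= 0:
        raise ValueError("animation_duration must be positive")
    return max(1, math.ceil(slide_duration / animation_duration))


def fits_in_frame(size, position, frame=(1920, 1080)) -> bool:
    """Check that a clip of `size` placed at `position` stays inside `frame`."""
    (w, h), (x, y) = size, position
    return x >= 0 and y >= 0 and x + w <= frame[0] and y + h <= frame[1]


print(loops_needed(4.0, 10.0))                  # 3
print(fits_in_frame((850, 700), (1010, 250)))   # True: 1860 <= 1920, 950 <= 1080
```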
Timing Validation (video_composer.py:289-290):
if abs(final_video.duration - audio.duration) > 0.5:
    print(f"⚠️ Warning: Video ({final_video.duration:.1f}s) doesn't match audio ({audio.duration:.1f}s)")
Stage 7: Completion (100%)
Duration : Instant
Location : backend/app.py:453-470
Tasks :
Extract Filename
video_filename = Path(final_video_path).name
# e.g., "Newtons_Third_Law_final.mp4"
Update Progress
update_progress(generation_id, 100, "completed", "✅ Video generation complete!")
Return Response
return GenerateResponse(
    status="success",
    message="Presentation video generated successfully",
    content_data=content_data,
    script_data=script_data,
    video_path=final_video_path,
    video_filename=video_filename
)
Frontend Transition :
if (response.data.status === "success") {
  const generatedData = {
    content: response.data.content_data,
    script: response.data.script_data,
    videoPath: response.data.video_path,
    videoFilename: response.data.video_filename
  };
  onGenerationComplete(generatedData);
}
Error Handling & Fallbacks
Content Generation Errors
try:
    content_data = content_gen.generate_content(topic, num_slides)
except Exception as e:
    error_msg = f"Error: {str(e)}"
    print(f"Full error:\n{traceback.format_exc()}")
    update_progress(generation_id, 0, "error", f"❌ {error_msg}")
    raise HTTPException(status_code=500, detail=error_msg)
Common Issues :
Gemini API timeout → Retry with exponential backoff
Invalid JSON response → Clean and re-parse
Missing fields → Add defaults and continue
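Retrying with exponential backoff, as listed for Gemini timeouts, can be sketched generically. This is an illustration, not the project's implementation: the helper `retry_with_backoff`, its parameters, and the jitter amount are hypothetical, and production code would normally catch only timeout-type exceptions rather than every `Exception`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # sketch only: narrow this to timeout errors in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Delays grow as 1x, 2x, 4x ... of base_delay, plus a little jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")
```

A call site could then wrap the generator, for example `retry_with_backoff(lambda: content_gen.generate_content(topic, num_slides))`. The injectable `sleep` argument keeps the helper easy to test without real waiting.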
Audio Generation Errors
try:
    audio_path = voice_gen.generate_voice_for_slide(...)
except Exception as e:
    print(f"Error generating audio for slide {slide_num}: {e}")
    # Use estimated duration from content
    actual_durations[slide_num] = slide_script['end_time'] - slide_script['start_time']
Fallback : Continue without audio for that slide (silent)
Image Fetch Errors
try:
    image_path = image_fetcher.fetch_image(keyword, slide_num, topic)
    if not image_path:
        raise ValueError("Image fetch returned empty path")
except Exception as e:
    print(f"❌ Error fetching image for slide {slide_num}: {e}")
    # Fallback to text-only slide
    text_slide = slide_renderer.create_text_slide(...)
    slide_paths[slide_num] = text_slide
Animation Generation Errors
try:
    animation_code = manim_gen.generate_animation_code(slide, duration)
    video_path = video_renderer.render_manim_animation(code_path, scene_name)
except Exception as e:
    print(f"❌ Error generating animation for slide {slide_num}: {e}")
    traceback.print_exc()
    # Fallback to text-only slide
    text_slide = slide_renderer.create_text_slide(...)
    slide_paths[slide_num] = text_slide
Common Manim Errors :
Syntax error in generated code → Show error, fallback to text
Rendering timeout → Skip animation, use text slide
Missing dependencies → Warning in logs, text fallback
Progress Tracking
Progress Percentages
| Stage | Start % | End % | Duration | Status ID |
| --- | --- | --- | --- | --- |
| Initialization | 0 | 10 | 1s | started |
| Content Generation | 10 | 20 | 10-30s | generating_content |
| Script Generation | 20 | 30 | 10-20s | generating_scripts |
| Audio Generation | 30 | 48 | 30-60s | generating_audio |
| Audio Combining | 48 | 49 | 2-5s | combining_audio |
| Visual Generation | 50 | 80 | 60-180s | generating_media |
| - Animation | 50-80 | (portion) | varies | generating_animation |
| - Images | 50-80 | (portion) | varies | fetching_image |
| - Text Slides | 50-80 | (portion) | varies | generating_slide |
| Video Composition | 85 | 95 | 30-90s | composing_video |
| Completion | 95 | 100 | instant | completed |
Real-time Updates
Backend (app.py:52-61):
def update_progress(generation_id: str, progress: int, status: str, message: str):
    timestamp = datetime.now().strftime("%H:%M:%S")
    generation_status[generation_id] = {
        "status": status,
        "progress": progress,
        "message": message,
        "timestamp": timestamp
    }
    print(f"[{timestamp}] {message}")
Frontend (useSSEProgress.jsx):
eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  setProgress(data.progress);
  setStatus(data.status);
  setMessage(data.message);
};
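The frontend handler parses `event.data` as JSON, so each server-sent event needs to carry one JSON-encoded status snapshot. The streaming endpoint itself is not shown on this page. The sketch below only illustrates the wire format (`data: <payload>` followed by a blank line) with two hypothetical helpers, `format_sse` and `progress_events`.

```python
import json
from typing import Iterable, Iterator


def format_sse(payload: dict) -> str:
    """Encode one status snapshot as a Server-Sent Events message."""
    return f"data: {json.dumps(payload)}\n\n"


def progress_events(snapshots: Iterable[dict]) -> Iterator[str]:
    """Emit a message for each changed snapshot; stop after a terminal status."""
    last = None
    for snap in snapshots:
        if snap == last:
            continue  # nothing new to report
        last = snap
        yield format_sse(snap)
        if snap.get("status") in ("completed", "error"):
            return


events = list(
    progress_events(
        [
            {"status": "started", "progress": 0, "message": "Starting"},
            {"status": "started", "progress": 0, "message": "Starting"},
            {"status": "completed", "progress": 100, "message": "Done"},
            {"status": "late", "progress": 100, "message": "ignored"},
        ]
    )
)
```

Skipping unchanged snapshots keeps the stream quiet during long steps such as Manim rendering.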
Total Time by Slide Count
Times are approximate and vary based on:
Gemini API response time
Number of animations (slowest step)
Audio length
System performance
| Slides | Text-Only | With Images | With Animations | Total |
| --- | --- | --- | --- | --- |
| 3 | 1.5 min | 2 min | 3.5 min | ~2-3 min |
| 5 | 2 min | 2.5 min | 4.5 min | ~3-5 min |
| 10 | 3 min | 4 min | 7 min | ~5-8 min |
Bottlenecks
Manim Rendering (30-60s per animation)
Solution: Limit animations to 1-2 per presentation
Alternative: Pre-render common animations
Gemini API Calls (10-30s per call)
Solution: Use streaming responses (future)
Cache: Store common content patterns
Video Composition (30-90s)
Depends on: Total video length, number of clips
Optimization: Use GPU acceleration if available
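The caching idea listed under Gemini API Calls could start as a lookup keyed on the request parameters. This is a hypothetical sketch, not an existing feature: `content_cache_key` and `ContentCache` are invented names, and the cache lives in memory only. The key fields are the request parameters from Stage 1.

```python
import hashlib
import json
from typing import Callable


def content_cache_key(topic: str, num_slides: int, language: str, tone: str) -> str:
    """Derive a stable key from the request parameters."""
    payload = json.dumps(
        {
            "topic": topic.strip().lower(),
            "num_slides": num_slides,
            "language": language.lower(),
            "tone": tone.lower(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ContentCache:
    """In-memory cache; a real deployment would likely persist entries."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get_or_create(self, key: str, factory: Callable[[], dict]) -> dict:
        # Only call the (slow) factory when the key has not been seen before
        if key not in self._store:
            self._store[key] = factory()
        return self._store[key]
```

A caller might pass `lambda: content_gen.generate_content(topic, num_slides)` as the factory, so repeated requests for the same topic skip the 10-30 second generation call.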
Next Steps
API Reference: Explore detailed API documentation
Troubleshooting: Common issues and solutions
Backend Architecture: Understand the backend structure
Frontend Architecture: Understand the frontend structure