ad_placement analysis from step 2, and the product_name from step 1, then produces a merged video with the ad seamlessly inserted at the chosen timestamp.
The full pipeline is orchestrated by generate_edited_video() in app/video_ad_integration.py.
Step 1: Frame extraction
Splyce uses ffmpeg to extract a single still frame from the original video. The extraction timestamp is calculated as ad_timestamp_seconds + (AD_SEGMENT_DURATION × AD_FRAME_OFFSET_RATIO), where:
- AD_SEGMENT_DURATION defaults to 3 seconds
- AD_FRAME_OFFSET_RATIO defaults to 0.2 (20% into the ad window)
For example, with a target timestamp of 12.5 seconds: 12.5 + (3 × 0.2) = 13.1 seconds. This offset places the frame slightly after the cut point, giving a more representative frame for editing.
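The offset calculation above can be sketched as a small helper; the function name `compute_frame_timestamp` is hypothetical, while the constants and defaults come from the docs:

```python
# Defaults documented above.
AD_SEGMENT_DURATION = 3.0       # seconds
AD_FRAME_OFFSET_RATIO = 0.2     # 20% into the ad window

def compute_frame_timestamp(ad_timestamp_seconds: float) -> float:
    """Timestamp at which the still frame is extracted, slightly
    after the cut point so the frame is representative of the window."""
    return ad_timestamp_seconds + AD_SEGMENT_DURATION * AD_FRAME_OFFSET_RATIO
```

With the worked example above, `compute_frame_timestamp(12.5)` yields 13.1 seconds.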
Step 2: Frame editing
generate_edited_frame() sends the extracted frame to a Gemini image generation model along with:
- The product_name
- The ad_description from the placement analysis
- The edit_instruction specifying the exact body location
Model fallback order
Splyce tries image models in this order:
1. gemini-3.1-flash-image-preview
2. gemini-2.5-flash-image
3. Text overlay fallback (overlay_product_label_on_image)
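The fallback cascade can be sketched as a simple loop. The model names are from the docs above; `try_image_model` and the overlay callable stand in for the real Gemini and `overlay_product_label_on_image` helpers, whose signatures are assumptions here:

```python
# Model preference order documented above.
IMAGE_MODELS = ["gemini-3.1-flash-image-preview", "gemini-2.5-flash-image"]

def generate_edited_frame_with_fallback(frame, prompt, try_image_model, overlay_fallback):
    """Try each image model in order; if all fail, fall back to
    drawing a text label on the original frame."""
    for model in IMAGE_MODELS:
        try:
            return try_image_model(model, frame, prompt)
        except Exception:
            continue  # this model failed or is unavailable; try the next
    # Last resort: text overlay on the unedited frame
    return overlay_fallback(frame)
```

The key property is that a failure at any tier degrades gracefully instead of failing the whole pipeline.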
Product placement logic
The placement target is derived from ad_description.visual:
- Watches → placed on a wrist
- Phones → placed in a hand
- Other products → worn or held according to Gemini’s interpretation of the scene
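As an illustration only, the placement rules above could be expressed as a keyword mapping; in practice the decision is made by Gemini from the prompt, so this function and its name are hypothetical:

```python
def placement_target(visual_description: str) -> str:
    """Map an ad_description.visual string to a placement target,
    mirroring the documented rules (watches -> wrist, phones -> hand)."""
    text = visual_description.lower()
    if "watch" in text:
        return "wrist"
    if "phone" in text:
        return "hand"
    # Everything else: worn or held per Gemini's reading of the scene
    return "scene-dependent"
```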
Step 3: Voiceover generation
generate_voiceover() calls ElevenLabs TTS to synthesize a fixed voiceover line constructed from the product name:
Voiceover line: "Oh wow, a {product_name}." — the product_name value is substituted at generation time.
If use_cloned_voice is enabled (default), ElevenLabs uses Instant Voice Clone with a reference audio file. The default reference is wolf_voice.mp4, configurable via the VOICE_REFERENCE_PATH environment variable. The cloned voice matches the character’s speaking style, making the voiceover blend with the surrounding dialogue.
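Only the voiceover template and the reference-clip default come from the docs; the helper names below are hypothetical and the ElevenLabs API calls themselves are elided:

```python
import os

def build_voiceover_line(product_name: str) -> str:
    """The fixed voiceover template, with product_name substituted."""
    return f"Oh wow, a {product_name}."

def pick_voice_reference() -> str:
    """Reference audio for Instant Voice Clone; wolf_voice.mp4 is the
    documented default, overridable via VOICE_REFERENCE_PATH."""
    return os.environ.get("VOICE_REFERENCE_PATH", "wolf_voice.mp4")
```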
Voice cloning requires a valid ElevenLabs API key with access to the Instant Voice Clone feature. If cloning fails or
use_cloned_voice is false, ElevenLabs falls back to a standard pre-built voice specified by voice_id.
Step 4: Ad segment assembly
build_ad_segment() uses ffmpeg to combine the edited frame and voiceover audio into a 3-second video segment (AD_SEGMENT_DURATION). The frame is held as a still image for the full duration while the voiceover plays.
The segment is written to a temporary file and used directly in the splice step.
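A minimal sketch of the ffmpeg invocation for this step, expressed as an argument-list builder. The loop-a-still-frame-with-audio pattern is standard ffmpeg usage, but the exact flags Splyce passes are an assumption:

```python
AD_SEGMENT_DURATION = 3  # seconds, per the docs

def build_ad_segment_cmd(frame_path: str, audio_path: str, out_path: str) -> list[str]:
    """ffmpeg args that hold a still frame for the full segment
    duration while the voiceover audio plays underneath."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", frame_path,   # repeat the still frame as video
        "-i", audio_path,                 # voiceover track
        "-t", str(AD_SEGMENT_DURATION),   # cap the segment at 3 seconds
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        out_path,
    ]
```

The list would typically be handed to `subprocess.run(cmd, check=True)` with `out_path` pointing at a temporary file.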
Step 5: Video splicing
splice_video() inserts the 3-second ad segment into the original video at ad_timestamp_seconds using ffmpeg’s filter_complex concat filter. The operation replaces exactly the 3-second window starting at the target timestamp — no frames from the original video are preserved within that window.
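The trim-insert-resume behaviour described above can be sketched as a filter_complex graph builder. Audio handling and stream labels are simplified here (a=0 ignores audio), so treat this as an illustration, not the exact graph splice_video() emits:

```python
AD_SEGMENT_DURATION = 3  # seconds, per the docs

def build_splice_filter(cut: float) -> str:
    """filter_complex graph: original video up to `cut`, then the ad
    segment (input 1), then the original resuming 3 seconds later,
    so the 3-second window starting at `cut` is fully replaced."""
    resume = cut + AD_SEGMENT_DURATION
    return (
        f"[0:v]split[a][b];"
        f"[a]trim=0:{cut},setpts=PTS-STARTPTS[pre];"
        f"[b]trim=start={resume},setpts=PTS-STARTPTS[post];"
        f"[pre][1:v][post]concat=n=3:v=1:a=0[outv]"
    )
```

The `split` is needed because an input stream label may be consumed only once inside a filter_complex graph.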
Natural cut point detection
Before splicing, Splyce uses ffprobe to scan for the nearest I-frame (keyframe) within ±1.5 seconds of the target timestamp. Cutting at an I-frame avoids visual artifacts that occur when splitting at non-keyframe positions in H.264/H.265 streams. If no I-frame is found within the ±1.5 second window, Splyce falls back to the exact ad_timestamp_seconds value.
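The selection logic can be sketched in pure Python. Obtaining the keyframe timestamps themselves (e.g. via an ffprobe call such as `ffprobe -select_streams v -skip_frame nokey -show_frames`) is elided; only the nearest-within-window rule from the docs is shown:

```python
SEARCH_WINDOW = 1.5  # seconds either side of the target, per the docs

def pick_cut_point(target: float, keyframes: list[float]) -> float:
    """Return the keyframe timestamp closest to `target` within
    ±SEARCH_WINDOW, or `target` itself if none qualifies."""
    candidates = [t for t in keyframes if abs(t - target) <= SEARCH_WINDOW]
    if not candidates:
        return target  # fall back to the exact requested timestamp
    return min(candidates, key=lambda t: abs(t - target))
```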
Output
The merged video is written to the server's output directory and served over HTTP. The /api/generate-ad-video response includes the full URL to the merged video file.