z_index layering, word_animation (color), fade motion preset, text stroke & shadow
The layer order is the whole trick:
How the z-depth sandwich works
vid2 (clip4.mp4) is the full background clip at z_index: 0 — it carries the audio. vid1 is the same footage as a transparent WebM at z_index: 20 with volume: 0, so only the speaker silhouette composites on top. The giant "BOMB!" title sits at z_index: 10 between them, which is why the speaker appears to stand in front of the letters.
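To make the stacking concrete, here is a minimal Python sketch of the sandwich. It is not the renderer's code: the name field and the volume value of 1 on the background are assumptions for illustration, and only z_index and volume: 0 come from the description above.

```python
# Hypothetical layer records. Only "z_index" and the cutout's "volume": 0
# come from the text; "name" and the background's volume of 1 are assumed.
layers = [
    {"name": "vid1 cutout (transparent WebM)", "z_index": 20, "volume": 0},
    {"name": "BOMB! title", "z_index": 10},
    {"name": "vid2 background (clip4.mp4)", "z_index": 0, "volume": 1},
]


def paint_order(layers):
    """Return layer names back to front: the lowest z_index is drawn first."""
    return [layer["name"] for layer in sorted(layers, key=lambda l: l["z_index"])]


def audible(layers):
    """Return the names of layers that contribute audio (non-zero volume)."""
    return [layer["name"] for layer in layers if layer.get("volume", 0) > 0]


if __name__ == "__main__":
    print(paint_order(layers))
    print(audible(layers))
```

The title lands between the two video layers in the paint order, and only the background layer is audible.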
The title fades in at second 2 and fades out at second 5 using paired fade motion presets — the second entry sets reversed: true to dissolve out instead of in.
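The paired fades can be modelled as an opacity curve. The sketch below is an assumption-heavy illustration, not the renderer's implementation: the start and duration fields, the 0.5 s fade length, and the reading that each fade begins at its listed second are all hypothetical. Only reversed comes from the description above.

```python
# Two fade motions on the title. "reversed" is from the text; "start",
# "duration" and the 0.5 s length are assumptions made for this sketch.
title_motions = [
    {"preset": "fade", "start": 2.0, "duration": 0.5, "reversed": False},
    {"preset": "fade", "start": 5.0, "duration": 0.5, "reversed": True},
]


def opacity(t, motions):
    """Return opacity in [0, 1] at time t (seconds) for a list of fade motions."""
    value = 0.0  # invisible until the first fade begins
    for motion in sorted(motions, key=lambda m: m["start"]):
        if t < motion["start"]:
            break
        progress = min(1.0, (t - motion["start"]) / motion["duration"])
        # A reversed fade runs from opaque to transparent.
        value = 1.0 - progress if motion["reversed"] else progress
    return value


if __name__ == "__main__":
    for t in (1.0, 2.25, 3.0, 5.25, 6.0):
        print(t, opacity(t, title_motions))
```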
Karaoke captions on top
The bottom caption line uses word_animation.style: "color" with per-word timestamps from transcription. background_color on the text element sets the highlight color for the active word. At z_index: 30 it sits above the speaker cutout so captions stay readable.
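The color style boils down to picking the word whose time span contains the current playback time. The sketch below shows that lookup under stated assumptions: the word list, its start and end fields, and both color values are invented for illustration and are not taken from the template or from a real transcription.

```python
# Made-up per-word timestamps in seconds; a real transcription supplies these.
words = [
    {"text": "This", "start": 0.0, "end": 0.4},
    {"text": "is", "start": 0.4, "end": 0.6},
    {"text": "the", "start": 0.6, "end": 0.8},
    {"text": "bomb", "start": 0.8, "end": 1.5},
]


def active_word(words, t):
    """Return the index of the word being spoken at time t, or None."""
    for index, word in enumerate(words):
        if word["start"] <= t < word["end"]:
            return index
    return None


def word_colors(words, t, base="#FFFFFF", highlight="#FFD400"):
    """Return one color per word, with the active word highlighted.

    Both default colors are placeholders chosen for this sketch.
    """
    current = active_word(words, t)
    return [highlight if i == current else base for i in range(len(words))]


if __name__ == "__main__":
    print(active_word(words, 0.5))
    print(word_colors(words, 1.0))
```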
Tips
Font size matters. The title needs to be large enough that the speaker visibly overlaps it: font_size: 320 on a 3840px-wide frame fills the frame. Scale down for 1080p or vertical formats.
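One way to pick a starting value for other resolutions is to scale linearly with frame width. This is a rule of thumb assumed for this sketch, not something the renderer does for you. The helper name is hypothetical, and the result usually still needs adjusting by eye, especially for vertical formats.

```python
def scaled_font_size(target_width, ref_size=320, ref_width=3840):
    """Scale a font size in proportion to frame width, rounded to an integer.

    Defaults use the reference pair from the tip above (320 at 3840px wide).
    """
    return round(ref_size * target_width / ref_width)


if __name__ == "__main__":
    print(scaled_font_size(1920))  # 1080p landscape, 1920px wide
    print(scaled_font_size(1080))  # vertical frame, 1080px wide
```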
Two video layers, one clip. Use the full MP4 for background + audio and a pre-matted WebM (or a second layer with remove_background) for the foreground cutout. Set the cutout layer to volume: 0 so audio isn’t doubled.
Logo sting. A short-lived image at the highest z_index with a reversed fade motion (reversed: true) dissolves the logo out in the first half-second without a hard cut.
Want AI background removal instead? Replace the WebM layer with a single video element using remove_background: true at z_index: 20. That triggers a separate pre-processing task before the render begins.
