TikTok Captions

The caption stack that performs on TikTok: an opening title lands with a Difference blend motion effect, then each spoken line gets its own font, color treatment, and word-animation style, timed to word-level timestamps from transcription. One video layer carries the audio; caption elements swap in and out on the timeline.

Features used: difference motion, word_animation (color, box, glow)

{
  "width": 1080,
  "height": 1920,
  "elements": [
    {
      "type": "video",
      "id": "vid1",
      "source_url": "https://cdn-assets.framelane.io/shared/videos/clip1.mp4",
      "volume": 50
    },
    {
      "type": "text",
      "id": "t1",
      "text": "Albert",
      "font_family": "Alfa Slab One",
      "text_color": "#FFFFFF",
      "font_size": 260,
      "x": "50%",
      "y": "70%",
      "time": 0,
      "duration": 4.193,
      "motion": [
        {
          "type": "difference",
          "time": 0,
          "duration": 4.193
        }
      ]
    },
    {
      "type": "text",
      "id": "t2",
      "text": "When I came to you with those calculations",
      "font_family": "Komika Axis",
      "text_color": "#FFFFFF",
      "stroke_color": "#000000",
      "stroke_width": 0.1,
      "font_size": 100,
      "background_color": "#850000",
      "x": "50%",
      "y": "75%",
      "time": 4.496,
      "duration": 2.493,
      "word_animation": {
        "style": "color",
        "words": [
          { "text": "When", "start": 4.496, "end": 4.577 },
          { "text": "I", "start": 4.593, "end": 4.69 },
          { "text": "came", "start": 4.69, "end": 5.014 },
          { "text": "to", "start": 5.014, "end": 5.095 },
          { "text": "you", "start": 5.095, "end": 5.24 },
          { "text": "with", "start": 5.24, "end": 5.37 },
          { "text": "those", "start": 5.58, "end": 5.871 },
          { "text": "calculations,", "start": 5.984, "end": 6.793 }
        ]
      }
    },
    {
      "type": "text",
      "id": "t3",
      "text": "we thought we might start a chain reaction that would destroy the entire world",
      "font_family": "Bebas Neue",
      "text_color": "#FFFFFF",
      "font_size": 100,
      "background_color": "#47008E",
      "x": "50%",
      "y": "75%",
      "time": 7.682,
      "duration": 6.828,
      "word_animation": {
        "style": "box",
        "words": [
          { "text": "we", "start": 7.682, "end": 7.828 },
          { "text": "thought", "start": 7.828, "end": 8.07 },
          { "text": "we", "start": 8.07, "end": 8.183 },
          { "text": "might", "start": 8.183, "end": 8.491 },
          { "text": "start", "start": 8.491, "end": 8.814 },
          { "text": "a", "start": 8.814, "end": 8.895 },
          { "text": "chain", "start": 8.895, "end": 9.202 },
          { "text": "reaction", "start": 9.202, "end": 9.752 },
          { "text": "that", "start": 9.768, "end": 9.946 },
          { "text": "would", "start": 9.946, "end": 10.043 },
          { "text": "destroy", "start": 10.043, "end": 11.757 },
          { "text": "the", "start": 11.822, "end": 11.968 },
          { "text": "entire", "start": 11.968, "end": 12.226 },
          { "text": "world.", "start": 12.938, "end": 14.363 }
        ]
      }
    },
    {
      "type": "text",
      "id": "t4",
      "text": "I remember it well. What of it?",
      "font_family": "Lemon",
      "text_color": "#FFFFFF",
      "background": true,
      "background_color": "#000000",
      "background_opacity": 70,
      "x_padding": "3%",
      "y_padding": "1.5%",
      "font_size": 100,
      "x": "50%",
      "y": "75%",
      "time": 14.70,
      "duration": 2.719,
      "word_animation": {
        "style": "glow",
        "words": [
          { "text": "I", "start": 14.703, "end": 14.719 },
          { "text": "remember", "start": 14.719, "end": 15.205 },
          { "text": "it", "start": 15.205, "end": 15.367 },
          { "text": "well.", "start": 15.367, "end": 15.707 },
          { "text": "What", "start": 16.743, "end": 16.985 },
          { "text": "of", "start": 17.05, "end": 17.163 },
          { "text": "it.", "start": 17.163, "end": 17.406 }
        ]
      }
    },
    {
      "type": "text",
      "id": "t5",
      "text": "I believe we did!",
      "font_family": "Poppins",
      "text_color": "#FFFFFF",
      "stroke_color": "#000000",
      "background_color": "#FF5000",
      "stroke_width": 0.1,
      "shadow_color": "#000000",
      "shadow_x": "5%",
      "shadow_y": "8%",
      "font_size": 120,
      "x": "50%",
      "y": "75%",
      "time": 20.74,
      "duration": 3.593,
      "word_animation": {
        "style": "color",
        "words": [
          { "text": "I", "start": 21.74, "end": 21.844 },
          { "text": "believe", "start": 21.844, "end": 22.123 },
          { "text": "we", "start": 22.123, "end": 22.21 },
          { "text": "did.", "start": 22.21, "end": 22.437 }
        ]
      }
    }
  ]
}

How this request is structured

One video, many caption lines. The video element carries the footage and audio (volume: 50). Each caption is a separate text element with its own time and duration, so the captions appear sequentially, not all at once.

Opening title uses blend motion. The first text element ("Albert") has no word_animation. Instead it uses a motion preset with "type": "difference". This is the After Effects Difference blend mode, which outputs the per-channel absolute difference against the video backdrop (|backdrop − color|), so white text shows the footage inverted. Set motion[].time and motion[].duration to match the element’s time and duration so the effect runs for the full title window.

Give each caption a unique id. Element ids must be unique: the API rejects a request with duplicate ids (422). The captions here use t1 through t5.

Word timestamps are absolute. Each word’s start and end are seconds from the start of the composition, not relative to the text element’s time. See Word Animation Examples for details.

Mix styles per line. After the opening title, this reel cycles through three karaoke styles:

Line	Effect	Caption treatment
"Albert"	`motion` → `difference` blend	Large display type, no word animation
"When I came to you…"	`word_animation` → `color`	Stroke + `background_color` as highlight color
"we thought we might…"	`word_animation` → `box`	Colored box behind the active word
"I remember it well…"	`word_animation` → `glow`	Semi-transparent background bar, active word at full opacity
"I believe we did!"	`word_animation` → `color`	Stroke, shadow, and orange highlight
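
The two rules in this section that are easiest to break, unique ids and absolute word timestamps, can be checked locally before a request is sent. The following is a minimal sketch, not part of any FrameLane SDK: check_captions and its tolerance parameter are hypothetical names, and the check assumes that every word should fall inside its element's time to time + duration window, which holds for the request above but is not stated as an API requirement.

```python
from collections import Counter


def check_captions(request: dict, tolerance: float = 0.05) -> list[str]:
    """Return human-readable problems found in a render request (hypothetical helper)."""
    problems = []
    elements = request.get("elements", [])

    # Rule 1: element ids must be unique across the whole request.
    id_counts = Counter(el["id"] for el in elements if "id" in el)
    for element_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id {element_id!r} used {count} times")

    # Rule 2: word timestamps are absolute, so each word is expected to sit
    # inside the element's own window (assumption: small drift is tolerated).
    for el in elements:
        words = el.get("word_animation", {}).get("words", [])
        if not words:
            continue
        window_start = el["time"]
        window_end = el["time"] + el["duration"]
        for word in words:
            if word["end"] < word["start"]:
                problems.append(f"{el['id']}: word {word['text']!r} ends before it starts")
            elif (word["start"] < window_start - tolerance
                  or word["end"] > window_end + tolerance):
                problems.append(
                    f"{el['id']}: word {word['text']!r} at {word['start']}s "
                    f"is outside the element window"
                )
    return problems


# A deliberately broken request: one relative timestamp and one reused id.
request = {
    "elements": [
        {"type": "video", "id": "vid1"},
        {"type": "text", "id": "t2", "time": 4.496, "duration": 2.493,
         "word_animation": {"style": "color", "words": [
             {"text": "When", "start": 4.496, "end": 4.577},
             {"text": "I", "start": 0.097, "end": 0.194},  # relative by mistake
         ]}},
        {"type": "text", "id": "t2", "time": 7.682, "duration": 6.828},
    ]
}

for problem in check_captions(request):
    print(problem)
```

A word whose start is close to zero while its element starts several seconds in is the usual sign that timestamps were left relative to the caption instead of the composition.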

Getting word timestamps

Word timestamps come from a speech-to-text / forced alignment pipeline. Common sources:

WhisperX — word-level alignment on top of Whisper transcriptions
AssemblyAI / Deepgram — both return word-level timestamps in their transcription API response
The FrameLane Transcribe task returns word timestamps directly usable in word_animation.words
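
When the timestamps come from a third-party aligner rather than the Transcribe task, they usually need a small reshaping step. The sketch below assumes the aligner yields a list of dicts with word, start, and end keys (a shape WhisperX-style tools commonly produce; verify against your own output). The helper names, the offset parameter for clips trimmed out of a longer recording, and the 0.2 s tail padding are illustrative assumptions, not FrameLane API features.

```python
def to_caption_words(aligned_words, offset=0.0):
    """Map aligner output to word_animation.words entries with absolute times."""
    words = []
    for item in aligned_words:
        # Some aligners omit timings for tokens they could not align; skip those.
        if "start" not in item or "end" not in item:
            continue
        words.append({
            "text": item["word"].strip(),
            # Shift into composition time so the timestamps are absolute.
            "start": round(item["start"] + offset, 3),
            "end": round(item["end"] + offset, 3),
        })
    return words


def build_caption(element_id, words, animation_style="color", tail=0.2):
    """Build one text element whose window covers all of its words."""
    first, last = words[0], words[-1]
    return {
        "type": "text",
        "id": element_id,
        "text": " ".join(w["text"] for w in words),
        "time": first["start"],
        # Keep the caption on screen briefly after the last word (assumed padding).
        "duration": round(last["end"] - first["start"] + tail, 3),
        "word_animation": {"style": animation_style, "words": words},
    }


# Aligner output timed against a clip that starts 20 s into the composition.
aligned = [
    {"word": " I", "start": 1.74, "end": 1.844},
    {"word": " believe", "start": 1.844, "end": 2.123},
    {"word": " we", "start": 2.123, "end": 2.21},
    {"word": " did.", "start": 2.21, "end": 2.437},
]

caption = build_caption("t5", to_caption_words(aligned, offset=20.0))
print(caption["text"], caption["time"], caption["duration"])
```

Styling fields such as font_family, text_color, and background_color are left out here and would be merged into the returned element before it is added to elements.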
