Building AI-Powered Videos from Audio at KukuFM
1. Introduction
KukuFM is India’s leading entertainment platform, producing thousands of audio shows every single day. While audio has always been our core, our community was clamoring for video. But converting an immense, multilingual audio library into engaging video manually would take years. Instead, we built an in-house AI video generation pipeline. In this post, I’ll walk you step-by-step through how we turned raw audio into fully rendered videos using AWS Transcribe, scene chunking, Leonardo.ai for visuals, and MoviePy for final assembly.
2. Why Video, Why Now?
KukuFM already owns one of India’s largest vernacular audio libraries, with over 100 million listeners.
Yet, new-to-internet audiences are spending most of their screen time on short-video apps. If we want our stories to travel further, they need a visual form.
Evaluating “audio-to-video” SaaS vs. Building In-House
“Off-the-shelf” SaaS
English-first pipelines
More than $2 per minute
Black-box models
Our In-House Pipeline
Supports 7+ Indian languages
< $0.30 per minute
Full control over prompts, styling & QC
3. Background: From Audio to Video
Founded as an audio-only platform, KukuFM scaled to millions of listeners by focusing on regional storytelling (Hindi, Tamil, Malayalam, Telugu, and more).
As video consumption exploded, we quickly realized:
🔊 Audio has the story
🎥 Video has the engagement
Our goal was clear: “Convert every audio show into video automatically.”
4. Core Challenges
Multilingual Content
English models outperform regional‐language models; we needed extra preprocessing for Hindi, Tamil, etc.
Character Consistency
Visually representing the same narrator/character across scenes required a “character reference” workflow.
Long‐Form Narratives
Properly chunking 20–60 minute audios into coherent 5–15 second video scenes took careful LLM prompting + post‐processing.
5. Our AI Video Generation Pipeline
flowchart LR
A(Audio Files) --> B(Transcription)
B --> C(Scene Segmentation)
C --> D(Text Normalization)
D --> E(Generate Scene JSON)
E --> F(Image Generation @Leonardo)
F --> G(Video Stitching @MoviePy)
G --> H(Final MP4)
Step 1: Audio Transcription
We rely on AWS Transcribe for both text and timestamps. It handles English superbly; regional languages require extra care.
Workflow
Upload audio to S3
Kick off a Transcribe job
Store output JSON
Key Considerations
Pre-register proper names and regional idioms in a custom vocabulary to boost accuracy; vocabulary filters (used in the code below) mask terms you don't want surfacing in subtitles.
Set ShowSpeakerLabels when multiple narrators appear.
import boto3, uuid, time

transcribe = boto3.client('transcribe', region_name='us-east-1')

def transcribe_audio(bucket, audio_key, language_code='hi-IN'):
    job_name = f"transcription-{uuid.uuid4()}"
    media_uri = f"s3://{bucket}/{audio_key}"
    role_arn = "arn:aws:iam::123456789012:role/TranscribeS3AccessRole"
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': media_uri},
        MediaFormat=audio_key.split('.')[-1],
        LanguageCode=language_code,
        OutputBucketName=bucket,
        Settings={
            'ShowSpeakerLabels': True,
            'MaxSpeakerLabels': 2,
            'VocabularyFilterName': 'RegionalTermsFilter',
            'VocabularyFilterMethod': 'mask'
        },
        JobExecutionSettings={'DataAccessRoleArn': role_arn}
    )
    # Poll until the job completes or fails
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        time.sleep(10)
    return status['TranscriptionJob']['Transcript']['TranscriptFileUri']

# Example usage
transcript_uri = transcribe_audio('kukufm-audio-bucket', 'episode123.mp3')
print("Transcript available at:", transcript_uri)
IAM role details: The TranscribeS3AccessRole needs s3:GetObject, s3:PutObject, transcribe:StartTranscriptionJob, and transcribe:GetTranscriptionJob permissions.
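Downstream steps work off word-level timestamps rather than the raw URI. A minimal sketch of flattening the Transcribe output JSON into that list (the helper name is ours; the variable transcript_chars matches its use later on, and fetching assumes the transcript URI is reachable with a plain HTTP GET; for a private output bucket, pull the object with boto3 instead):

import json, urllib.request

def load_transcript_words(transcript_uri):
    """Fetch the Transcribe output JSON and flatten it into word-level items."""
    with urllib.request.urlopen(transcript_uri) as resp:
        data = json.load(resp)
    words = []
    for item in data['results']['items']:
        if item['type'] != 'pronunciation':  # skip punctuation-only items
            continue
        words.append({
            'word': item['alternatives'][0]['content'],
            'start': float(item['start_time']),
            'end': float(item['end_time'])
        })
    return words

transcript_chars = load_transcript_words(transcript_uri)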
Step 2: AI-Powered Character Extraction
First, let the LLM “read” your entire transcript and spit back a list of characters (with names and image-prompt descriptions). You’ll use these names both to tag speaking parts and to look up the right reference IDs when calling Leonardo.ai.
# 1. Extract characters from the full script
characters = analyze_characters(script_text)
# Example result:
# [
# {"name": "Dhiren", "prompt": "Dhiren is a confident middle-aged man..."},
# {"name": "Priya", "prompt": "Priya is a young woman with..."},
# ...
# ]
character_names = [c["name"] for c in characters]
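analyze_characters is our own wrapper around the LLM call; a minimal sketch of the idea, assuming the OpenAI chat completions API (the model name and prompt wording are illustrative; in production we route across the providers listed in the tech stack):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_characters(script_text):
    """Ask the LLM for every named character plus an image-prompt description."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; swap in your preferred provider
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": ("Extract every named character from the script. "
                         "Return JSON: {\"characters\": [{\"name\": ..., "
                         "\"prompt\": \"50+ word visual description\"}]}")},
            {"role": "user", "content": script_text}
        ]
    )
    return json.loads(resp.choices[0].message.content)["characters"]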
Step 3: AI-Powered Scene Segmentation & Chunking
Split your word-level transcript (with AWS Transcribe timestamps) into ~500-word chunks and feed each chunk plus character_names
into your “scene analyzer.” The model will:
🔍 Preserve every word exactly (no edits)
⏱ Enforce ≤12 s per scene (~30 words max)
🎭 Tag each scene with one of your known characters (or leave blank for environment)
🎨 Produce a ≥50-word visual description (camera, lighting, setting, etc.)
# 2. Break the transcript into 500-word chunks
chunks = chunk_transcript(transcript_chars, size=500)
all_scenes = []
for chunk in chunks:
    scenes = analyze_scenes(
        transcript_chunk=chunk,
        character_list=character_names
    )
    all_scenes.extend(scenes)
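chunk_transcript itself is plain slicing over the word-level list from Step 1; a sketch:

def chunk_transcript(words, size=500):
    """Split word-level transcript items into ~size-word chunks, keeping timestamps intact."""
    return [words[i:i + size] for i in range(0, len(words), size)]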
Step 4: Text Normalization & Timestamp Alignment
Normalize problematic Unicode so subtitles never glitch, then map each scene’s voiceover
back to AWS Transcribe’s timestamps to compute perfect durations.
NORMALIZE = {
'ड': 'ड़', 'ढ': 'ढ़', 'क': 'क़', 'ख': 'ख़',
'ग': 'ग़', 'ज': 'ज़', 'फ': 'फ़', 'य': 'य़'
}
def normalize_text(s):
    return ''.join(NORMALIZE.get(ch, ch) for ch in s)
# Example:
# Original: "क्विज़ के नियम बताए गए हैं"
# Normalized: "क़्विज़ के नियम बताए गए हैं"
For each scene:
Find the first/last word indices in the transcript.
Calculate duration = end_time_last – start_time_first.
If there’s a gap before the next word, add it as a “pause” and prefix the subtitle with “⏸️ ”.
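A sketch of that alignment, assuming the word-level list from Step 1 and that each scene’s voiceover words appear in order (the matching is a simplified running cursor, and the gap threshold is illustrative):

def align_scene_timings(scenes, words, gap_threshold=0.5):
    """Assign each scene a start and duration from Transcribe word timestamps."""
    cursor = 0
    for scene in scenes:
        n_words = len(scene['voiceover'].split())
        first = words[cursor]
        last = words[min(cursor + n_words, len(words)) - 1]
        scene['start'] = first['start']
        scene['duration'] = last['end'] - first['start']
        cursor = min(cursor + n_words, len(words))
        # Gap before the next word? Count it as a pause and mark the subtitle.
        if cursor < len(words):
            gap = words[cursor]['start'] - last['end']
            if gap > gap_threshold:
                scene['duration'] += gap
                scene['voiceover'] = '⏸️ ' + scene['voiceover']
    return scenes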
Step 5: Scene-Object Structure
After chunking and timing, each scene is represented as:
[
  {
    "id": "UUID-string",
    "scene": "Short title",
    "visual": "Detailed Leonardo.ai prompt",
    "duration": 8.5,
    "character": "Dhiren",
    "voiceover": "मुझे अभी अभी पता चला है…"
  },
  {
    "id": "another-UUID",
    "scene": "Environment shot",
    "visual": "Wide shot of a busy street at sunset…",
    "duration": 6.2,
    "character": "",
    "voiceover": "…"
  }
]
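If you prefer this contract enforced in code rather than by convention, the same object as a small typing sketch:

from typing import TypedDict

class Scene(TypedDict):
    id: str          # UUID string
    scene: str       # short title
    visual: str      # detailed Leonardo.ai prompt
    duration: float  # seconds, computed in Step 4
    character: str   # a known character name, or "" for environment shots
    voiceover: str   # exact transcript text for the scene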
Step 6: Image Generation with Leonardo.ai
Once you have your scenes array and your list of character references (each with the Leonardo reference ID and metadata generated in Step 2), call the image-generation service with:
prompt: the scene["visual"] text
init_characters_id: the full list of character records you fetched/created earlier
current_character: the scene’s character name (so the service can pick the right reference ID)
Then you poll for the final URL, or use a webhook instead.
from typing import Dict, List
import asyncio, os

from app.services.leonardo_service import LeonardoAIService

async def render_scenes(scenes: List[Dict], character_refs: List[Dict]):
    # 1. Instantiate the service
    leo = LeonardoAIService(api_key=os.getenv("LEONARDO_API_KEY"))
    frames = []
    # 2. For each scene, kick off a generation
    for scene in scenes:
        gen = await leo.generate_image(
            prompt=scene["visual"],
            width=1080,
            height=1920,
            # ← your list of all character entries from Step 2
            init_characters_id=character_refs,
            # ← tells Leonardo which character to focus on
            current_character=scene["character"]
        )
        # 3. Poll until the image is ready
        result = await leo.poll_generation_status(gen["generation_id"])
        image_url = result["url"]
        frames.append({
            "scene_id": scene["id"],
            "character_ref_id": gen["saved_init_id"],  # the Leonardo reference we used
            "url": image_url
        })
    return frames

# Example usage
frames = asyncio.run(render_scenes(all_scenes, characters))
init_characters_id is your list of { id, name, metadata } objects from the initial character-extraction step.
current_character ensures Leonardo picks the right face/style each time.
saved_init_id is returned so you know which reference it actually used.
With this in place, every time your scene analyzer tags “Dhiren,” Leonardo.ai will consistently reuse that same Dhiren appearance—no drift, no guesswork.
Step 7: Stitching Everything with MoviePy
from moviepy.editor import (
    ImageClip, AudioFileClip, TextClip,
    CompositeVideoClip, concatenate_videoclips
)
from concurrent.futures import ThreadPoolExecutor

def make_clip(scene):
    img = ImageClip(f"frames/{scene['id']}.png").set_duration(scene['duration'])
    audio = AudioFileClip(f"audio_chunks/{scene['id']}.mp3")
    subtitle = (
        # 'caption' mode needs a size to wrap text; pass font= if the default lacks Devanagari glyphs
        TextClip(scene['voiceover'], fontsize=24, method='caption',
                 size=(1000, None), align='center')
        .set_duration(scene['duration'])
        .set_position(('center', 'bottom'))
    )
    return CompositeVideoClip([img, subtitle]).set_audio(audio)

with ThreadPoolExecutor(max_workers=8) as pool:
    clips = list(pool.map(make_clip, all_scenes))

final = concatenate_videoclips(clips, method="compose")
final.write_videofile(
    "output/final_video.mp4",
    codec="libx264",
    fps=24,
    threads=8,
    preset="medium"
)
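make_clip expects one audio file per scene under audio_chunks/; a sketch of producing those slices from the episode audio, assuming each scene carries the start time computed during the Step 4 alignment (as in the sketch there):

import os
from moviepy.editor import AudioFileClip

def slice_audio(audio_path, scenes, out_dir="audio_chunks"):
    """Cut the episode audio into one clip per scene using the Step 4 timings."""
    os.makedirs(out_dir, exist_ok=True)
    full = AudioFileClip(audio_path)
    for scene in scenes:
        start = scene['start']
        end = min(start + scene['duration'], full.duration)
        full.subclip(start, end).write_audiofile(f"{out_dir}/{scene['id']}.mp3")
    full.close()

slice_audio("episode123.mp3", all_scenes)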
6. Tech Stack & Infrastructure
Transcription: AWS Transcribe
LLM & Chunking: OpenAI, Claude, Gemini
Normalization & Utilities: Python 3.10
Image Generation: Leonardo.ai (character referencing)
Video Assembly: MoviePy + FFmpeg
Storage: AWS S3 + DynamoDB (scene metadata)
Orchestration: AWS Step Functions
7. Lessons Learned & Best Practices
Pre-define vocabularies for regional names to boost Transcribe accuracy.
Monitor chunk durations programmatically; don't trust prompts alone (a sketch of such a check follows this list).
Normalize aggressively to avoid subtitle mismatches.
Cache character references in Leonardo to speed up generation.
Parallelize audio slicing and frame creation to handle thousands of scenes per hour.
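For the duration point above, a minimal sketch of the kind of check that catches drifted scenes before rendering (the 12-second limit mirrors the Step 3 constraint):

MAX_SCENE_SECONDS = 12.0

def flag_bad_scenes(scenes):
    """Return scenes that exceed the duration limit or are missing a visual prompt."""
    return [s for s in scenes
            if s['duration'] > MAX_SCENE_SECONDS or not s['visual'].strip()]

for bad in flag_bad_scenes(all_scenes):
    print(f"Re-chunk scene {bad['id']}: {bad['duration']:.1f}s")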
8. Future Improvements
👉 Adaptive scene length: Use audio pacing and emotion to vary shot durations.
👉 Dynamic text styling: Animate subtitles based on sentiment.
👉 Model fine-tuning: Train a custom transcription model for Indian languages.
9. Conclusion
By marrying AWS Transcribe with LLM-powered chunking, Unicode normalization, Leonardo.ai for visuals, and MoviePy for assembly, we’ve built a fully automated pipeline that turns pure audio into polished videos—at scale. Whether you’re a podcast network, an edtech platform, or an entertainment studio, this approach can unlock new audiences and breathe life into your audio archives.
Ready to try it?
Grab your audio.
Follow the steps above.
Share your first AI-powered video!