Gen_shorts

Exploring Multi-modal LLM

An end-to-end pipeline that takes a user prompt and produces a narrated short-form video by chaining together multiple AI models.
The core challenge was orchestrating asynchronous jobs across Stable Diffusion (image generation), a TTS model (narration), and an ffmpeg-based video compositor. Each stage has variable latency, so jobs are queued and tracked using Convex's reactive backend: mutations update a job state machine that the frontend subscribes to in real time, avoiding polling.
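
The state machine a Convex mutation would update might look like the following sketch. The state names and transition table here are illustrative assumptions, not the project's actual schema; the point is that a mutation only accepts legal transitions, and the frontend's reactive subscription picks up each change.

```typescript
// Hypothetical job lifecycle; names are assumptions, not the real schema.
type JobState =
  | "queued"
  | "generating_assets" // script, image, and audio stages
  | "compositing"       // ffmpeg assembly
  | "done"
  | "failed";

// Allowed transitions; a mutation would reject anything else.
const transitions: Record<JobState, JobState[]> = {
  queued: ["generating_assets", "failed"],
  generating_assets: ["compositing", "failed"],
  compositing: ["done", "failed"],
  done: [],
  failed: [],
};

export function advance(current: JobState, next: JobState): JobState {
  if (!transitions[current].includes(next)) {
    throw new Error(`invalid transition ${current} -> ${next}`);
  }
  return next;
}
```

Because the table enumerates every legal edge, a buggy worker cannot, for example, move a finished job back to the queue.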
A single user prompt needs to be split into coherent scene descriptions before image generation. A structured-output LLM call with strict JSON schema validation produces the scene breakdown, with retry logic for malformed responses.
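
A minimal sketch of the validate-and-retry loop, assuming a `callLlm` stand-in for the real structured-output call and a two-field scene shape (both are illustrative, not the project's actual schema):

```typescript
// Hypothetical scene shape; the real schema is not shown in the source.
interface Scene {
  title: string;
  description: string;
}

function parseScenes(raw: string): Scene[] {
  const data = JSON.parse(raw); // throws on malformed JSON
  if (!Array.isArray(data)) throw new Error("expected an array of scenes");
  for (const s of data) {
    if (typeof s.title !== "string" || typeof s.description !== "string") {
      throw new Error("scene missing required string fields");
    }
  }
  return data as Scene[];
}

export async function scenesWithRetry(
  callLlm: () => Promise<string>, // stand-in for the structured-output call
  maxAttempts = 3,
): Promise<Scene[]> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return parseScenes(await callLlm());
    } catch (err) {
      lastError = err; // malformed response: retry the call
    }
  }
  throw lastError;
}
```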
Synchronizing generated images with TTS audio segments required frame-accurate timing. Audio duration is computed first, then used to derive per-scene image display durations before compositing.
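
The duration derivation can be sketched as below. The frame rate and rounding policy are assumptions for illustration; the idea from the text is that audio segment lengths come first and image display times are derived from them, quantized to whole frames so the video track stays aligned with the narration.

```typescript
const FPS = 30; // assumed output frame rate

// Each image is shown for exactly its narration segment's length,
// rounded to a whole number of frames for frame-accurate compositing.
export function sceneDurations(audioSegmentSeconds: number[]): number[] {
  return audioSegmentSeconds.map((sec) => Math.round(sec * FPS) / FPS);
}
```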
Clerk handles authentication and is integrated with Convex via JWT validation on the backend, gating generation jobs behind user identity without running a separate auth service.
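
In Convex, `ctx.auth.getUserIdentity()` resolves to the validated identity (or `null`) after JWT verification. A sketch of the gating check, using a plain type in place of Convex's identity object so it runs standalone:

```typescript
// Simplified stand-in for Convex's UserIdentity object.
interface Identity {
  subject: string; // stable user id from the JWT's `sub` claim
}

// A generation mutation would call this before enqueueing any work.
export function requireUser(identity: Identity | null): string {
  if (identity === null) {
    throw new Error("unauthenticated: generation requires a signed-in user");
  }
  return identity.subject;
}
```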
Multi-modal AI pipeline orchestrated through Convex (BaaS). Script, image, and audio generation run in parallel after auth; the media assembler merges outputs into the final short clip.
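
The fan-out can be sketched with `Promise.all`. In this sketch the script stage runs first (since the scene breakdown feeds the other stages), then image and audio generation run concurrently; all stage functions are hypothetical stand-ins for the real model calls.

```typescript
// Hypothetical stage signatures; the real calls hit SD, TTS, and ffmpeg.
interface Stages {
  script: (prompt: string) => Promise<string[]>;        // scene descriptions
  images: (scenes: string[]) => Promise<string[]>;      // image asset refs
  audio: (scenes: string[]) => Promise<string[]>;       // audio asset refs
  assemble: (images: string[], audio: string[]) => Promise<string>;
}

export async function runPipeline(
  prompt: string,
  stages: Stages,
): Promise<string> {
  const scenes = await stages.script(prompt);
  // Independent stages run concurrently: total latency is the max of the
  // two, not the sum.
  const [images, audio] = await Promise.all([
    stages.images(scenes),
    stages.audio(scenes),
  ]);
  return stages.assemble(images, audio);
}
```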