Gen_shorts

Exploring Multi-modal LLM

An end-to-end pipeline that takes a user prompt and produces a narrated short-form video by chaining together multiple AI models.
The core challenge was orchestrating asynchronous jobs across Stable Diffusion (image generation), a TTS model (narration), and an ffmpeg-based video compositor. Each stage has variable latency, so jobs are queued and tracked using Convex's reactive backend: mutations update a job state machine that the frontend subscribes to in real time, avoiding polling.
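
The state machine a Convex mutation would update might look like the following sketch. The state names and transition table here are illustrative assumptions, not the project's actual schema; the point is that a mutation only accepts legal transitions, and the frontend's reactive subscription picks up each change.

```typescript
// Hypothetical job lifecycle; names are assumptions, not the real schema.
type JobState =
  | "queued"
  | "generating_assets" // script, image, and audio stages
  | "compositing"       // ffmpeg assembly
  | "done"
  | "failed";

// Allowed transitions; a mutation would reject anything else.
const transitions: Record<JobState, JobState[]> = {
  queued: ["generating_assets", "failed"],
  generating_assets: ["compositing", "failed"],
  compositing: ["done", "failed"],
  done: [],
  failed: [],
};

export function advance(current: JobState, next: JobState): JobState {
  if (!transitions[current].includes(next)) {
    throw new Error(`invalid transition ${current} -> ${next}`);
  }
  return next;
}
```

Because the table enumerates every legal edge, a buggy worker cannot, for example, move a finished job back to the queue.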
A single user prompt needs to be split into coherent scene descriptions before image generation. A structured-output LLM call with strict JSON schema validation produces the scene breakdown, with retry logic for malformed responses.
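
A minimal sketch of the validate-and-retry loop, assuming a `callLlm` stand-in for the real structured-output call and a two-field scene shape (both are illustrative, not the project's actual schema):

```typescript
// Hypothetical scene shape; the real schema is not shown in the source.
interface Scene {
  title: string;
  description: string;
}

function parseScenes(raw: string): Scene[] {
  const data = JSON.parse(raw); // throws on malformed JSON
  if (!Array.isArray(data)) throw new Error("expected an array of scenes");
  for (const s of data) {
    if (typeof s.title !== "string" || typeof s.description !== "string") {
      throw new Error("scene missing required string fields");
    }
  }
  return data as Scene[];
}

export async function scenesWithRetry(
  callLlm: () => Promise<string>, // stand-in for the structured-output call
  maxAttempts = 3,
): Promise<Scene[]> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return parseScenes(await callLlm());
    } catch (err) {
      lastError = err; // malformed response: retry the call
    }
  }
  throw lastError;
}
```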
Synchronizing generated images with TTS audio segments required frame-accurate timing. Audio duration is computed first, then used to derive per-scene image display durations before compositing.
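
The duration derivation can be sketched as below. The frame rate and rounding policy are assumptions for illustration; the idea from the text is that audio segment lengths come first and image display times are derived from them, quantized to whole frames so the video track stays aligned with the narration.

```typescript
const FPS = 30; // assumed output frame rate

// Each image is shown for exactly its narration segment's length,
// rounded to a whole number of frames for frame-accurate compositing.
export function sceneDurations(audioSegmentSeconds: number[]): number[] {
  return audioSegmentSeconds.map((sec) => Math.round(sec * FPS) / FPS);
}
```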
Clerk handles authentication and is integrated with Convex via JWT validation on the backend, gating generation jobs behind user identity without running a separate auth service.
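
In Convex, `ctx.auth.getUserIdentity()` resolves to the validated identity (or `null`) after JWT verification. A sketch of the gating check, using a plain type in place of Convex's identity object so it runs standalone:

```typescript
// Simplified stand-in for Convex's UserIdentity object.
interface Identity {
  subject: string; // stable user id from the JWT's `sub` claim
}

// A generation mutation would call this before enqueueing any work.
export function requireUser(identity: Identity | null): string {
  if (identity === null) {
    throw new Error("unauthenticated: generation requires a signed-in user");
  }
  return identity.subject;
}
```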
Multi-modal AI pipeline orchestrated through Convex (BaaS). Script, image, and audio generation run in parallel after auth; the media assembler merges outputs into the final short clip.
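
The fan-out can be sketched with `Promise.all`. In this sketch the script stage runs first (since the scene breakdown feeds the other stages), then image and audio generation run concurrently; all stage functions are hypothetical stand-ins for the real model calls.

```typescript
// Hypothetical stage signatures; the real calls hit SD, TTS, and ffmpeg.
interface Stages {
  script: (prompt: string) => Promise<string[]>;        // scene descriptions
  images: (scenes: string[]) => Promise<string[]>;      // image asset refs
  audio: (scenes: string[]) => Promise<string[]>;       // audio asset refs
  assemble: (images: string[], audio: string[]) => Promise<string>;
}

export async function runPipeline(
  prompt: string,
  stages: Stages,
): Promise<string> {
  const scenes = await stages.script(prompt);
  // Independent stages run concurrently: total latency is the max of the
  // two, not the sum.
  const [images, audio] = await Promise.all([
    stages.images(scenes),
    stages.audio(scenes),
  ]);
  return stages.assemble(images, audio);
}
```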