Part 3/12:
Users can now upload a reference image and describe their scene to receive 5- to 10-second clips rendered at 720p, in which subjects stay coherent throughout. Smooth camera motion, multi-shot consistency, and character uniformity mark a leap forward from generating isolated clips to building complete narratives.
The technical upgrade stems from a shift in processing. Earlier models treated frames independently, which resulted in flickering or morphing. Gen 4 instead approaches video as a unified scene, using an internal model that retains visual information across frames. This method draws on the concept of world modeling, which strengthens the model's temporal handling and its reference conditioning.
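To make the difference between per-frame and scene-level processing concrete, here is a toy sketch in Python. It is not Gen 4's architecture, which is not described here. It only illustrates why frames generated independently flicker while frames derived from a state carried across time do not. The function names, the `drift` parameter, and the "flicker" metric are all hypothetical, chosen for illustration.

```python
import random


def independent_frames(n_frames, size, seed=0):
    """Toy stand-in for per-frame generation: every frame is sampled
    from scratch, with no memory of the previous frame."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(size)] for _ in range(n_frames)]


def shared_state_frames(n_frames, size, seed=0, drift=0.02):
    """Toy stand-in for scene-level generation: one state is carried
    from frame to frame and only nudged slightly at each step.
    `drift` is an illustrative assumption, not a real model setting."""
    rng = random.Random(seed)
    state = [rng.random() for _ in range(size)]
    frames = []
    for _ in range(n_frames):
        # Each value moves by at most `drift`, so consecutive frames stay close.
        state = [v + rng.uniform(-drift, drift) for v in state]
        frames.append(list(state))
    return frames


def mean_flicker(frames):
    """Average absolute change per value between consecutive frames."""
    total = 0.0
    count = 0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur))
        count += len(cur)
    return total / count


if __name__ == "__main__":
    print("independent:", mean_flicker(independent_frames(24, 64)))
    print("shared state:", mean_flicker(shared_state_frames(24, 64)))
```

Under these assumptions, the shared-state sequence changes far less from frame to frame than the independent one, which is the intuition behind retaining visual information across frames.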