On April 21, the SkyReels team at Kunlun Wanwei officially launched and open-sourced SkyReels-V2, the world's first infinite-length film generation model built on a diffusion forcing framework. The breakthrough rests on the joint optimization of multimodal large language models, multi-stage pre-training, reinforcement learning, and the diffusion forcing framework.
Over the past year, video generation has advanced significantly on the strength of diffusion models and autoregressive frameworks, yet major challenges remain in prompt following, visual quality, motion dynamics, and video duration. Current systems often sacrifice motion dynamics to improve visual quality, cap video duration to reach higher resolution, and lack shot-aware generation capabilities. These limitations have held back realistic long-video synthesis and professional cinematic styling.
To address these issues, SkyReels-V2 was developed. The model generates 30- and 40-second videos with high motion quality, consistency, and fidelity. Its key innovations include:
- SkyCaptioner-V1: A film-grade video understanding model that fuses general descriptions from multimodal LLMs with shot-language details from sub-expert models through a structured video representation, strengthening prompt-following capabilities. The model is already open-sourced and ready for direct use.
- Motion Preference Optimization: Reinforcement learning on human preference annotations and synthetically distorted samples addresses dynamic distortion and physically implausible motion, while a semi-automated data collection pipeline keeps annotation costs down.
- Efficient Diffusion Forcing Framework: A diffusion forcing post-training method fine-tunes a pre-trained diffusion model rather than training from scratch, cutting training cost, improving generation efficiency, and enabling efficient long-video production.
- Progressive-Resolution Pre-training and Multi-stage Post-training: General datasets, self-collected media, and art resource libraries are integrated, and model performance is raised through progressive-resolution pre-training followed by multi-stage post-training strategies.
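The motion preference training above can be illustrated with a DPO-style pairwise objective, a standard way to learn from "preferred vs. dispreferred" sample pairs such as clean clips versus synthetically distorted ones. Whether SkyReels-V2 uses exactly this loss form is an assumption, and all names below are illustrative:

```python
import numpy as np

def preference_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Illustrative DPO-style preference loss (the exact objective used by
    SkyReels-V2 is an assumption). logp_win / logp_lose are the model's
    log-likelihoods of the preferred and dispreferred video samples;
    ref_logp_* come from a frozen reference model that regularizes the
    update so the policy does not drift too far from its starting point."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log sigmoid(margin): small when the model prefers the "win" sample
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

Minimizing this loss pushes the model to assign relatively higher likelihood to well-behaved motion than to distorted motion, relative to the reference model.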
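At the core of diffusion forcing is giving each frame its own noise level, so earlier frames can be nearly clean while later frames are still noisy; this is what lets a fine-tuned diffusion model be rolled out autoregressively for long videos. A minimal sketch of one such "staircase" schedule (illustrative only; the actual SkyReels-V2 schedule is not specified here):

```python
import numpy as np

def staircase_noise_schedule(num_frames: int, num_steps: int, delay: int = 2) -> np.ndarray:
    """Illustrative per-frame noise schedule for diffusion forcing.
    Frame f starts denoising `delay` steps after frame f-1, so at any
    intermediate step the earlier frames are cleaner than the later ones.
    Returns an array of shape (num_steps, num_frames) with values in
    [0, 1], where 1.0 means pure noise and 0.0 means fully denoised."""
    sched = np.ones((num_steps, num_frames))
    for f in range(num_frames):
        start = f * delay                       # when frame f begins denoising
        denoise_steps = num_steps - start       # steps available to frame f
        for t in range(num_steps):
            progress = max(0, t - start) / max(1, denoise_steps - 1)
            sched[t, f] = max(0.0, 1.0 - progress)
    return sched
```

Because later frames always carry at least as much noise as earlier ones, finished frames can serve as conditioning for the frames still being denoised.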
To evaluate SkyReels-V2's performance comprehensively, SkyReels-Bench was built for human assessment, complemented by automated evaluation on the open-source VBench. The results show SkyReels-V2 outperforming baseline models in instruction following, motion quality, consistency, and visual quality.
SkyReels-V2 not only achieves technical breakthroughs but also supports multiple practical application scenarios:
- Story Generation: Produces theoretically infinite-duration videos, with sliding-window methods and stabilization techniques preserving continuity and visual consistency, making it suitable for filmmaking and advertisement creation.
- Image-to-Video Synthesis: Offers two generation paths, fine-tuning a full-sequence text-to-video diffusion model or combining the diffusion forcing model with frame conditioning, both delivering high-quality image-to-video results.
- Camera Director Functionality: Excels at annotating camera movements; sample filtering and fine-tuning experiments improve cinematographic quality, ensuring smooth and diverse camera motion.
- Element-to-Video Generation: The SkyReels-A2 solution introduces multi-element-to-video tasks that compose arbitrary visual elements into coherent, high-fidelity videos, well suited to short dramas, music videos, and virtual e-commerce content creation.
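The sliding-window story generation above can be sketched as a chunked rollout in which the last frames of each chunk condition the next one. The interface below is hypothetical (`generate_chunk` stands in for the model; the real pipeline conditions on latent frames inside the diffusion model), but it shows why duration is bounded only by compute:

```python
def sliding_window_rollout(generate_chunk, prompt, num_chunks, overlap):
    """Chunked long-video rollout (illustrative). `generate_chunk(prompt,
    context)` is a hypothetical stand-in for the video model: it returns a
    chunk whose first len(context) frames reproduce the conditioning
    frames. Overlapping consecutive chunks keeps motion and appearance
    continuous across chunk boundaries."""
    video = generate_chunk(prompt, [])        # first chunk: no conditioning
    for _ in range(num_chunks - 1):
        context = video[-overlap:]            # sliding window of finished frames
        chunk = generate_chunk(prompt, context)
        video.extend(chunk[overlap:])         # keep only the newly generated frames
    return video
```

Since each iteration conditions only on a fixed-size window rather than the whole history, memory use per step stays constant, which is what makes "theoretically infinite" duration possible.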
The launch of SkyReels-V2 marks a new phase for video generation technology, offering an entirely new path to high-quality, long-duration, cinematic-style video. The SkyReels team at Kunlun Wanwei has fully open-sourced SkyCaptioner-V1 and the SkyReels-V2 model series in multiple sizes, supporting research and applications across academia and industry. Moving forward, the team will continue optimizing SkyReels-V2's performance, exploring more application scenarios, reducing computational costs, and driving the widespread adoption of video generation technology.