HARIVO: Harnessing Text-to-Image Models for Video Generation

ECCV 2024. 0 citations; ranked #1582 of 2387 papers.

Abstract

We present a method for building diffusion-based video models from pretrained Text-to-Image (T2I) models that overcomes limitations of existing methods. We propose an architecture that incorporates a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, which ensure realistic and temporally consistent video generation. Our method, built on a frozen StableDiffusion model, simplifies training and allows seamless integration with off-the-shelf models such as ControlNet and DreamBooth. We demonstrate superior performance through extensive experiments and comparisons.

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 28, 2026: 0