Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

50 citations

Authors

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen

Abstract

Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a 5× training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
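The tuning-free variant described in the abstract reuses the low-resolution result as guidance for the higher-resolution pass. As a rough illustration only, and not the authors' implementation, one plausible way to build such a starting point is to upsample a clean low-resolution latent and noise it to an intermediate step with the standard DDPM forward formula, z_K = sqrt(ᾱ_K)·z_0 + sqrt(1 − ᾱ_K)·ε. The function names, the nearest-neighbour upsampling, the linear beta schedule, and the choice of pivot step below are all assumptions made for this sketch.

```python
import math
import random


def upsample_nearest(latent, factor):
    """Nearest-neighbour upsampling of a 2D grid given as a list of rows."""
    out = []
    for row in latent:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def alpha_bar(step, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) over the first `step` steps.

    Assumes a linear beta schedule (hypothetical choice) and num_steps >= 2.
    """
    prod = 1.0
    for t in range(step):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def make_pivot(low_res_latent, factor, pivot_step, num_steps, rng):
    """Upsample a clean low-res latent, then forward-diffuse it to pivot_step.

    The result could serve as the initial state for denoising at the
    higher resolution, starting from pivot_step instead of pure noise.
    """
    a = alpha_bar(pivot_step, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    upsampled = upsample_nearest(low_res_latent, factor)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row]
            for row in upsampled]


if __name__ == "__main__":
    rng = random.Random(0)
    low_res = [[1.0, 2.0], [3.0, 4.0]]
    pivot = make_pivot(low_res, factor=2, pivot_step=500, num_steps=1000, rng=rng)
    print(len(pivot), len(pivot[0]))
```

A larger pivot step keeps less of the low-resolution structure and leaves more for the high-resolution denoiser to redraw. A smaller one preserves the layout but leaves less room to add detail.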

Citation History

Jan 26, 2026: 49 citations (+49)

Jan 27, 2026: 50 citations (+1)