🧬 Generative Models

Video Generation

Generating video content from various inputs

100 papers · 8,205 total citations
Feb '24 – Jan '26 · 617 papers
Also includes: video generation, video synthesis, text-to-video, video diffusion

Top Papers

#1

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.

ICLR 2025
1,318
citations
#2

VBench: Comprehensive Benchmark Suite for Video Generative Models

Ziqi Huang, Yinan He, Jiashuo Yu et al.

CVPR 2024
996
citations
#3

ControlVideo: Training-free Controllable Text-to-video Generation

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang et al.

ICLR 2024
331
citations
#4

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Vikram Voleti, Chun-Han Yao, Mark Boss et al.

ECCV 2024
315
citations
#5

Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.

CVPR 2024
309
citations
#6

Photorealistic Video Generation with Diffusion Models

Agrim Gupta, Lijun Yu, Kihyuk Sohn et al.

ECCV 2024
264
citations
#7

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Yaofang Liu, Xiaodong Cun, Xuebo Liu et al.

CVPR 2024
237
citations
#8

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Linrui Tian, Qi Wang, Bang Zhang et al.

ECCV 2024
218
citations
#9

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang et al.

ICLR 2024
209
citations
#10

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.

CVPR 2025
154
citations
#11

VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang, Tianxing Wu, Shuai Yang et al.

CVPR 2024
118
citations
#12

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang et al.

NeurIPS 2025
106
citations
#13

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta et al.

ECCV 2024 · arXiv:2409.18964
Keywords: image-to-video generation, rigid-body physics, physics-grounded generation, image-space dynamics (+4 more)
104
citations
#14

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun, Shuo Chen, Fangfu Liu et al.

ICCV 2025
103
citations
#15

Autoregressive Video Generation without Vector Quantization

Haoge Deng, Ting Pan, Haiwen Diao et al.

ICLR 2025 · arXiv:2412.14169
Keywords: autoregressive video generation, temporal frame prediction, spatial set prediction, gpt-style models (+3 more)
101
citations
#16

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie et al.

ICLR 2025
99
citations
#17

VidToMe: Video Token Merging for Zero-Shot Video Editing

Xirui Li, Chao Ma, Xiaokang Yang et al.

CVPR 2024
89
citations
#18

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Jianwen Jiang, Chao Liang, Jiaqi Yang et al.

ICLR 2025
89
citations
#19

Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.

CVPR 2024
82
citations
#20

General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

CVPR 2024
79
citations
#21

Real-Time Video Generation with Pyramid Attention Broadcast

Xuanlei Zhao, Xiaolong Jin, Kai Wang et al.

ICLR 2025
79
citations
#22

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Haiyang Liu, Zihao Zhu, Giorgio Becherini et al.

CVPR 2024
78
citations
#23

CCEdit: Creative and Controllable Video Editing via Diffusion Models

Ruoyu Feng, Wenming Weng, Yanhui Wang et al.

CVPR 2024
77
citations
#24

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.

ICML 2025
72
citations
#25

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

ICCV 2025
69
citations
#26

History-Guided Video Diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz et al.

ICML 2025
66
citations
#27

One-Minute Video Generation with Test-Time Training

Jiarui Xu, Shihao Han, Karan Dalal et al.

CVPR 2025
65
citations
#28

TC4D: Trajectory-Conditioned Text-to-4D Generation

Sherwin Bahmani, Xian Liu, Wang Yifan et al.

ECCV 2024 · arXiv:2403.17920
Keywords: text-to-4d generation, trajectory-conditioned generation, dynamic 3d scenes, neural representations (+4 more)
64
citations
#29

GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang et al.

ICCV 2025
63
citations
#30

Koala: Key Frame-Conditioned Long Video-LLM

Reuben Tan, Ximeng Sun, Ping Hu et al.

CVPR 2024
62
citations
#31

PEEKABOO: Interactive Video Generation via Masked-Diffusion

Yash Jain, Anshul Nasery, Vibhav Vineet et al.

CVPR 2024
61
citations
#32

Long Context Tuning for Video Generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.

ICCV 2025
56
citations
#33

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Zhiwu Qing, Shiwei Zhang, Jiayu Wang et al.

CVPR 2024
55
citations
#34

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li et al.

ICCV 2025 · arXiv:2502.11079
Keywords: subject-consistent video generation, cross-modal alignment, text-to-video architecture, image-to-video architecture (+4 more)
55
citations
#35

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Jianhong Bai, Menghan Xia, Xintao Wang et al.

ICLR 2025
55
citations
#36

Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models

Lvmin Zhang, Shengqu Cai, Muyang Li et al.

NeurIPS 2025
55
citations
#37

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.

CVPR 2024
55
citations
#38

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Xiang Wang, Shiwei Zhang, Hangjie Yuan et al.

CVPR 2024
53
citations
#39

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao, Haoge Deng et al.

CVPR 2025 · arXiv:2412.06699
Keywords: 3d generation models, multi-view diffusion model, pose-free videos, large-scale video data (+4 more)
49
citations
#40

OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers

Han Liang, Jiacheng Bao, Ruichi Zhang et al.

CVPR 2024
47
citations
#41

GAIA: Zero-shot Talking Avatar Generation

Tianyu He, Junliang Guo, Runyi Yu et al.

ICLR 2024
46
citations
#42

Image Conductor: Precision Control for Interactive Video Synthesis

Yaowei Li, Xintao Wang, Zhaoyang Zhang et al.

AAAI 2025
46
citations
#43

ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

Yongwei Chen, Tengfei Wang, Tong Wu et al.

ECCV 2024 · arXiv:2403.12409
Keywords: 3d asset generation, single-image 3d generation, spatially-aware diffusion guidance, score distillation sampling (+4 more)
45
citations
#44

Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

Shuai Tan, Bin Ji, Ye Pan

AAAI 2024 · arXiv:2403.06365
Keywords: talking head generation, emotion style transfer, art style transfer, audio-driven animation (+4 more)
43
citations
#45

OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Hui Li, Mingwang Xu, Qingkun Su et al.

CVPR 2025
40
citations
#46

Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2025 · arXiv:2501.06187
Keywords: video personalization, diffusion transformer, multi-subject personalization, open-set personalization (+3 more)
40
citations
#47

A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak et al.

CVPR 2024
39
citations
#48

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Koichi Namekata, Sherwin Bahmani, Ziyi Wu et al.

ICLR 2025
39
citations
#49

Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Zhiyuan Yan, Yandan Zhao, Shen Chen et al.

CVPR 2025
39
citations
#50

Trajectory Attention for Fine-grained Video Motion Control

Zeqi Xiao, Wenqi Ouyang, Yifan Zhou et al.

ICLR 2025
38
citations
#51

CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

Bojia Zi, Shihao Zhao, Xianbiao Qi et al.

AAAI 2025
38
citations
#52

Video-Guided Foley Sound Generation with Multimodal Controls

Ziyang Chen, Prem Seetharaman, Bryan Russell et al.

CVPR 2025 · arXiv:2411.17698
Keywords: video-guided sound generation, multimodal conditioning, foley sound synthesis, audio-visual synchronization (+4 more)
38
citations
#53

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

Shanchuan Lin, Ceyuan Yang, Hao He et al.

NeurIPS 2025
37
citations
#54

DragVideo: Interactive Drag-style Video Editing

Yufan Deng, Ruida Wang, Yuhao Zhang et al.

ECCV 2024
36
citations
#55

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Siyuan Huang, Liliang Chen, Pengfei Zhou et al.

NeurIPS 2025
34
citations
#56

FreeVS: Generative View Synthesis on Free Driving Trajectory

Qitai Wang, Lue Fan, Yuqi Wang et al.

ICLR 2025
34
citations
#57

FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis

Ke Fan, Junshu Tang, Weijian Cao et al.

ECCV 2024
34
citations
#58

VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing

Xiangpeng Yang, Linchao Zhu, Hehe Fan et al.

ICLR 2025 · arXiv:2502.17258
Keywords: diffusion models, video editing, attention mechanism, multi-grained editing (+4 more)
31
citations
#59

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Zeyue Tian, Zhaoyang Liu, Ruibin Yuan et al.

CVPR 2025
31
citations
#60

PREGO: Online Mistake Detection in PRocedural EGOcentric Videos

Alessandro Flaborea, Guido M. D'Amely di Melendugno et al.

CVPR 2024
30
citations
#61

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas et al.

ECCV 2024
30
citations
#62

Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

Yifei Xia, Suhan Ling, Fangcheng Fu et al.

ICCV 2025
30
citations
#63

Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

Zhe Kong, Feng Gao, Yong Zhang et al.

NeurIPS 2025 · arXiv:2505.22647
Keywords: audio-driven human animation, talking head generation, talking body generation, multi-person video generation (+3 more)
30
citations
#64

Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

Xiaojuan Wang, Boyang Zhou, Brian Curless et al.

ICLR 2025
29
citations
#65

OmniViD: A Generative Framework for Universal Video Understanding

Junke Wang, Dongdong Chen, Chong Luo et al.

CVPR 2024
29
citations
#66

ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

Zongyi Li, Shujie Hu, Shujie Liu et al.

ICLR 2025
27
citations
#67

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

Shengqu Cai, Duygu Ceylan, Matheus Gadelha et al.

CVPR 2024
26
citations
#68

MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos with Depth Priors

Qingming Liu, Yuan Liu, Jiepeng Wang et al.

ICLR 2025
26
citations
#69

Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

Bojia Zi, Penghui Ruan, Marco Chen et al.

NeurIPS 2025 · arXiv:2502.06734
Keywords: video generation, video editing techniques, inversion-based methods, end-to-end methods (+4 more)
25
citations
#70

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Videos Generation

Xiaofeng Wang, Kang Zhao, Feng Liu et al.

NeurIPS 2025 · arXiv:2411.08380
Keywords: egocentric video generation, video-action dataset, kinematic control, action annotations (+4 more)
25
citations
#71

FineVQ: Fine-Grained User Generated Content Video Quality Assessment

Huiyu Duan, Qiang Hu, Jiarui Wang et al.

CVPR 2025
25
citations
#72

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Shenghai Yuan, Xianyi He, Yufan Deng et al.

NeurIPS 2025
25
citations
#73

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Hanlin Wang, Hao Ouyang, Qiuyu Wang et al.

CVPR 2025 · arXiv:2412.15214
Keywords: image-to-video synthesis, 3d trajectory control, drag-based interaction, video diffusion model (+3 more)
25
citations
#74

AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

Jiarui Wang, Huiyu Duan, Guangtao Zhai et al.

CVPR 2025
24
citations
#75

AnimateAnything: Consistent and Controllable Animation for Video Generation

Guojun Lei, Chi Wang, Rong Zhang et al.

CVPR 2025
24
citations
#76

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Kun Su, Judith Li, Qingqing Huang et al.

AAAI 2024 · arXiv:2305.06594
Keywords: video-to-music generation, autoregressive model, visual-audio correspondence, audio codecs (+4 more)
23
citations
#77

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

Xiang Fan, Anand Bhattad, Ranjay Krishna

ECCV 2024
23
citations
#78

Object-Centric Diffusion for Efficient Video Editing

Kumara Kahatapitiya, Adil Karjauv, Davide Abati et al.

ECCV 2024 · arXiv:2401.05735
Keywords: diffusion-based video editing, object-centric sampling, token merging, computational efficiency (+4 more)
22
citations
#79

OSV: One Step is Enough for High-Quality Image to Video Generation

Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang et al.

CVPR 2025
22
citations
#80

ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan, Volodymyr Mnih, Aleksandra Faust et al.

ICLR 2025
21
citations
#81

MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

Yuechen Zhang, YaoYang Liu, Bin Xia et al.

ICCV 2025
21
citations
#82

VideoMAC: Video Masked Autoencoders Meet ConvNets

Gensheng Pei, Tao Chen, Xiruo Jiang et al.

CVPR 2024
20
citations
#83

Generative Video Propagation

Shaoteng Liu, Tianyu Wang, Jui-Hsien Wang et al.

CVPR 2025
20
citations
#84

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Yushu Wu, Zhixing Zhang, Yanyu Li et al.

CVPR 2025
20
citations
#85

VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li, Philip Torr, Andrea Vedaldi et al.

ICCV 2025
20
citations
#86

Grid Diffusion Models for Text-to-Video Generation

Taegyeong Lee, Soyeong Kwon, Taehwan Kim

CVPR 2024
20
citations
#87

STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen et al.

ICCV 2025
20
citations
#88

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Mengyi Shan, Lu Dong, Yutao Han et al.

ECCV 2024
20
citations
#89

Taming Teacher Forcing for Masked Autoregressive Video Generation

Deyu Zhou, Quan Sun, Yuang Peng et al.

CVPR 2025
19
citations
#90

TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

Hongxiang Zhao, Xingchen Liu, Mutian Xu et al.

CVPR 2025
19
citations
#91

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

Yunhan Yang, Yukun Huang, Xiaoyang Wu et al.

CVPR 2024
19
citations
#92

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Yuchi Wang, Junliang Guo, Jianhong Bai et al.

AAAI 2025
19
citations
#93

AMEGO: Active Memory from long EGOcentric videos

Gabriele Goletto, Tushar Nagarajan, Giuseppe Averta et al.

ECCV 2024
19
citations
#94

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

Haoyu Zhao, Tianyi Lu, Jiaxi Gu et al.

ECCV 2024
18
citations
#95

MoST: Motion Style Transformer Between Diverse Action Contents

Boeun Kim, Jungho Kim, Hyung Jin Chang et al.

CVPR 2024
18
citations
#96

Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior

Chen Guo, Junxuan Li, Yash Kant et al.

CVPR 2025
18
citations
#97

4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo et al.

CVPR 2025
18
citations
#98

Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Zhuoman Liu, Weicai Ye, Yan Luximon et al.

CVPR 2025
17
citations
#99

CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation

Gaojie Lin, Jianwen Jiang, Chao Liang et al.

ICLR 2025
Keywords: audio-driven generation, talking body generation, diffusion models, human animation (+4 more)
17
citations
#100

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Rongyao Fang, Chengqi Duan, Kun Wang et al.

ICCV 2025
17
citations