Research Alpha Leak - Rising Stars in Research

#1

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.

CVPR 2025

858 citations

#2

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.

CVPR 2025

342 citations

#3

OmniGen: Unified Image Generation

Shitao Xiao, Yueze Wang, Junjie Zhou et al.

CVPR 2025

253 citations

#4

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim et al.

CVPR 2025

203 citations

#5

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Jingfeng Yao, Bin Yang, Xinggang Wang

CVPR 2025

159 citations

#6

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.

CVPR 2025

154 citations

#7

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025

142 citations

#8

GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.

CVPR 2025

138 citations

#9

Navigation World Models

Amir Bar, Gaoyue Zhou, Danny Tran et al.

CVPR 2025

136 citations

#10

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao et al.

CVPR 2025

123 citations

#11

WonderWorld: Interactive 3D Scene Generation from a Single Image

Hong-Xing Yu, Haoyi Duan, Charles Herrmann et al.

CVPR 2025

120 citations

#12

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

Tianwei Yin, Qiang Zhang, Richard Zhang et al.

CVPR 2025

119 citations

#13

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.

CVPR 2025

98 citations

#14

Transformers without Normalization

Jiachen Zhu, Xinlei Chen, Kaiming He et al.

CVPR 2025

96 citations

#15

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee et al.

CVPR 2025

96 citations

#16

DEIM: DETR with Improved Matching for Fast Convergence

Shihua Huang, Zhichao Lu, Xiaodong Cun et al.

CVPR 2025

93 citations

#17

FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu et al.

CVPR 2025

92 citations

#18

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao et al.

CVPR 2025

89 citations

#19

RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

Yuheng Ji, Huajie Tan, Jiayu Shi et al.

CVPR 2025

89 citations

#20

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

Guosheng Zhao, Chaojun Ni, Xiaofeng Wang et al.

CVPR 2025

83 citations
