CVPR

5,589 papers tracked across 2 years

Top Papers in CVPR 2025

View all papers →
#1

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.

CVPR 2025
858
citations
#2

VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev et al.

CVPR 2025
552
citations
#3

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.

CVPR 2025
342
citations
#4

OmniGen: Unified Image Generation

Shitao Xiao, Yueze Wang, Junjie Zhou et al.

CVPR 2025
253
citations
#5

Continuous 3D Perception Model with Persistent State

Qianqian Wang, Yifei Zhang, Aleksander Holynski et al.

CVPR 2025
236
citations
#6

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim et al.

CVPR 2025
203
citations
#7

MambaOut: Do We Really Need Mamba for Vision?

Weihao Yu, Xinchao Wang

CVPR 2025
186
citations
#8

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Jingfeng Yao, Bin Yang, Xinggang Wang

CVPR 2025
159
citations
#9

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.

CVPR 2025
154
citations
#10

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025
144
citations
#11

GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.

CVPR 2025
138
citations
#12

Navigation World Models

Amir Bar, Gaoyue Zhou, Danny Tran et al.

CVPR 2025
136
citations
#13

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao et al.

CVPR 2025
123
citations
#14

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Liao Qu, Huichao Zhang, Yiheng Liu et al.

CVPR 2025
120
citations
#15

WonderWorld: Interactive 3D Scene Generation from a Single Image

Hong-Xing Yu, Haoyi Duan, Charles Herrmann et al.

CVPR 2025
120
citations
#16

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

Tianwei Yin, Qiang Zhang, Richard Zhang et al.

CVPR 2025
119
citations
#17

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.

CVPR 2025
98
citations
#18

Transformers without Normalization

Jiachen Zhu, Xinlei Chen, Kaiming He et al.

CVPR 2025
96
citations
#19

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee et al.

CVPR 2025
96
citations
#20

LLaVA-Critic: Learning to Evaluate Multimodal Models

Tianyi Xiong, Xiyao Wang, Dong Guo et al.

CVPR 2025
95
citations