Most Cited CVPR Poster "performance-cost tradeoff" Papers

5,589 papers found • Page 1 of 28

#1

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li et al.

CVPR 2024highlightarXiv:2310.03744
4359
citations
#2

DETRs Beat YOLOs on Real-time Object Detection

Yian Zhao, Wenyu Lv, Shangliang Xu et al.

CVPR 2024arXiv:2304.08069
2565
citations
#3

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang et al.

CVPR 2024arXiv:2312.14238
2295
citations
#4

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang et al.

CVPR 2024arXiv:2311.16502
1715
citations
#5

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Lihe Yang, Bingyi Kang, Zilong Huang et al.

CVPR 2024arXiv:2401.10891
1479
citations
#6

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

Guanjun Wu, Taoran Yi, Jiemin Fang et al.

CVPR 2024arXiv:2310.08528
1110
citations
#7

VBench: Comprehensive Benchmark Suite for Video Generative Models

Ziqi Huang, Yinan He, Jiashuo Yu et al.

CVPR 2024highlightarXiv:2311.17982
1072
citations
#8

DUSt3R: Geometric 3D Vision Made Easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon et al.

CVPR 2024arXiv:2312.14132
1005
citations
#9

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.

CVPR 2025highlightarXiv:2405.21075
917
citations
#10

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He et al.

CVPR 2024highlightarXiv:2311.17005
902
citations
#11

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen et al.

CVPR 2024arXiv:2308.00692
742
citations
#12

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

Ziyi Yang, Xinyu Gao, Wen Zhou et al.

CVPR 2024arXiv:2309.13101
710
citations
#13

VILA: On Pre-training for Visual Language Models

Ji Lin, Danny Yin, Wei Ping et al.

CVPR 2024arXiv:2312.07533
701
citations
#14

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu

CVPR 2024arXiv:2311.17117
684
citations
#15

YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng, Lin Song, Yixiao Ge et al.

CVPR 2024arXiv:2401.17270
682
citations
#16

Wonder3D: Single Image to 3D using Cross-Domain Diffusion

Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin et al.

CVPR 2024highlightarXiv:2310.15008
672
citations
#17

SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering

Antoine Guédon, Vincent Lepetit

CVPR 2024arXiv:2311.12775
654
citations
#18

CogAgent: A Visual Language Model for GUI Agents

Wenyi Hong, Weihan Wang, Qingsong Lv et al.

CVPR 2024highlightarXiv:2312.08914
629
citations
#19

Mip-Splatting: Alias-free 3D Gaussian Splatting

Zehao Yu, Anpei Chen, Binbin Huang et al.

CVPR 2024arXiv:2311.16493
627
citations
#20

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Tao Lu, Mulin Yu, Linning Xu et al.

CVPR 2024highlightarXiv:2312.00109
620
citations
#21

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye et al.

CVPR 2024highlightarXiv:2311.04257
614
citations
#22

VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev et al.

CVPR 2025arXiv:2503.11651
612
citations
#23

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

CVPR 2024arXiv:2401.06209
593
citations
#24

One-step Diffusion with Distribution Matching Distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang et al.

CVPR 2024arXiv:2311.18828
579
citations
#25

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024arXiv:2401.12168
578
citations
#26

Diffusion Model Alignment Using Direct Preference Optimization

Bram Wallace, Meihua Dang, Rafael Rafailov et al.

CVPR 2024arXiv:2311.12908
561
citations
#27

pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi et al.

CVPR 2024arXiv:2312.12337
516
citations
#28

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Haoxin Chen, Yong Zhang, Xiaodong Cun et al.

CVPR 2024arXiv:2401.09047
512
citations
#29

SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM

Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula et al.

CVPR 2024arXiv:2312.02126
497
citations
#30

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Sicong Leng, Hang Zhang, Guanzheng Chen et al.

CVPR 2024highlightarXiv:2311.16922
487
citations
#31

RepViT: Revisiting Mobile CNN From ViT Perspective

Ao Wang, Hui Chen, Zijia Lin et al.

CVPR 2024arXiv:2307.09283
481
citations
#32

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang et al.

CVPR 2024arXiv:2307.16449
471
citations
#33

Gaussian Splatting SLAM

Hidenobu Matsuki, Riku Murai, Paul Kelly et al.

CVPR 2024highlightarXiv:2312.06741
462
citations
#34

Generative Multimodal Models are In-Context Learners

Quan Sun, Yufeng Cui, Xiaosong Zhang et al.

CVPR 2024arXiv:2312.13286
438
citations
#35

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Bowen Wen, Wei Yang, Jan Kautz et al.

CVPR 2024highlightarXiv:2312.08344
435
citations
#36

Structured 3D Latents for Scalable and Versatile 3D Generation

Jianfeng XIANG, Zelong Lv, Sicheng Xu et al.

CVPR 2025highlightarXiv:2412.01506
434
citations
#37

AnyDoor: Zero-shot Object-level Image Customization

Xi Chen, Lianghua Huang, Yu Liu et al.

CVPR 2024arXiv:2307.09481
415
citations
#38

GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly et al.

CVPR 2024arXiv:2311.03356
411
citations
#39

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Bin Xiao, Haiping Wu, Weijian Xu et al.

CVPR 2024arXiv:2311.06242
409
citations
#40

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

CVPR 2024highlightarXiv:2311.06607
392
citations
#41

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.

CVPR 2024arXiv:2310.14566
392
citations
#42

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang et al.

CVPR 2024highlightarXiv:2311.17911
385
citations
#43

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

Jing Shi, Wei Xiong, Zhe Lin et al.

CVPR 2024arXiv:2304.03411
377
citations
#44

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Chi Yan, Delin Qu, Dong Wang et al.

CVPR 2024highlightarXiv:2311.11700
376
citations
#45

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li et al.

CVPR 2024arXiv:2312.02051
372
citations
#46

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.

CVPR 2025arXiv:2412.14171
371
citations
#47

LangSplat: 3D Language Gaussian Splatting

Minghan Qin, Wanhua Li, Jiawei ZHOU et al.

CVPR 2024highlightarXiv:2312.16084
368
citations
#48

Compact 3D Gaussian Representation for Radiance Field

Joo Chan Lee, Daniel Rho, Xiangyu Sun et al.

CVPR 2024highlightarXiv:2311.13681
366
citations
#49

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Cai Zhang et al.

CVPR 2024highlightarXiv:2311.08046
364
citations
#50

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Tianyu Yu, Yuan Yao, Haoye Zhang et al.

CVPR 2024arXiv:2312.00849
361
citations
#51

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan et al.

CVPR 2024arXiv:2312.07920
355
citations
#52

Analyzing and Improving the Training Dynamics of Diffusion Models

Tero Karras, Miika Aittala, Jaakko Lehtinen et al.

CVPR 2024arXiv:2312.02696
353
citations
#53

Rewrite the Stars

Xu Ma, Xiyang Dai, Yue Bai et al.

CVPR 2024arXiv:2403.19967
352
citations
#54

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2024arXiv:2402.19479
351
citations
#55

V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

CVPR 2024arXiv:2312.14135
345
citations
#56

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani et al.

CVPR 2024arXiv:2311.18259
343
citations
#57

Poly Kernel Inception Network for Remote Sensing Detection

Xinhao Cai, Qiuxia Lai, Yuwei Wang et al.

CVPR 2024arXiv:2403.06258
337
citations
#58

Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

Shijie Zhou, Haoran Chang, Sicheng Jiang et al.

CVPR 2024highlightarXiv:2312.03203
335
citations
#59

Text-to-3D using Gaussian Splatting

Zilong Chen, Feng Wang, Yikai Wang et al.

CVPR 2024arXiv:2309.16585
333
citations
#60

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

Yiwen Chen, Zilong Chen, Chi Zhang et al.

CVPR 2024arXiv:2311.14521
333
citations
#61

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang et al.

CVPR 2024arXiv:2312.02145
332
citations
#62

Splatter Image: Ultra-Fast Single-View 3D Reconstruction

Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi

CVPR 2024arXiv:2312.13150
328
citations
#63

PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics

Tianyi Xie, Zeshun Zong, Yuxing Qiu et al.

CVPR 2024highlightarXiv:2311.12198
328
citations
#64

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew et al.

CVPR 2024arXiv:2311.16498
327
citations
#65

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Zhen Li, Mingdeng Cao, Xintao Wang et al.

CVPR 2024arXiv:2312.04461
327
citations
#66

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer et al.

CVPR 2024arXiv:2311.15826
319
citations
#67

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

Yujun Shi, Chuhui Xue, Jun Hao Liew et al.

CVPR 2024highlightarXiv:2306.14435
314
citations
#68

UniDepth: Universal Monocular Metric Depth Estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis et al.

CVPR 2024highlightarXiv:2403.18913
312
citations
#69

Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.

CVPR 2024arXiv:2303.04761
312
citations
#70

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Yihua Huang, Yangtian Sun, Ziyi Yang et al.

CVPR 2024arXiv:2312.14937
311
citations
#71

Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis

Zhan Li, Zhang Chen, Zhong Li et al.

CVPR 2024arXiv:2312.16812
309
citations
#72

MoMask: Generative Masked Modeling of 3D Human Motions

chuan guo, Yuxuan Mu, Muhammad Gohar Javed et al.

CVPR 2024arXiv:2312.00063
300
citations
#73

Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit et al.

CVPR 2024highlightarXiv:2401.09603
294
citations
#74

ReconFusion: 3D Reconstruction with Diffusion Priors

Rundi Wu, Ben Mildenhall, Philipp Henzler et al.

CVPR 2024arXiv:2312.02981
293
citations
#75

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu et al.

CVPR 2025arXiv:2410.13848
293
citations
#76

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

Minghua Liu, Ruoxi Shi, Linghao Chen et al.

CVPR 2024arXiv:2311.07885
288
citations
#77

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

Yixun Liang, Xin Yang, Jiantao Lin et al.

CVPR 2024highlightarXiv:2311.11284
282
citations
#78

InceptionNeXt: When Inception Meets ConvNeXt

Weihao Yu, Pan Zhou, Shuicheng Yan et al.

CVPR 2024arXiv:2303.16900
280
citations
#79

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

Jiasen Lu, Christopher Clark, Sangho Lee et al.

CVPR 2024highlightarXiv:2312.17172
280
citations
#80

DeepCache: Accelerating Diffusion Models for Free

Xinyin Ma, Gongfan Fang, Xinchao Wang

CVPR 2024arXiv:2312.00858
279
citations
#81

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

Dai Shi

CVPR 2024arXiv:2311.17132
279
citations
#82

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers

Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.

CVPR 2024arXiv:2312.09147
278
citations
#83

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Lu Ling, Yichen Sheng, Zhi Tu et al.

CVPR 2024arXiv:2312.16256
277
citations
#84

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Rongyuan Wu, Tao Yang, Lingchen Sun et al.

CVPR 2024arXiv:2311.16518
274
citations
#85

OmniGen: Unified Image Generation

Shitao Xiao, Yueze Wang, Junjie Zhou et al.

CVPR 2025arXiv:2409.11340
271
citations
#86

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Ali Hatamizadeh, Jan Kautz

CVPR 2025arXiv:2407.08083
264
citations
#87

Reconstructing Hands in 3D with Transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic et al.

CVPR 2024arXiv:2312.05251
258
citations
#88

VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang, Xin Wang, Hong Chen et al.

CVPR 2024highlightarXiv:2311.18445
257
citations
#89

On Scaling Up a Multilingual Vision and Language Model

Xi Chen, Josip Djolonga, Piotr Padlewski et al.

CVPR 2024arXiv:2305.18565
256
citations
#90

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

Yuqi Wang, Jiawei He, Lue Fan et al.

CVPR 2024arXiv:2311.17918
255
citations
#91

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Hao Shao, Yuxuan Hu, Letian Wang et al.

CVPR 2024arXiv:2312.07488
251
citations
#92

Continuous 3D Perception Model with Persistent State

Qianqian Wang, Yifei Zhang, Aleksander Holynski et al.

CVPR 2025arXiv:2501.12387
250
citations
#93

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

Shelly Sheynin, Adam Polyak, Uriel Singer et al.

CVPR 2024highlightarXiv:2311.10089
250
citations
#94

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Yaofang Liu, Xiaodong Cun, Xuebo Liu et al.

CVPR 2024arXiv:2310.11440
248
citations
#95

RoMa: Robust Dense Feature Matching

Johan Edstedt, Qiyu Sun, Georg Bökman et al.

CVPR 2024arXiv:2305.15404
248
citations
#96

An Edit Friendly DDPM Noise Space: Inversion and Manipulations

Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli

CVPR 2024arXiv:2304.06140
247
citations
#97

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Taoran Yi, Jiemin Fang, Junjie Wang et al.

CVPR 2024arXiv:2310.08529
246
citations
#98

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.

CVPR 2024highlightarXiv:2312.00863
246
citations
#99

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim et al.

CVPR 2025arXiv:2503.22020
245
citations
#100

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang et al.

CVPR 2024arXiv:2312.10115
244
citations
#101

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition

Xiaohan Ding, Yiyuan Zhang, Yixiao Ge et al.

CVPR 2024arXiv:2311.15599
243
citations
#102

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

Jiahe Li, Jiawei Zhang, Xiao Bai et al.

CVPR 2024arXiv:2403.06912
242
citations
#103

GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

Shenhan Qian, Tobias Kirschstein, Liam Schoneveld et al.

CVPR 2024highlightarXiv:2312.02069
238
citations
#104

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

Fanghua Yu, Jinjin Gu, Zheyuan Li et al.

CVPR 2024arXiv:2401.13627
237
citations
#105

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.

CVPR 2024arXiv:2312.00785
235
citations
#106

GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces

Yingwenqi Jiang, Jiadong Tu, Yuan Liu et al.

CVPR 2024arXiv:2311.17977
232
citations
#107

HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

Nataniel Ruiz, Yuanzhen Li, Varun Jampani et al.

CVPR 2024arXiv:2307.06949
232
citations
#108

Style Aligned Image Generation via Shared Attention

Amir Hertz, Andrey Voynov, Shlomi Fruchter et al.

CVPR 2024arXiv:2312.02133
230
citations
#109

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang et al.

CVPR 2024
230
citations
#110

FreeU: Free Lunch in Diffusion U-Net

Chenyang Si, Ziqi Huang, Yuming Jiang et al.

CVPR 2024arXiv:2309.11497
227
citations
#111

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

Yufei Wang, Wenhan Yang, Xinyuan Chen et al.

CVPR 2024arXiv:2311.14760
226
citations
#112

MACE: Mass Concept Erasure in Diffusion Models

Shilin Lu, Zilan Wang, Leyang Li et al.

CVPR 2024arXiv:2403.06135
226
citations
#113

Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer

Jiwoo Chung, Sangeek Hyun, Jae-Pil Heo

CVPR 2024highlightarXiv:2312.09008
225
citations
#114

Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis

Simon Niedermayr, Josef Stumpfegger, rüdiger westermann

CVPR 2024arXiv:2401.02436
222
citations
#115

EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation

Md Mostafijur Rahman, Mustafa Munir, Radu Marculescu

CVPR 2024arXiv:2405.06880
221
citations
#116

VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction

Jiaqi Lin, Zhihao Li, Xiao Tang et al.

CVPR 2024arXiv:2402.17427
219
citations
#117

MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers

Yawar Siddiqui, Antonio Alliegro, Alexey Artemov et al.

CVPR 2024highlightarXiv:2311.15475
214
citations
#118

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Kai Yang, Jian Tao, Jiafei Lyu et al.

CVPR 2024arXiv:2311.13231
209
citations
#119

Honeybee: Locality-enhanced Projector for Multimodal LLM

Junbum Cha, Woo-Young Kang, Jonghwan Mun et al.

CVPR 2024highlightarXiv:2312.06742
208
citations
#120

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic et al.

CVPR 2024arXiv:2312.09228
207
citations
#121

Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Jian Han, Jinlai Liu, Yi Jiang et al.

CVPR 2025arXiv:2412.04431
201
citations
#122

OneLLM: One Framework to Align All Modalities with Language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang et al.

CVPR 2024arXiv:2312.03700
201
citations
#123

COLMAP-Free 3D Gaussian Splatting

Yang Fu, Sifei Liu, Amey Kulkarni et al.

CVPR 2024highlightarXiv:2312.07504
201
citations
#124

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

Le Xue, Ning Yu, Shu Zhang et al.

CVPR 2024arXiv:2305.08275
198
citations
#125

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang et al.

CVPR 2024arXiv:2312.02134
197
citations
#126

PixelLM: Pixel Reasoning with Large Multimodal Model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei et al.

CVPR 2024arXiv:2312.02228
197
citations
#127

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

Seokju Cho, Heeseong Shin, Sunghwan Hong et al.

CVPR 2024highlightarXiv:2303.11797
193
citations
#128

MambaOut: Do We Really Need Mamba for Vision?

Weihao Yu, Xinchao Wang

CVPR 2025arXiv:2405.07992
193
citations
#129

Holodeck: Language Guided Generation of 3D Embodied AI Environments

Yue Yang, Fan-Yun Sun, Luca Weihs et al.

CVPR 2024arXiv:2312.09067
192
citations
#130

GS-IR: 3D Gaussian Splatting for Inverse Rendering

Zhihao Liang, Qi Zhang, Ying Feng et al.

CVPR 2024arXiv:2311.16473
191
citations
#131

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

Jiazuo Yu, Yunzhi Zhuge, Lu Zhang et al.

CVPR 2024arXiv:2403.11549
190
citations
#132

Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle

Youtian Lin, Zuozhuo Dai, Siyu Zhu et al.

CVPR 2024highlightarXiv:2312.03431
190
citations
#133

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Bo He, Hengduo Li, Young Kyun Jang et al.

CVPR 2024arXiv:2404.05726
188
citations
#134

Putting the Object Back into Video Object Segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price et al.

CVPR 2024highlightarXiv:2310.12982
185
citations
#135

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Jingfeng Yao, Bin Yang, Xinggang Wang

CVPR 2025arXiv:2501.01423
184
citations
#136

RMT: Retentive Networks Meet Vision Transformers

Qihang Fan, Huaibo Huang, Mingrui Chen et al.

CVPR 2024arXiv:2309.11523
184
citations
#137

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Xiaoqi Li, Mingxu Zhang, Yiran Geng et al.

CVPR 2024arXiv:2312.16217
182
citations
#138

InstanceDiffusion: Instance-level Control for Image Generation

XuDong Wang, Trevor Darrell, Sai Saketh Rambhatla et al.

CVPR 2024arXiv:2402.03290
180
citations
#139

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

Jianing "Jed" Yang, Alexander Sax, Kevin Liang et al.

CVPR 2025arXiv:2501.13928
180
citations
#140

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Jeongho Kim, Gyojung Gu, Minho Park et al.

CVPR 2024arXiv:2312.01725
179
citations
#141

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

CVPR 2024arXiv:2311.18482
177
citations
#142

BioCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens, Jiaman Wu, Matthew Thompson et al.

CVPR 2024arXiv:2311.18803
176
citations
#143

RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection

Ximiao Zhang, Min Xu, Xiuzhuang Zhou

CVPR 2024arXiv:2403.05897
176
citations
#144

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

Zhijing Shao, Wang Zhaolong, Zhuang Li et al.

CVPR 2024arXiv:2403.05087
174
citations
#145

4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

Sherwin Bahmani, Ivan Skorokhodov, Victor Rong et al.

CVPR 2024arXiv:2311.17984
174
citations
#146

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

Yukang Cao, Yan-Pei Cao, Kai Han et al.

CVPR 2024arXiv:2304.00916
174
citations
#147

Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models

Huan Ling, Seung Wook Kim, Antonio Torralba et al.

CVPR 2024highlightarXiv:2312.13763
174
citations
#148

RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

Lingteng Qiu, Guanying Chen, Xiaodong Gu et al.

CVPR 2024highlightarXiv:2311.16918
173
citations
#149

Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras

Huajian Huang, Longwei Li, Hui Cheng et al.

CVPR 2024arXiv:2311.16728
173
citations
#150

Logit Standardization in Knowledge Distillation

Shangquan Sun, Wenqi Ren, Jingzhi Li et al.

CVPR 2024highlightarXiv:2403.01427
172
citations
#151

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Zhiqi Li, Zhiding Yu, Shiyi Lan et al.

CVPR 2024arXiv:2312.03031
172
citations
#152

Compositional Chain-of-Thought Prompting for Large Multimodal Models

Chancharik Mitra, Brandon Huang, Trevor Darrell et al.

CVPR 2024arXiv:2311.17076
171
citations
#153

GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

Shoukang Hu, Tao Hu, Ziwei Liu

CVPR 2024arXiv:2312.02973
170
citations
#154

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Zeyi Sun, Ye Fang, Tong Wu et al.

CVPR 2024arXiv:2312.03818
170
citations
#155

UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs

Yanwu Xu, Yang Zhao, Zhisheng Xiao et al.

CVPR 2024highlightarXiv:2311.09257
170
citations
#156

WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Soyong Shin, Juyong Kim, Eni Halilaj et al.

CVPR 2024arXiv:2312.07531
169
citations
#157

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, Yihao Feng et al.

CVPR 2024arXiv:2303.09618
168
citations
#158

GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

Junjie Wang, Jiemin Fang, Xiaopeng Zhang et al.

CVPR 2024arXiv:2311.16037
168
citations
#159

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Shunyuan Zheng, Boyao ZHOU, Ruizhi Shao et al.

CVPR 2024highlightarXiv:2312.02155
166
citations
#160

Equivariant Multi-Modality Image Fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang et al.

CVPR 2024arXiv:2305.11443
164
citations
#161

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

Bencheng Liao, Shaoyu Chen, haoran yin et al.

CVPR 2025highlightarXiv:2411.15139
164
citations
#162

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.

CVPR 2025arXiv:2403.14773
164
citations
#163

MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision

Ruicheng Wang, Sicheng Xu, Cassie Lee Dai et al.

CVPR 2025arXiv:2410.19115
162
citations
#164

Grounded Text-to-Image Synthesis with Attention Refocusing

Quynh Phung, Songwei Ge, Jia-Bin Huang

CVPR 2024arXiv:2306.05427
162
citations
#165

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Zigang Geng, Binxin Yang, Tiankai Hang et al.

CVPR 2024arXiv:2309.03895
162
citations
#166

Multi-Modal Hallucination Control by Visual Information Grounding

Alessandro Favero, Luca Zancato, Matthew Trager et al.

CVPR 2024arXiv:2403.14003
160
citations
#167

Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering

Zhiwen Yan, Weng Fei Low, Yu Chen et al.

CVPR 2024arXiv:2311.17089
159
citations
#168

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

Wenbo Hu, Xiangjun Gao, Xiaoyu Li et al.

CVPR 2025highlightarXiv:2409.02095
158
citations
#169

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

Haoning Wu, Zicheng Zhang, Erli Zhang et al.

CVPR 2024arXiv:2311.06783
158
citations
#170

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

Yujie Wei, Shiwei Zhang, Zhiwu Qing et al.

CVPR 2024arXiv:2312.04433
158
citations
#171

NVILA: Efficient Frontier Visual Language Models

Zhijian Liu, Ligeng Zhu, Baifeng Shi et al.

CVPR 2025arXiv:2412.04468
157
citations
#172

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin et al.

CVPR 2025arXiv:2405.19209
156
citations
#173

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Mu Cai, Haotian Liu, Siva Mustikovela et al.

CVPR 2024arXiv:2312.00784
155
citations
#174

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025arXiv:2409.14485
155
citations
#175

HUGS: Human Gaussian Splats

Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel et al.

CVPR 2024arXiv:2311.17910
153
citations
#176

Navigation World Models

Amir Bar, Gaoyue Zhou, Danny Tran et al.

CVPR 2025arXiv:2412.03572
151
citations
#177

MMA-Diffusion: MultiModal Attack on Diffusion Models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang et al.

CVPR 2024arXiv:2311.17516
150
citations
#178

Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan, Wentong Li, Jian liu et al.

CVPR 2024arXiv:2312.10032
149
citations
#179

Make Pixels Dance: High-Dynamic Video Generation

Yan Zeng, Guoqiang Wei, Jiani Zheng et al.

CVPR 2024arXiv:2311.10982
149
citations
#180

Infrared Small Target Detection with Scale and Location Sensitivity

Qiankun Liu, Rui Liu, Bolun Zheng et al.

CVPR 2024arXiv:2403.19366
149
citations
#181

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

Tianwei Yin, Qiang Zhang, Richard Zhang et al.

CVPR 2025arXiv:2412.07772
149
citations
#182

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Yuwen Xiong, Zhiqi Li, Yuntao Chen et al.

CVPR 2024highlightarXiv:2401.06197
148
citations
#183

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Yuelang Xu, Benwang Chen, Zhe Li et al.

CVPR 2024
147
citations
#184

Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration

Chen Zhao, Weiling Cai, Chenyu Dong et al.

CVPR 2024arXiv:2311.16845
147
citations
#185

Optimal Transport Aggregation for Visual Place Recognition

Sergio Izquierdo, Javier Civera

CVPR 2024arXiv:2311.15937
146
citations
#186

Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

Yifan Wang, Xingyi He, Sida Peng et al.

CVPR 2024highlightarXiv:2403.04765
146
citations
#187

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

Tong Wu, Guandao Yang, Zhibing Li et al.

CVPR 2024arXiv:2401.04092
146
citations
#188

XFeat: Accelerated Features for Lightweight Image Matching

Guilherme Potje, Felipe Cadar, André Araujo et al.

CVPR 2024arXiv:2404.19174
145
citations
#189

GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.

CVPR 2025highlightarXiv:2503.03751
144
citations
#190

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Yutao Hu, Tianbin, Quanfeng Lu et al.

CVPR 2024arXiv:2402.09181
144
citations
#191

3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos

Jiakai Sun, Han Jiao, Guangyuan Li et al.

CVPR 2024highlightarXiv:2403.01444
143
citations
#192

SpatialTracker: Tracking Any 2D Pixels in 3D Space

Yuxi Xiao, Qianqian Wang, Shangzhan Zhang et al.

CVPR 2024highlightarXiv:2404.04319
143
citations
#193

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang et al.

CVPR 2024highlightarXiv:2312.06739
143
citations
#194

Neural Video Compression with Feature Modulation

Jiahao Li, Bin Li, Yan Lu

CVPR 2024arXiv:2402.17414
143
citations
#195

Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection

Huan Liu, Zichang Tan, Chuangchuang Tan et al.

CVPR 2024arXiv:2312.16649
141
citations
#196

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.

CVPR 2024arXiv:2311.11278
140
citations
#197

DisCo: Disentangled Control for Realistic Human Dance Generation

Tan Wang, Linjie Li, Kevin Lin et al.

CVPR 2024arXiv:2307.00040
139
citations
#198

HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

Hongyu Zhou, Jiahao Shao, Lu Xu et al.

CVPR 2024arXiv:2403.12722
139
citations
#199

Probing the 3D Awareness of Visual Foundation Models

Mohamed El Banani, Amit Raj, Kevis-kokitsi Maninis et al.

CVPR 2024arXiv:2404.08636
138
citations
#200

MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos

Zhengqi Li, Richard Tucker, Forrester Cole et al.

CVPR 2025arXiv:2412.04463
136
citations
PreviousNext