Most Cited CVPR 2024 "correctness" Papers

2,716 papers found • Page 1 of 14

#1

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li et al.

CVPR 2024highlightarXiv:2310.03744
4359
citations
#2

DETRs Beat YOLOs on Real-time Object Detection

Yian Zhao, Wenyu Lv, Shangliang Xu et al.

CVPR 2024arXiv:2304.08069
2565
citations
#3

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang et al.

CVPR 2024arXiv:2312.14238
2295
citations
#4

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang et al.

CVPR 2024arXiv:2311.16502
1715
citations
#5

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Lihe Yang, Bingyi Kang, Zilong Huang et al.

CVPR 2024arXiv:2401.10891
1479
citations
#6

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

Guanjun Wu, Taoran Yi, Jiemin Fang et al.

CVPR 2024arXiv:2310.08528
1110
citations
#7

VBench: Comprehensive Benchmark Suite for Video Generative Models

Ziqi Huang, Yinan He, Jiashuo Yu et al.

CVPR 2024highlightarXiv:2311.17982
1072
citations
#8

DUSt3R: Geometric 3D Vision Made Easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon et al.

CVPR 2024arXiv:2312.14132
1005
citations
#9

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He et al.

CVPR 2024highlightarXiv:2311.17005
902
citations
#10

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen et al.

CVPR 2024arXiv:2308.00692
742
citations
#11

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

Ziyi Yang, Xinyu Gao, Wen Zhou et al.

CVPR 2024arXiv:2309.13101
710
citations
#12

VILA: On Pre-training for Visual Language Models

Ji Lin, Danny Yin, Wei Ping et al.

CVPR 2024arXiv:2312.07533
701
citations
#13

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu

CVPR 2024arXiv:2311.17117
684
citations
#14

YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng, Lin Song, Yixiao Ge et al.

CVPR 2024arXiv:2401.17270
682
citations
#15

Wonder3D: Single Image to 3D using Cross-Domain Diffusion

Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin et al.

CVPR 2024highlightarXiv:2310.15008
672
citations
#16

SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering

Antoine Guédon, Vincent Lepetit

CVPR 2024arXiv:2311.12775
654
citations
#17

CogAgent: A Visual Language Model for GUI Agents

Wenyi Hong, Weihan Wang, Qingsong Lv et al.

CVPR 2024highlightarXiv:2312.08914
629
citations
#18

Mip-Splatting: Alias-free 3D Gaussian Splatting

Zehao Yu, Anpei Chen, Binbin Huang et al.

CVPR 2024arXiv:2311.16493
627
citations
#19

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Tao Lu, Mulin Yu, Linning Xu et al.

CVPR 2024highlightarXiv:2312.00109
620
citations
#20

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye et al.

CVPR 2024highlightarXiv:2311.04257
614
citations
#21

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

CVPR 2024arXiv:2401.06209
593
citations
#22

One-step Diffusion with Distribution Matching Distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang et al.

CVPR 2024arXiv:2311.18828
579
citations
#23

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024arXiv:2401.12168
578
citations
#24

Diffusion Model Alignment Using Direct Preference Optimization

Bram Wallace, Meihua Dang, Rafael Rafailov et al.

CVPR 2024arXiv:2311.12908
561
citations
#25

pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi et al.

CVPR 2024arXiv:2312.12337
516
citations
#26

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Haoxin Chen, Yong Zhang, Xiaodong Cun et al.

CVPR 2024arXiv:2401.09047
512
citations
#27

SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM

Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula et al.

CVPR 2024arXiv:2312.02126
497
citations
#28

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Sicong Leng, Hang Zhang, Guanzheng Chen et al.

CVPR 2024highlightarXiv:2311.16922
487
citations
#29

RepViT: Revisiting Mobile CNN From ViT Perspective

Ao Wang, Hui Chen, Zijia Lin et al.

CVPR 2024arXiv:2307.09283
481
citations
#30

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang et al.

CVPR 2024arXiv:2307.16449
471
citations
#31

Gaussian Splatting SLAM

Hidenobu Matsuki, Riku Murai, Paul Kelly et al.

CVPR 2024highlightarXiv:2312.06741
462
citations
#32

Generative Multimodal Models are In-Context Learners

Quan Sun, Yufeng Cui, Xiaosong Zhang et al.

CVPR 2024arXiv:2312.13286
438
citations
#33

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Bowen Wen, Wei Yang, Jan Kautz et al.

CVPR 2024highlightarXiv:2312.08344
435
citations
#34

AnyDoor: Zero-shot Object-level Image Customization

Xi Chen, Lianghua Huang, Yu Liu et al.

CVPR 2024arXiv:2307.09481
415
citations
#35

GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly et al.

CVPR 2024arXiv:2311.03356
411
citations
#36

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Bin Xiao, Haiping Wu, Weijian Xu et al.

CVPR 2024arXiv:2311.06242
409
citations
#37

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

CVPR 2024highlightarXiv:2311.06607
392
citations
#38

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.

CVPR 2024arXiv:2310.14566
392
citations
#39

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang et al.

CVPR 2024highlightarXiv:2311.17911
385
citations
#40

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

Jing Shi, Wei Xiong, Zhe Lin et al.

CVPR 2024arXiv:2304.03411
377
citations
#41

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Chi Yan, Delin Qu, Dong Wang et al.

CVPR 2024highlightarXiv:2311.11700
376
citations
#42

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li et al.

CVPR 2024arXiv:2312.02051
372
citations
#43

LangSplat: 3D Language Gaussian Splatting

Minghan Qin, Wanhua Li, Jiawei ZHOU et al.

CVPR 2024highlightarXiv:2312.16084
368
citations
#44

Compact 3D Gaussian Representation for Radiance Field

Joo Chan Lee, Daniel Rho, Xiangyu Sun et al.

CVPR 2024highlightarXiv:2311.13681
366
citations
#45

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Cai Zhang et al.

CVPR 2024highlightarXiv:2311.08046
364
citations
#46

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Tianyu Yu, Yuan Yao, Haoye Zhang et al.

CVPR 2024arXiv:2312.00849
361
citations
#47

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan et al.

CVPR 2024arXiv:2312.07920
355
citations
#48

Analyzing and Improving the Training Dynamics of Diffusion Models

Tero Karras, Miika Aittala, Jaakko Lehtinen et al.

CVPR 2024arXiv:2312.02696
353
citations
#49

Rewrite the Stars

Xu Ma, Xiyang Dai, Yue Bai et al.

CVPR 2024arXiv:2403.19967
352
citations
#50

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2024arXiv:2402.19479
351
citations
#51

V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

CVPR 2024arXiv:2312.14135
345
citations
#52

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani et al.

CVPR 2024arXiv:2311.18259
343
citations
#53

Poly Kernel Inception Network for Remote Sensing Detection

Xinhao Cai, Qiuxia Lai, Yuwei Wang et al.

CVPR 2024arXiv:2403.06258
337
citations
#54

Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

Shijie Zhou, Haoran Chang, Sicheng Jiang et al.

CVPR 2024highlightarXiv:2312.03203
335
citations
#55

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

Yiwen Chen, Zilong Chen, Chi Zhang et al.

CVPR 2024arXiv:2311.14521
333
citations
#56

Text-to-3D using Gaussian Splatting

Zilong Chen, Feng Wang, Yikai Wang et al.

CVPR 2024arXiv:2309.16585
333
citations
#57

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang et al.

CVPR 2024arXiv:2312.02145
332
citations
#58

Splatter Image: Ultra-Fast Single-View 3D Reconstruction

Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi

CVPR 2024arXiv:2312.13150
328
citations
#59

PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics

Tianyi Xie, Zeshun Zong, Yuxing Qiu et al.

CVPR 2024highlightarXiv:2311.12198
328
citations
#60

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Zhen Li, Mingdeng Cao, Xintao Wang et al.

CVPR 2024arXiv:2312.04461
327
citations
#61

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew et al.

CVPR 2024arXiv:2311.16498
327
citations
#62

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer et al.

CVPR 2024arXiv:2311.15826
319
citations
#63

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

Yujun Shi, Chuhui Xue, Jun Hao Liew et al.

CVPR 2024highlightarXiv:2306.14435
314
citations
#64

Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.

CVPR 2024arXiv:2303.04761
312
citations
#65

UniDepth: Universal Monocular Metric Depth Estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis et al.

CVPR 2024highlightarXiv:2403.18913
312
citations
#66

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Yihua Huang, Yangtian Sun, Ziyi Yang et al.

CVPR 2024arXiv:2312.14937
311
citations
#67

Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis

Zhan Li, Zhang Chen, Zhong Li et al.

CVPR 2024arXiv:2312.16812
309
citations
#68

MoMask: Generative Masked Modeling of 3D Human Motions

chuan guo, Yuxuan Mu, Muhammad Gohar Javed et al.

CVPR 2024arXiv:2312.00063
300
citations
#69

Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit et al.

CVPR 2024highlightarXiv:2401.09603
294
citations
#70

ReconFusion: 3D Reconstruction with Diffusion Priors

Rundi Wu, Ben Mildenhall, Philipp Henzler et al.

CVPR 2024arXiv:2312.02981
293
citations
#71

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

Minghua Liu, Ruoxi Shi, Linghao Chen et al.

CVPR 2024arXiv:2311.07885
288
citations
#72

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

Yixun Liang, Xin Yang, Jiantao Lin et al.

CVPR 2024highlightarXiv:2311.11284
282
citations
#73

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

Jiasen Lu, Christopher Clark, Sangho Lee et al.

CVPR 2024highlightarXiv:2312.17172
280
citations
#74

InceptionNeXt: When Inception Meets ConvNeXt

Weihao Yu, Pan Zhou, Shuicheng Yan et al.

CVPR 2024arXiv:2303.16900
280
citations
#75

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

Dai Shi

CVPR 2024arXiv:2311.17132
279
citations
#76

DeepCache: Accelerating Diffusion Models for Free

Xinyin Ma, Gongfan Fang, Xinchao Wang

CVPR 2024arXiv:2312.00858
279
citations
#77

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers

Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.

CVPR 2024arXiv:2312.09147
278
citations
#78

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Lu Ling, Yichen Sheng, Zhi Tu et al.

CVPR 2024arXiv:2312.16256
277
citations
#79

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Rongyuan Wu, Tao Yang, Lingchen Sun et al.

CVPR 2024arXiv:2311.16518
274
citations
#80

Reconstructing Hands in 3D with Transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic et al.

CVPR 2024arXiv:2312.05251
258
citations
#81

VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang, Xin Wang, Hong Chen et al.

CVPR 2024highlightarXiv:2311.18445
257
citations
#82

On Scaling Up a Multilingual Vision and Language Model

Xi Chen, Josip Djolonga, Piotr Padlewski et al.

CVPR 2024arXiv:2305.18565
256
citations
#83

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

Yuqi Wang, Jiawei He, Lue Fan et al.

CVPR 2024arXiv:2311.17918
255
citations
#84

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Hao Shao, Yuxuan Hu, Letian Wang et al.

CVPR 2024arXiv:2312.07488
251
citations
#85

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

Shelly Sheynin, Adam Polyak, Uriel Singer et al.

CVPR 2024highlightarXiv:2311.10089
250
citations
#86

RoMa: Robust Dense Feature Matching

Johan Edstedt, Qiyu Sun, Georg Bökman et al.

CVPR 2024arXiv:2305.15404
248
citations
#87

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Yaofang Liu, Xiaodong Cun, Xuebo Liu et al.

CVPR 2024arXiv:2310.11440
248
citations
#88

An Edit Friendly DDPM Noise Space: Inversion and Manipulations

Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli

CVPR 2024arXiv:2304.06140
247
citations
#89

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.

CVPR 2024highlightarXiv:2312.00863
246
citations
#90

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Taoran Yi, Jiemin Fang, Junjie Wang et al.

CVPR 2024arXiv:2310.08529
246
citations
#91

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang et al.

CVPR 2024arXiv:2312.10115
244
citations
#92

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition

Xiaohan Ding, Yiyuan Zhang, Yixiao Ge et al.

CVPR 2024arXiv:2311.15599
243
citations
#93

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

Jiahe Li, Jiawei Zhang, Xiao Bai et al.

CVPR 2024arXiv:2403.06912
242
citations
#94

GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

Shenhan Qian, Tobias Kirschstein, Liam Schoneveld et al.

CVPR 2024highlightarXiv:2312.02069
238
citations
#95

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

Fanghua Yu, Jinjin Gu, Zheyuan Li et al.

CVPR 2024arXiv:2401.13627
237
citations
#96

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.

CVPR 2024arXiv:2312.00785
235
citations
#97

HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

Nataniel Ruiz, Yuanzhen Li, Varun Jampani et al.

CVPR 2024arXiv:2307.06949
232
citations
#98

GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces

Yingwenqi Jiang, Jiadong Tu, Yuan Liu et al.

CVPR 2024arXiv:2311.17977
232
citations
#99

Style Aligned Image Generation via Shared Attention

Amir Hertz, Andrey Voynov, Shlomi Fruchter et al.

CVPR 2024arXiv:2312.02133
230
citations
#100

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang et al.

CVPR 2024
230
citations
#101

FreeU: Free Lunch in Diffusion U-Net

Chenyang Si, Ziqi Huang, Yuming Jiang et al.

CVPR 2024arXiv:2309.11497
227
citations
#102

MACE: Mass Concept Erasure in Diffusion Models

Shilin Lu, Zilan Wang, Leyang Li et al.

CVPR 2024arXiv:2403.06135
226
citations
#103

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

Yufei Wang, Wenhan Yang, Xinyuan Chen et al.

CVPR 2024arXiv:2311.14760
226
citations
#104

Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer

Jiwoo Chung, Sangeek Hyun, Jae-Pil Heo

CVPR 2024highlightarXiv:2312.09008
225
citations
#105

Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis

Simon Niedermayr, Josef Stumpfegger, rüdiger westermann

CVPR 2024arXiv:2401.02436
222
citations
#106

EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation

Md Mostafijur Rahman, Mustafa Munir, Radu Marculescu

CVPR 2024arXiv:2405.06880
221
citations
#107

VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction

Jiaqi Lin, Zhihao Li, Xiao Tang et al.

CVPR 2024arXiv:2402.17427
219
citations
#108

MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers

Yawar Siddiqui, Antonio Alliegro, Alexey Artemov et al.

CVPR 2024highlightarXiv:2311.15475
214
citations
#109

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Kai Yang, Jian Tao, Jiafei Lyu et al.

CVPR 2024arXiv:2311.13231
209
citations
#110

Honeybee: Locality-enhanced Projector for Multimodal LLM

Junbum Cha, Woo-Young Kang, Jonghwan Mun et al.

CVPR 2024highlightarXiv:2312.06742
208
citations
#111

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic et al.

CVPR 2024arXiv:2312.09228
207
citations
#112

COLMAP-Free 3D Gaussian Splatting

Yang Fu, Sifei Liu, Amey Kulkarni et al.

CVPR 2024highlightarXiv:2312.07504
201
citations
#113

OneLLM: One Framework to Align All Modalities with Language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang et al.

CVPR 2024arXiv:2312.03700
201
citations
#114

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

Le Xue, Ning Yu, Shu Zhang et al.

CVPR 2024arXiv:2305.08275
198
citations
#115

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang et al.

CVPR 2024arXiv:2312.02134
197
citations
#116

PixelLM: Pixel Reasoning with Large Multimodal Model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei et al.

CVPR 2024arXiv:2312.02228
197
citations
#117

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

Seokju Cho, Heeseong Shin, Sunghwan Hong et al.

CVPR 2024highlightarXiv:2303.11797
193
citations
#118

Holodeck: Language Guided Generation of 3D Embodied AI Environments

Yue Yang, Fan-Yun Sun, Luca Weihs et al.

CVPR 2024arXiv:2312.09067
192
citations
#119

GS-IR: 3D Gaussian Splatting for Inverse Rendering

Zhihao Liang, Qi Zhang, Ying Feng et al.

CVPR 2024arXiv:2311.16473
191
citations
#120

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

Jiazuo Yu, Yunzhi Zhuge, Lu Zhang et al.

CVPR 2024arXiv:2403.11549
190
citations
#121

Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle

Youtian Lin, Zuozhuo Dai, Siyu Zhu et al.

CVPR 2024highlightarXiv:2312.03431
190
citations
#122

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Bo He, Hengduo Li, Young Kyun Jang et al.

CVPR 2024arXiv:2404.05726
188
citations
#123

Putting the Object Back into Video Object Segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price et al.

CVPR 2024highlightarXiv:2310.12982
185
citations
#124

RMT: Retentive Networks Meet Vision Transformers

Qihang Fan, Huaibo Huang, Mingrui Chen et al.

CVPR 2024arXiv:2309.11523
184
citations
#125

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Xiaoqi Li, Mingxu Zhang, Yiran Geng et al.

CVPR 2024arXiv:2312.16217
182
citations
#126

InstanceDiffusion: Instance-level Control for Image Generation

XuDong Wang, Trevor Darrell, Sai Saketh Rambhatla et al.

CVPR 2024arXiv:2402.03290
180
citations
#127

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Jeongho Kim, Gyojung Gu, Minho Park et al.

CVPR 2024arXiv:2312.01725
179
citations
#128

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

CVPR 2024arXiv:2311.18482
177
citations
#129

RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection

Ximiao Zhang, Min Xu, Xiuzhuang Zhou

CVPR 2024arXiv:2403.05897
176
citations
#130

BioCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens, Jiaman Wu, Matthew Thompson et al.

CVPR 2024arXiv:2311.18803
176
citations
#131

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

Zhijing Shao, Wang Zhaolong, Zhuang Li et al.

CVPR 2024arXiv:2403.05087
174
citations
#132

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

Yukang Cao, Yan-Pei Cao, Kai Han et al.

CVPR 2024arXiv:2304.00916
174
citations
#133

4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

Sherwin Bahmani, Ivan Skorokhodov, Victor Rong et al.

CVPR 2024arXiv:2311.17984
174
citations
#134

Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models

Huan Ling, Seung Wook Kim, Antonio Torralba et al.

CVPR 2024highlightarXiv:2312.13763
174
citations
#135

RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

Lingteng Qiu, Guanying Chen, Xiaodong Gu et al.

CVPR 2024highlightarXiv:2311.16918
173
citations
#136

Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras

Huajian Huang, Longwei Li, Hui Cheng et al.

CVPR 2024arXiv:2311.16728
173
citations
#137

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Zhiqi Li, Zhiding Yu, Shiyi Lan et al.

CVPR 2024arXiv:2312.03031
172
citations
#138

Logit Standardization in Knowledge Distillation

Shangquan Sun, Wenqi Ren, Jingzhi Li et al.

CVPR 2024highlightarXiv:2403.01427
172
citations
#139

Compositional Chain-of-Thought Prompting for Large Multimodal Models

Chancharik Mitra, Brandon Huang, Trevor Darrell et al.

CVPR 2024arXiv:2311.17076
171
citations
#140

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Zeyi Sun, Ye Fang, Tong Wu et al.

CVPR 2024arXiv:2312.03818
170
citations
#141

GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

Shoukang Hu, Tao Hu, Ziwei Liu

CVPR 2024arXiv:2312.02973
170
citations
#142

UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs

Yanwu Xu, Yang Zhao, Zhisheng Xiao et al.

CVPR 2024highlightarXiv:2311.09257
170
citations
#143

WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Soyong Shin, Juyong Kim, Eni Halilaj et al.

CVPR 2024arXiv:2312.07531
169
citations
#144

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, Yihao Feng et al.

CVPR 2024arXiv:2303.09618
168
citations
#145

GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

Junjie Wang, Jiemin Fang, Xiaopeng Zhang et al.

CVPR 2024arXiv:2311.16037
168
citations
#146

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Shunyuan Zheng, Boyao ZHOU, Ruizhi Shao et al.

CVPR 2024highlightarXiv:2312.02155
166
citations
#147

Equivariant Multi-Modality Image Fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang et al.

CVPR 2024arXiv:2305.11443
164
citations
#148

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Zigang Geng, Binxin Yang, Tiankai Hang et al.

CVPR 2024arXiv:2309.03895
162
citations
#149

Grounded Text-to-Image Synthesis with Attention Refocusing

Quynh Phung, Songwei Ge, Jia-Bin Huang

CVPR 2024arXiv:2306.05427
162
citations
#150

Multi-Modal Hallucination Control by Visual Information Grounding

Alessandro Favero, Luca Zancato, Matthew Trager et al.

CVPR 2024arXiv:2403.14003
160
citations
#151

Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering

Zhiwen Yan, Weng Fei Low, Yu Chen et al.

CVPR 2024arXiv:2311.17089
159
citations
#152

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

Yujie Wei, Shiwei Zhang, Zhiwu Qing et al.

CVPR 2024arXiv:2312.04433
158
citations
#153

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

Haoning Wu, Zicheng Zhang, Erli Zhang et al.

CVPR 2024arXiv:2311.06783
158
citations
#154

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Mu Cai, Haotian Liu, Siva Mustikovela et al.

CVPR 2024arXiv:2312.00784
155
citations
#155

HUGS: Human Gaussian Splats

Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel et al.

CVPR 2024arXiv:2311.17910
153
citations
#156

MMA-Diffusion: MultiModal Attack on Diffusion Models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang et al.

CVPR 2024arXiv:2311.17516
150
citations
#157

Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan, Wentong Li, Jian liu et al.

CVPR 2024arXiv:2312.10032
149
citations
#158

Make Pixels Dance: High-Dynamic Video Generation

Yan Zeng, Guoqiang Wei, Jiani Zheng et al.

CVPR 2024arXiv:2311.10982
149
citations
#159

Infrared Small Target Detection with Scale and Location Sensitivity

Qiankun Liu, Rui Liu, Bolun Zheng et al.

CVPR 2024arXiv:2403.19366
149
citations
#160

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Yuwen Xiong, Zhiqi Li, Yuntao Chen et al.

CVPR 2024highlightarXiv:2401.06197
148
citations
#161

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Yuelang Xu, Benwang Chen, Zhe Li et al.

CVPR 2024
147
citations
#162

Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration

Chen Zhao, Weiling Cai, Chenyu Dong et al.

CVPR 2024arXiv:2311.16845
147
citations
#163

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

Tong Wu, Guandao Yang, Zhibing Li et al.

CVPR 2024arXiv:2401.04092
146
citations
#164

Optimal Transport Aggregation for Visual Place Recognition

Sergio Izquierdo, Javier Civera

CVPR 2024arXiv:2311.15937
146
citations
#165

Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

Yifan Wang, Xingyi He, Sida Peng et al.

CVPR 2024highlightarXiv:2403.04765
146
citations
#166

XFeat: Accelerated Features for Lightweight Image Matching

Guilherme Potje, Felipe Cadar, André Araujo et al.

CVPR 2024arXiv:2404.19174
145
citations
#167

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Yutao Hu, Tianbin, Quanfeng Lu et al.

CVPR 2024arXiv:2402.09181
144
citations
#168

Neural Video Compression with Feature Modulation

Jiahao Li, Bin Li, Yan Lu

CVPR 2024arXiv:2402.17414
143
citations
#169

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang et al.

CVPR 2024highlightarXiv:2312.06739
143
citations
#170

3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos

Jiakai Sun, Han Jiao, Guangyuan Li et al.

CVPR 2024highlightarXiv:2403.01444
143
citations
#171

SpatialTracker: Tracking Any 2D Pixels in 3D Space

Yuxi Xiao, Qianqian Wang, Shangzhan Zhang et al.

CVPR 2024highlightarXiv:2404.04319
143
citations
#172

Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection

Huan Liu, Zichang Tan, Chuangchuang Tan et al.

CVPR 2024arXiv:2312.16649
141
citations
#173

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.

CVPR 2024arXiv:2311.11278
140
citations
#174

HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

Hongyu Zhou, Jiahao Shao, Lu Xu et al.

CVPR 2024arXiv:2403.12722
139
citations
#175

DisCo: Disentangled Control for Realistic Human Dance Generation

Tan Wang, Linjie Li, Kevin Lin et al.

CVPR 2024arXiv:2307.00040
139
citations
#176

Probing the 3D Awareness of Visual Foundation Models

Mohamed El Banani, Amit Raj, Kevis-kokitsi Maninis et al.

CVPR 2024arXiv:2404.08636
138
citations
#177

Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

Mukul Khanna, Yongsen Mao, Hanxiao Jiang et al.

CVPR 2024arXiv:2306.11290
134
citations
#178

Rich Human Feedback for Text-to-Image Generation

Youwei Liang, Junfeng He, Gang Li et al.

CVPR 2024arXiv:2312.10240
134
citations
#179

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Chunlong Xia, Xinliang Wang, Feng Lv et al.

CVPR 2024highlightarXiv:2403.07392
133
citations
#180

XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng et al.

CVPR 2024highlightarXiv:2312.03806
132
citations
#181

VLP: Vision Language Planning for Autonomous Driving

Chenbin Pan, Burhan Yaman, Tommaso Nesti et al.

CVPR 2024arXiv:2401.05577
132
citations
#182

LEDITS++: Limitless Image Editing using Text-to-Image Models

Manuel Brack, Felix Friedrich, Katharina Kornmeier et al.

CVPR 2024arXiv:2311.16711
131
citations
#183

GART: Gaussian Articulated Template Models

Jiahui Lei, Yufu Wang, Georgios Pavlakos et al.

CVPR 2024highlightarXiv:2311.16099
131
citations
#184

Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

Xunpeng Yi, Han Xu, HAO ZHANG et al.

CVPR 2024arXiv:2403.16387
130
citations
#185

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Tai Wang, Xiaohan Mao, Chenming Zhu et al.

CVPR 2024arXiv:2312.16170
130
citations
#186

NeuRAD: Neural Rendering for Autonomous Driving

Adam Tonderski, Carl Lindström, Georg Hess et al.

CVPR 2024highlightarXiv:2311.15260
130
citations
#187

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han et al.

CVPR 2024arXiv:2312.10103
130
citations
#188

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

Qifan Yu, Juncheng Li, Longhui Wei et al.

CVPR 2024arXiv:2311.13614
129
citations
#189

Relightable Gaussian Codec Avatars

Shunsuke Saito, Gabriel Schwartz, Tomas Simon et al.

CVPR 2024arXiv:2312.03704
129
citations
#190

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

Jiehong Lin, lihua liu, Dekun Lu et al.

CVPR 2024arXiv:2311.15707
129
citations
#191

Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

Bingyan Liu, Chengyu Wang, Tingfeng Cao et al.

CVPR 2024arXiv:2403.03431
128
citations
#192

AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One

Mike Ranzinger, Greg Heinrich, Jan Kautz et al.

CVPR 2024arXiv:2312.06709
128
citations
#193

Generalized Predictive Model for Autonomous Driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu et al.

CVPR 2024highlightarXiv:2403.09630
128
citations
#194

Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

Chengjie Wang, wenbing zhu, Bin-Bin Gao et al.

CVPR 2024arXiv:2403.12580
127
citations
#195

InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning

Yan-Shuo Liang, Wu-Jun Li

CVPR 2024arXiv:2404.00228
126
citations
#196

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

Felix Wimbauer, Bichen Wu, Edgar Schoenfeld et al.

CVPR 2024arXiv:2312.03209
126
citations
#197

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

Yuxuan Zhang, Yiren Song, Jiaming Liu et al.

CVPR 2024arXiv:2312.16272
125
citations
#198

WonderJourney: Going from Anywhere to Everywhere

Hong-Xing Yu, Haoyi Duan, Junhwa Hur et al.

CVPR 2024arXiv:2312.03884
124
citations
#199

VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang, Tianxing Wu, Shuai Yang et al.

CVPR 2024arXiv:2312.00777
123
citations
#200

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

Yuanhui Huang, Wenzhao Zheng, Borui Zhang et al.

CVPR 2024arXiv:2311.12754
123
citations
PreviousNext