Most Cited CVPR 2024 "factual consistency evaluation" Papers

2,716 papers found • Page 1 of 14

#1

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li et al.

CVPR 2024highlightarXiv:2310.03744
4313
citations
#2

DETRs Beat YOLOs on Real-time Object Detection

Yian Zhao, Wenyu Lv, Shangliang Xu et al.

CVPR 2024posterarXiv:2304.08069
2424
citations
#3

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang et al.

CVPR 2024posterarXiv:2312.14238
2210
citations
#4

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

Guanjun Wu, Taoran Yi, Jiemin Fang et al.

CVPR 2024posterarXiv:2310.08528
1061
citations
#5

VBench: Comprehensive Benchmark Suite for Video Generative Models

Ziqi Huang, Yinan He, Jiashuo Yu et al.

CVPR 2024highlightarXiv:2311.17982
996
citations
#6

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He et al.

CVPR 2024highlightarXiv:2311.17005
864
citations
#7

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen et al.

CVPR 2024posterarXiv:2308.00692
721
citations
#8

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

Ziyi Yang, Xinyu Gao, Wen Zhou et al.

CVPR 2024posterarXiv:2309.13101
686
citations
#9

VILA: On Pre-training for Visual Language Models

Ji Lin, Danny Yin, Wei Ping et al.

CVPR 2024posterarXiv:2312.07533
685
citations
#10

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye et al.

CVPR 2024highlightarXiv:2311.04257
601
citations
#11

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Tao Lu, Mulin Yu, Linning Xu et al.

CVPR 2024highlightarXiv:2312.00109
589
citations
#12

One-step Diffusion with Distribution Matching Distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang et al.

CVPR 2024posterarXiv:2311.18828
577
citations
#13

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

CVPR 2024posterarXiv:2401.06209
570
citations
#14

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024posterarXiv:2401.12168
550
citations
#15

pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi et al.

CVPR 2024posterarXiv:2312.12337
496
citations
#16

SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM

Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula et al.

CVPR 2024posterarXiv:2312.02126
477
citations
#17

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang et al.

CVPR 2024posterarXiv:2307.16449
471
citations
#18

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Sicong Leng, Hang Zhang, Guanzheng Chen et al.

CVPR 2024highlightarXiv:2311.16922
449
citations
#19

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Bowen Wen, Wei Yang, Jan Kautz et al.

CVPR 2024highlightarXiv:2312.08344
429
citations
#20

Generative Multimodal Models are In-Context Learners

Quan Sun, Yufeng Cui, Xiaosong Zhang et al.

CVPR 2024posterarXiv:2312.13286
422
citations
#21

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.

CVPR 2024posterarXiv:2310.14566
390
citations
#22

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

CVPR 2024highlightarXiv:2311.06607
384
citations
#23

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang et al.

CVPR 2024highlightarXiv:2311.17911
382
citations
#24

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

Jing Shi, Wei Xiong, Zhe Lin et al.

CVPR 2024posterarXiv:2304.03411
369
citations
#25

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Chi Yan, Delin Qu, Dong Wang et al.

CVPR 2024highlightarXiv:2311.11700
359
citations
#26

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li et al.

CVPR 2024posterarXiv:2312.02051
356
citations
#27

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Cai Zhang et al.

CVPR 2024highlightarXiv:2311.08046
354
citations
#28

Compact 3D Gaussian Representation for Radiance Field

Joo Chan Lee, Daniel Rho, Xiangyu Sun et al.

CVPR 2024highlightarXiv:2311.13681
348
citations
#29

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2024posterarXiv:2402.19479
347
citations
#30

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Tianyu Yu, Yuan Yao, Haoye Zhang et al.

CVPR 2024posterarXiv:2312.00849
344
citations
#31

V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

CVPR 2024posterarXiv:2312.14135
344
citations
#32

Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

Shijie Zhou, Haoran Chang, Sicheng Jiang et al.

CVPR 2024highlightarXiv:2312.03203
335
citations
#33

Text-to-3D using Gaussian Splatting

Zilong Chen, Feng Wang, Yikai Wang et al.

CVPR 2024posterarXiv:2309.16585
330
citations
#34

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang et al.

CVPR 2024posterarXiv:2312.02145
328
citations
#35

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

Yiwen Chen, Zilong Chen, Chi Zhang et al.

CVPR 2024posterarXiv:2311.14521
321
citations
#36

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew et al.

CVPR 2024posterarXiv:2311.16498
318
citations
#37

Splatter Image: Ultra-Fast Single-View 3D Reconstruction

Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi

CVPR 2024posterarXiv:2312.13150
316
citations
#38

Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.

CVPR 2024posterarXiv:2303.04761
309
citations
#39

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

Yujun Shi, Chuhui Xue, Jun Hao Liew et al.

CVPR 2024highlightarXiv:2306.14435
308
citations
#40

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Yihua Huang, Yangtian Sun, Ziyi Yang et al.

CVPR 2024posterarXiv:2312.14937
302
citations
#41

Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit et al.

CVPR 2024highlightarXiv:2401.09603
294
citations
#42

ReconFusion: 3D Reconstruction with Diffusion Priors

Rundi Wu, Ben Mildenhall, Philipp Henzler et al.

CVPR 2024posterarXiv:2312.02981
290
citations
#43

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers

Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.

CVPR 2024posterarXiv:2312.09147
266
citations
#44

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Lu Ling, Yichen Sheng, Zhi Tu et al.

CVPR 2024posterarXiv:2312.16256
266
citations
#45

DeepCache: Accelerating Diffusion Models for Free

Xinyin Ma, Gongfan Fang, Xinchao Wang

CVPR 2024posterarXiv:2312.00858
265
citations
#46

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Rongyuan Wu, Tao Yang, Lingchen Sun et al.

CVPR 2024posterarXiv:2311.16518
256
citations
#47

On Scaling Up a Multilingual Vision and Language Model

Xi Chen, Josip Djolonga, Piotr Padlewski et al.

CVPR 2024posterarXiv:2305.18565
254
citations
#48

VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang, Xin Wang, Hong Chen et al.

CVPR 2024highlightarXiv:2311.18445
244
citations
#49

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition

Xiaohan Ding, Yiyuan Zhang, Yixiao Ge et al.

CVPR 2024posterarXiv:2311.15599
243
citations
#50

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Taoran Yi, Jiemin Fang, Junjie Wang et al.

CVPR 2024posterarXiv:2310.08529
241
citations
#51

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.

CVPR 2024highlightarXiv:2312.00863
241
citations
#52

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

Shelly Sheynin, Adam Polyak, Uriel Singer et al.

CVPR 2024highlightarXiv:2311.10089
238
citations
#53

RoMa: Robust Dense Feature Matching

Johan Edstedt, Qiyu Sun, Georg Bökman et al.

CVPR 2024posterarXiv:2305.15404
238
citations
#54

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Yaofang Liu, Xiaodong Cun, Xuebo Liu et al.

CVPR 2024posterarXiv:2310.11440
237
citations
#55

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang et al.

CVPR 2024posterarXiv:2312.10115
236
citations
#56

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

Jiahe Li, Jiawei Zhang, Xiao Bai et al.

CVPR 2024posterarXiv:2403.06912
232
citations
#57

GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces

Yingwenqi Jiang, Jiadong Tu, Yuan Liu et al.

CVPR 2024posterarXiv:2311.17977
231
citations
#58

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang et al.

CVPR 2024poster
230
citations
#59

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.

CVPR 2024posterarXiv:2312.00785
230
citations
#60

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

Yufei Wang, Wenhan Yang, Xinyuan Chen et al.

CVPR 2024posterarXiv:2311.14760
226
citations
#61

Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer

Jiwoo Chung, Sangeek Hyun, Jae-Pil Heo

CVPR 2024highlightarXiv:2312.09008
224
citations
#62

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

Le Xue, Ning Yu, Shu Zhang et al.

CVPR 2024posterarXiv:2305.08275
192
citations
#63

GS-IR: 3D Gaussian Splatting for Inverse Rendering

Zhihao Liang, Qi Zhang, Ying Feng et al.

CVPR 2024posterarXiv:2311.16473
188
citations
#64

Putting the Object Back into Video Object Segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price et al.

CVPR 2024highlightarXiv:2310.12982
182
citations
#65

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Jeongho Kim, Gyojung Gu, Minho Park et al.

CVPR 2024posterarXiv:2312.01725
176
citations
#66

RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection

Ximiao Zhang, Min Xu, Xiuzhuang Zhou

CVPR 2024posterarXiv:2403.05897
171
citations
#67

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

CVPR 2024posterarXiv:2311.18482
171
citations
#68

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Zhiqi Li, Zhiding Yu, Shiyi Lan et al.

CVPR 2024posterarXiv:2312.03031
169
citations
#69

4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

Sherwin Bahmani, Ivan Skorokhodov, Victor Rong et al.

CVPR 2024posterarXiv:2311.17984
168
citations
#70

Compositional Chain-of-Thought Prompting for Large Multimodal Models

Chancharik Mitra, Brandon Huang, Trevor Darrell et al.

CVPR 2024posterarXiv:2311.17076
167
citations
#71

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

Zhijing Shao, Wang Zhaolong, Zhuang Li et al.

CVPR 2024posterarXiv:2403.05087
165
citations
#72

BioCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens, Jiaman Wu, Matthew Thompson et al.

CVPR 2024posterarXiv:2311.18803
165
citations
#73

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, Yihao Feng et al.

CVPR 2024posterarXiv:2303.09618
164
citations
#74

GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

Junjie Wang, Jiemin Fang, Xiaopeng Zhang et al.

CVPR 2024posterarXiv:2311.16037
164
citations
#75

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Shunyuan Zheng, Boyao ZHOU, Ruizhi Shao et al.

CVPR 2024highlightarXiv:2312.02155
162
citations
#76

Grounded Text-to-Image Synthesis with Attention Refocusing

Quynh Phung, Songwei Ge, Jia-Bin Huang

CVPR 2024posterarXiv:2306.05427
159
citations
#77

Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering

Zhiwen Yan, Weng Fei Low, Yu Chen et al.

CVPR 2024posterarXiv:2311.17089
159
citations
#78

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Mu Cai, Haotian Liu, Siva Mustikovela et al.

CVPR 2024posterarXiv:2312.00784
153
citations
#79

Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan, Wentong Li, Jian liu et al.

CVPR 2024posterarXiv:2312.10032
149
citations
#80

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Yuelang Xu, Benwang Chen, Zhe Li et al.

CVPR 2024poster
147
citations
#81

Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

Yifan Wang, Xingyi He, Sida Peng et al.

CVPR 2024highlightarXiv:2403.04765
142
citations
#82

MMA-Diffusion: MultiModal Attack on Diffusion Models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang et al.

CVPR 2024posterarXiv:2311.17516
141
citations
#83

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang et al.

CVPR 2024highlightarXiv:2312.06739
139
citations
#84

Probing the 3D Awareness of Visual Foundation Models

Mohamed El Banani, Amit Raj, Kevis-kokitsi Maninis et al.

CVPR 2024posterarXiv:2404.08636
138
citations
#85

Optimal Transport Aggregation for Visual Place Recognition

Sergio Izquierdo, Javier Civera

CVPR 2024posterarXiv:2311.15937
138
citations
#86

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.

CVPR 2024posterarXiv:2311.11278
133
citations
#87

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Chunlong Xia, Xinliang Wang, Feng Lv et al.

CVPR 2024highlightarXiv:2403.07392
133
citations
#88

XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng et al.

CVPR 2024highlightarXiv:2312.03806
132
citations
#89

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han et al.

CVPR 2024posterarXiv:2312.10103
130
citations
#90

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

Jiehong Lin, lihua liu, Dekun Lu et al.

CVPR 2024posterarXiv:2311.15707
129
citations
#91

GART: Gaussian Articulated Template Models

Jiahui Lei, Yufu Wang, Georgios Pavlakos et al.

CVPR 2024highlightarXiv:2311.16099
129
citations
#92

Relightable Gaussian Codec Avatars

Shunsuke Saito, Gabriel Schwartz, Tomas Simon et al.

CVPR 2024posterarXiv:2312.03704
128
citations
#93

VLP: Vision Language Planning for Autonomous Driving

Chenbin Pan, Burhan Yaman, Tommaso Nesti et al.

CVPR 2024posterarXiv:2401.05577
127
citations
#94

NeuRAD: Neural Rendering for Autonomous Driving

Adam Tonderski, Carl Lindström, Georg Hess et al.

CVPR 2024highlightarXiv:2311.15260
126
citations
#95

Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

Xunpeng Yi, Han Xu, HAO ZHANG et al.

CVPR 2024posterarXiv:2403.16387
123
citations
#96

Generalized Predictive Model for Autonomous Driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu et al.

CVPR 2024highlightarXiv:2403.09630
122
citations
#97

Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

Chengjie Wang, wenbing zhu, Bin-Bin Gao et al.

CVPR 2024posterarXiv:2403.12580
120
citations
#98

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

Hao Ouyang, Qiuyu Wang, Yuxi Xiao et al.

CVPR 2024highlightarXiv:2308.07926
120
citations
#99

EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection

Xuanyu Zhang, Runyi Li, Jiwen Yu et al.

CVPR 2024posterarXiv:2312.08883
118
citations
#100

Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers

Jinxia Xie, Bineng Zhong, Zhiyi Mo et al.

CVPR 2024poster
118
citations
#101

VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang, Tianxing Wu, Shuai Yang et al.

CVPR 2024posterarXiv:2312.00777
118
citations
#102

Towards Learning a Generalist Model for Embodied Navigation

Duo Zheng, Shijia Huang, Lin Zhao et al.

CVPR 2024highlightarXiv:2312.02010
118
citations
#103

OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning

Lingyi Hong, Shilin Yan, Renrui Zhang et al.

CVPR 2024highlightarXiv:2403.09634
118
citations
#104

InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning

Yan-Shuo Liang, Wu-Jun Li

CVPR 2024posterarXiv:2404.00228
117
citations
#105

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Chaoya Jiang, Haiyang Xu, Mengfan Dong et al.

CVPR 2024posterarXiv:2312.06968
116
citations
#106

Human Gaussian Splatting: Real-time Rendering of Animatable Avatars

Arthur Moreau, Jifei Song, Helisa Dhamo et al.

CVPR 2024posterarXiv:2311.17113
116
citations
#107

VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu et al.

CVPR 2024posterarXiv:2406.11816
114
citations
#108

PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection

Xiaofan Li, Zhizhong Zhang, Xin Tan et al.

CVPR 2024posterarXiv:2404.05231
114
citations
#109

One-dimensional Adapter to Rule Them All: Concepts Diffusion Models and Erasing Applications

Mengyao Lyu, Yuhong Yang, Haiwen Hong et al.

CVPR 2024highlightarXiv:2312.16145
112
citations
#110

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Dewei Zhou, You Li, Fan Ma et al.

CVPR 2024highlightarXiv:2402.05408
112
citations
#111

Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Fengyu Yang, Chao Feng, Ziyang Chen et al.

CVPR 2024posterarXiv:2401.18084
111
citations
#112

Can I Trust Your Answer? Visually Grounded Video Question Answering

Junbin Xiao, Angela Yao, Yicong Li et al.

CVPR 2024highlightarXiv:2309.01327
109
citations
#113

Efficient Test-Time Adaptation of Vision-Language Models

Adilbek Karmanov, Dayan Guan, Shijian Lu et al.

CVPR 2024posterarXiv:2403.18293
109
citations
#114

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Yazhou Xing, Yingqing He, Zeyue Tian et al.

CVPR 2024posterarXiv:2402.17723
109
citations
#115

Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models

Xianfang Zeng, Xin Chen, Zhongqi Qi et al.

CVPR 2024posterarXiv:2312.13913
108
citations
#116

GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence

Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann et al.

CVPR 2024posterarXiv:2311.14155
107
citations
#117

SimDA: Simple Diffusion Adapter for Efficient Video Generation

Zhen Xing, Qi Dai, Han Hu et al.

CVPR 2024posterarXiv:2308.09710
106
citations
#118

FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization

Jiahui Zhang, Fangneng Zhan, MUYU XU et al.

CVPR 2024posterarXiv:2403.06908
106
citations
#119

OMG-Seg: Is One Model Good Enough For All Segmentation?

Xiangtai Li, Haobo Yuan, Wei Li et al.

CVPR 2024posterarXiv:2401.10229
106
citations
#120

PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment

Tianchen Deng, Guole Shen, Tong Qin et al.

CVPR 2024posterarXiv:2312.09866
105
citations
#121

RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

Jihan Yang, Runyu Ding, Weipeng DENG et al.

CVPR 2024posterarXiv:2304.00962
104
citations
#122

Zero-Reference Low-Light Enhancement via Physical Quadruple Priors

Wenjing Wang, Huan Yang, Jianlong Fu et al.

CVPR 2024posterarXiv:2403.12933
101
citations
#123

3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis

Zhicheng Lu, xiang guo, Le Hui et al.

CVPR 2024posterarXiv:2404.06270
101
citations
#124

Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

ZHIXIANG WEI, Lin Chen, Xiaoxiao Ma et al.

CVPR 2024posterarXiv:2312.04265
100
citations
#125

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Seokju Yun, Youngmin Ro

CVPR 2024posterarXiv:2401.16456
100
citations
#126

Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Rafail Fridman, Danah Yatim, Omer Bar-Tal et al.

CVPR 2024posterarXiv:2311.17009
98
citations
#127

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

Chuwei Luo, Yufan Shen, Zhaoqing Zhu et al.

CVPR 2024posterarXiv:2404.05225
98
citations
#128

HIPTrack: Visual Tracking with Historical Prompts

Wenrui Cai, Qingjie Liu, Yunhong Wang

CVPR 2024posterarXiv:2311.02072
96
citations
#129

Boosting Adversarial Transferability by Block Shuffle and Rotation

Kunyu Wang, he xuanran, Wenxuan Wang et al.

CVPR 2024posterarXiv:2308.10299
96
citations
#130

GARField: Group Anything with Radiance Fields

Chung Min Kim, Mingxuan Wu, Justin Kerr et al.

CVPR 2024posterarXiv:2401.09419
96
citations
#131

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

Bin Xie, Jiale Cao, Jin Xie et al.

CVPR 2024posterarXiv:2311.15537
96
citations
#132

Single-Model and Any-Modality for Video Object Tracking

Zongwei Wu, Jilai Zheng, Xiangxuan Ren et al.

CVPR 2024posterarXiv:2311.15851
96
citations
#133

Self-correcting LLM-controlled Diffusion Models

Tsung-Han Wu, Long Lian, Joseph Gonzalez et al.

CVPR 2024posterarXiv:2311.16090
95
citations
#134

Residual Denoising Diffusion Models

Jiawei Liu, Qiang Wang, Huijie Fan et al.

CVPR 2024posterarXiv:2308.13712
93
citations
#135

Generative Image Dynamics

Zhengqi Li, Richard Tucker, Noah Snavely et al.

CVPR 2024posterarXiv:2309.07906
93
citations
#136

Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields

Leili Goli, Cody Reading, Silvia Sellán et al.

CVPR 2024highlightarXiv:2309.03185
89
citations
#137

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

Chong Mou, Xintao Wang, Jiechong Song et al.

CVPR 2024posterarXiv:2402.02583
89
citations
#138

VidToMe: Video Token Merging for Zero-Shot Video Editing

Xirui Li, Chao Ma, Xiaokang Yang et al.

CVPR 2024posterarXiv:2312.10656
89
citations
#139

AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error

Jonas Ricker, Denis Lukovnikov, Asja Fischer

CVPR 2024posterarXiv:2401.17879
89
citations
#140

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

Sihan liu, Yiwei Ma, Xiaoqing Zhang et al.

CVPR 2024posterarXiv:2312.12470
89
citations
#141

GES : Generalized Exponential Splatting for Efficient Radiance Field Rendering

Abdullah J Hamdi, Luke Melas-Kyriazi, Jinjie Mai et al.

CVPR 2024posterarXiv:2402.10128
88
citations
#142

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.

CVPR 2024posterarXiv:2311.17049
87
citations
#143

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

Gege Gao, Weiyang Liu, Anpei Chen et al.

CVPR 2024posterarXiv:2312.00093
87
citations
#144

CLIP-KD: An Empirical Study of CLIP Model Distillation

Chuanguang Yang, Zhulin An, Libo Huang et al.

CVPR 2024posterarXiv:2307.12732
86
citations
#145

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Kyle Sargent, Zizhang Li, Tanmay Shah et al.

CVPR 2024posterarXiv:2310.17994
85
citations
#146

DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models

Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood et al.

CVPR 2024posterarXiv:2405.14881
85
citations
#147

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

Yifei Huang, Guo Chen, Jilan Xu et al.

CVPR 2024posterarXiv:2403.16182
84
citations
#148

Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation

Alexander Raistrick, Lingjie Mei, Karhan Kayan et al.

CVPR 2024posterarXiv:2406.11824
84
citations
#149

TUMTraf V2X Cooperative Perception Dataset

Walter Zimmer, Gerhard Arya Wardana, Suren Sritharan et al.

CVPR 2024posterarXiv:2403.01316
83
citations
#150

Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining

Xiang Chen, Jinshan Pan, Jiangxin Dong

CVPR 2024posterarXiv:2404.01547
83
citations
#151

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

Mubashir Noman, Muzammal Naseer, Hisham Cholakkal et al.

CVPR 2024posterarXiv:2403.05419
83
citations
#152

Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.

CVPR 2024posterarXiv:2402.13250
82
citations
#153

Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models

Xingqian Xu, Jiayi Guo, Zhangyang Wang et al.

CVPR 2024posterarXiv:2305.16223
82
citations
#154

Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline

Xiao Wang, Shiao Wang, Chuanming Tang et al.

CVPR 2024posterarXiv:2309.14611
82
citations
#155

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

Xiefan Guo, Jinlin Liu, Miaomiao Cui et al.

CVPR 2024posterarXiv:2404.04650
81
citations
#156

IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

Junbo Yin, Wenguan Wang, Runnan Chen et al.

CVPR 2024highlightarXiv:2403.15241
81
citations
#157

HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

Yuheng Jiang, Zhehao Shen, Penghao Wang et al.

CVPR 2024posterarXiv:2312.03461
81
citations
#158

General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

CVPR 2024highlightarXiv:2312.09158
81
citations
#159

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.

CVPR 2024posterarXiv:2312.03777
80
citations
#160

Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Sixian Zhang, Bohan Wang, Junqiang Wu et al.

CVPR 2024posterarXiv:2405.14705
80
citations
#161

NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

Nilesh Kulkarni, Davis Rempe, Kyle Genova et al.

CVPR 2024posterarXiv:2307.07511
80
citations
#162

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Xiuquan Hou, Meiqin Liu, Senlin Zhang et al.

CVPR 2024posterarXiv:2403.16131
80
citations
#163

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Hangjie Yuan, Shiwei Zhang, Xiang Wang et al.

CVPR 2024posterarXiv:2312.12490
80
citations
#164

Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology

Andrew Song, Richard J. Chen, Tong Ding et al.

CVPR 2024posterarXiv:2405.11643
79
citations
#165

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Feng Lu, Xiangyuan Lan, Lijun Zhang et al.

CVPR 2024posterarXiv:2402.19231
79
citations
#166

CCEdit: Creative and Controllable Video Editing via Diffusion Models

Ruoyu Feng, Wenming Weng, Yanhui Wang et al.

CVPR 2024posterarXiv:2309.16496
79
citations
#167

Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Yutong Feng, Biao Gong, Di Chen et al.

CVPR 2024posterarXiv:2311.17002
78
citations
#168

FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding

Jun Xiang, Xuan Gao, Yudong Guo et al.

CVPR 2024posterarXiv:2312.02214
78
citations
#169

Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Zan Wang, Yixin Chen, Baoxiong Jia et al.

CVPR 2024highlightarXiv:2403.18036
78
citations
#170

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Haiyang Liu, Zihao Zhu, Giorgio Becherini et al.

CVPR 2024posterarXiv:2401.00374
78
citations
#171

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

Xiaoyang Wu, Zhuotao Tian, Xin Wen et al.

CVPR 2024posterarXiv:2308.09718
77
citations
#172

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.

CVPR 2024posterarXiv:2402.16846
76
citations
#173

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

Linshan Wu, Jia-Xin Zhuang, Hao Chen

CVPR 2024posterarXiv:2402.17300
76
citations
#174

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

Yiran Qin, Enshen Zhou, Qichang Liu et al.

CVPR 2024posterarXiv:2312.07472
76
citations
#175

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

Leheng Zhang, Yawei Li, Xingyu Zhou et al.

CVPR 2024posterarXiv:2401.08209
76
citations
#176

SAI3D: Segment Any Instance in 3D Scenes

Yingda Yin, Yuzheng Liu, Yang Xiao et al.

CVPR 2024posterarXiv:2312.11557
76
citations
#177

Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering

Kim Youwang, Tae-Hyun Oh, Gerard Pons-Moll

CVPR 2024posterarXiv:2312.11360
75
citations
#178

Structure-Aware Sparse-View X-ray 3D Reconstruction

Yuanhao Cai, Jiahao Wang, Alan L. Yuille et al.

CVPR 2024posterarXiv:2311.10959
75
citations
#179

Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

Jiawen Li, Yuxuan Chen, Hongbo Chu et al.

CVPR 2024posterarXiv:2403.07719
75
citations
#180

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

Haoze Sun, Wenbo Li, Jianzhuang Liu et al.

CVPR 2024posterarXiv:2311.16512
74
citations
#181

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024posterarXiv:2312.00878
74
citations
#182

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Qihang Ma, Xin Tan, Yanyun Qu et al.

CVPR 2024posterarXiv:2312.01919
73
citations
#183

Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing

Yafei Zhang, Shen Zhou, Huafeng Li

CVPR 2024posterarXiv:2403.01105
73
citations
#184

LLaFS: When Large Language Models Meet Few-Shot Segmentation

Lanyun Zhu, Tianrun Chen, Deyi Ji et al.

CVPR 2024posterarXiv:2311.16926
73
citations
#185

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

Hanwen Jiang, Arjun Karpur, Bingyi Cao et al.

CVPR 2024posterarXiv:2405.12979
72
citations
#186

ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

Jiayu Yang, Ziang Cheng, Yunfei Duan et al.

CVPR 2024posterarXiv:2310.10343
72
citations
#187

MoCha-Stereo: Motif Channel Attention Network for Stereo Matching

Ziyang Chen, Wei Long, He Yao et al.

CVPR 2024posterarXiv:2404.06842
72
citations
#188

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Evonne Ng, Javier Romero, Timur Bagautdinov et al.

CVPR 2024posterarXiv:2401.01885
71
citations
#189

MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception

Thien-Minh Nguyen, Shenghai Yuan, Thien Nguyen et al.

CVPR 2024highlightarXiv:2403.11496
70
citations
#190

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Ye Yuan, Xueting Li, Yangyi Huang et al.

CVPR 2024highlightarXiv:2312.11461
70
citations
#191

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

CVPR 2024posterarXiv:2404.08531
70
citations
#192

OpenBias: Open-set Bias Detection in Text-to-Image Generative Models

Moreno D&#x27, Incà, Elia Peruzzo et al.

CVPR 2024highlight
69
citations
#193

SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

Hsuan-I Ho, Jie Song, Otmar Hilliges

CVPR 2024posterarXiv:2311.15855
69
citations
#194

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Jiamian Wang, Guohao Sun, Pichao Wang et al.

CVPR 2024highlightarXiv:2403.17998
69
citations
#195

Towards Realistic Scene Generation with LiDAR Diffusion Models

Haoxi Ran, Vitor Guizilini, Yue Wang

CVPR 2024posterarXiv:2404.00815
68
citations
#196

Optimizing Diffusion Noise Can Serve As Universal Motion Priors

Korrawe Karunratanakul, Konpat Preechakul, Emre Aksan et al.

CVPR 2024posterarXiv:2312.11994
68
citations
#197

Free3D: Consistent Novel View Synthesis without 3D Representation

Chuanxia Zheng, Andrea Vedaldi

CVPR 2024posterarXiv:2312.04551
68
citations
#198

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Jiawang Bai, Kuofeng Gao, Shaobo Min et al.

CVPR 2024posterarXiv:2311.16194
68
citations
#199

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

Zhixuan Liang, Yao Mu, Hengbo Ma et al.

CVPR 2024posterarXiv:2312.11598
67
citations
#200

Scaling Laws for Data Filtering— Data Curation cannot be Compute Agnostic

Sachin Goyal, Pratyush Maini, Zachary Lipton et al.

CVPR 2024posterarXiv:2404.07177
67
citations
PreviousNext