Most Cited CVPR "action sequence similarity" Papers

5,589 papers found

#401

SAI3D: Segment Any Instance in 3D Scenes

Yingda Yin, Yuzheng Liu, Yang Xiao et al.

CVPR 2024arXiv:2312.11557
79
citations
#402

HVI: A New Color Space for Low-light Image Enhancement

Qingsen Yan, Yixu Feng, Cheng Zhang et al.

CVPR 2025arXiv:2502.20272
79
citations
#403

FairCLIP: Harnessing Fairness in Vision-Language Learning

Yan Luo, Min Shi, Muhammad Osama Khan et al.

CVPR 2024arXiv:2403.19949
79
citations
#404

Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering

Kim Youwang, Tae-Hyun Oh, Gerard Pons-Moll

CVPR 2024arXiv:2312.11360
78
citations
#405

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Yilun Zhao, Lujing Xie, Haowei Zhang et al.

CVPR 2025arXiv:2501.12380
78
citations
#406

Inter-X: Towards Versatile Human-Human Interaction Analysis

Liang Xu, Xintao Lv, Yichao Yan et al.

CVPR 2024arXiv:2312.16051
78
citations
#407

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Qihang Ma, Xin Tan, Yanyun Qu et al.

CVPR 2024arXiv:2312.01919
78
citations
#408

LLaFS: When Large Language Models Meet Few-Shot Segmentation

Lanyun Zhu, Tianrun Chen, Deyi Ji et al.

CVPR 2024arXiv:2311.16926
78
citations
#409

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

Xiaoyang Wu, Zhuotao Tian, Xin Wen et al.

CVPR 2024arXiv:2308.09718
78
citations
#410

Multimodal Autoregressive Pre-training of Large Vision Encoders

Enrico Fini, Mustafa Shukor, Xiujun Li et al.

CVPR 2025highlightarXiv:2411.14402
77
citations
#411

Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model

Dian Zheng, Xiao-Ming Wu, Shuzhou Yang et al.

CVPR 2024arXiv:2403.11157
77
citations
#412

SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction

Zechuan Zhang, Zongxin Yang, Yi Yang

CVPR 2024highlightarXiv:2312.06704
77
citations
#413

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Xi Chen, Zhifei Zhang, He Zhang et al.

CVPR 2025highlightarXiv:2412.07774
77
citations
#414

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

Yiran Qin, Enshen Zhou, Qichang Liu et al.

CVPR 2024arXiv:2312.07472
77
citations
#415

Learning Vision from Models Rivals Learning Vision from Data

Yonglong Tian, Lijie Fan, Kaifeng Chen et al.

CVPR 2024arXiv:2312.17742
77
citations
#416

Instruct-Imagen: Image Generation with Multi-modal Instruction

Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su et al.

CVPR 2024arXiv:2401.01952
77
citations
#417

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024arXiv:2312.00878
76
citations
#418

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

Leheng Zhang, Yawei Li, Xingyu Zhou et al.

CVPR 2024arXiv:2401.08209
76
citations
#419

Streaming Dense Video Captioning

Xingyi Zhou, Anurag Arnab, Shyamal Buch et al.

CVPR 2024arXiv:2404.01297
76
citations
#420

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

Yijun Yang, Tianyi Zhou, Kanxue Li et al.

CVPR 2024arXiv:2311.16714
76
citations
#421

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Yushi Hu, Otilia Stretcu, Chun-Ta Lu et al.

CVPR 2024arXiv:2312.03052
76
citations
#422

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

Linshan Wu, Jia-Xin Zhuang, Hao Chen

CVPR 2024arXiv:2402.17300
76
citations
#423

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.

CVPR 2024arXiv:2402.16846
76
citations
#424

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

CVPR 2024arXiv:2404.08531
75
citations
#425

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong et al.

CVPR 2024arXiv:2309.00610
75
citations
#426

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

Haoze Sun, Wenbo Li, Jianzhuang Liu et al.

CVPR 2024arXiv:2311.16512
75
citations
#427

A Unified Approach for Text- and Image-guided 4D Scene Generation

Yufeng Zheng, Xueting Li, Koki Nagano et al.

CVPR 2024arXiv:2311.16854
75
citations
#428

VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Runtao Liu, Haoyu Wu, Zheng Ziqiang et al.

CVPR 2025arXiv:2412.14167
75
citations
#429

TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

Sai Kumar Dwivedi, Yu Sun, Priyanka Patel et al.

CVPR 2024arXiv:2404.16752
75
citations
#430

MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception

Thien-Minh Nguyen, Shenghai Yuan, Thien Nguyen et al.

CVPR 2024highlightarXiv:2403.11496
75
citations
#431

AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Trevine Oorloff, Surya Koppisetti, Nicolo Bonettini et al.

CVPR 2024arXiv:2406.02951
74
citations
#432

GenTron: Diffusion Transformers for Image and Video Generation

Shoufa Chen, Mengmeng Xu, Jiawei Ren et al.

CVPR 2024arXiv:2312.04557
74
citations
#433

SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

Hsuan-I Ho, Jie Song, Otmar Hilliges

CVPR 2024arXiv:2311.15855
74
citations
#434

Adaptive Keyframe Sampling for Long Video Understanding

Xi Tang, Jihao Qiu, Lingxi Xie et al.

CVPR 2025arXiv:2502.21271
73
citations
#435

LLMs are Good Sign Language Translators

Jia Gong, Lin Geng Foo, Yixuan He et al.

CVPR 2024arXiv:2404.00925
73
citations
#436

LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection

Yunpeng Luo, Junlong Du, Ke Yan et al.

CVPR 2024arXiv:2403.17465
73
citations
#437

ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

Jiayu Yang, Ziang Cheng, Yunfei Duan et al.

CVPR 2024arXiv:2310.10343
73
citations
#438

RegionGPT: Towards Region Understanding Vision Language Model

Qiushan Guo, Shalini De Mello, Danny Yin et al.

CVPR 2024arXiv:2403.02330
73
citations
#439

Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing

Yafei Zhang, Shen Zhou, Huafeng Li

CVPR 2024arXiv:2403.01105
73
citations
#440

VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens

Fan Ma, Xiaojie Jin, Heng Wang et al.

CVPR 2024arXiv:2312.08870
73
citations
#441

MoCha-Stereo: Motif Channel Attention Network for Stereo Matching

Ziyang Chen, Wei Long, He Yao et al.

CVPR 2024arXiv:2404.06842
73
citations
#442

ChatPose: Chatting about 3D Human Pose

Yao Feng, Jing Lin, Sai Kumar Dwivedi et al.

CVPR 2024arXiv:2311.18836
72
citations
#443

Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology

Wenhao Tang, Fengtao Zhou, Sheng Huang et al.

CVPR 2024arXiv:2402.17228
72
citations
#444

OpenBias: Open-set Bias Detection in Text-to-Image Generative Models

Moreno D'Incà, Elia Peruzzo et al.

CVPR 2024highlightarXiv:2404.07990
72
citations
#445

GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani et al.

CVPR 2024arXiv:2404.06609
72
citations
#446

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

Hanwen Jiang, Arjun Karpur, Bingyi Cao et al.

CVPR 2024arXiv:2405.12979
72
citations
#447

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Evonne Ng, Javier Romero, Timur Bagautdinov et al.

CVPR 2024arXiv:2401.01885
72
citations
#448

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

Zehuan Huang, Hao Wen, Junting Dong et al.

CVPR 2024arXiv:2312.06725
72
citations
#449

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Honghui Yang, Sha Zhang, Di Huang et al.

CVPR 2024arXiv:2310.08370
72
citations
#450

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Ye Yuan, Xueting Li, Yangyi Huang et al.

CVPR 2024highlightarXiv:2312.11461
71
citations
#451

DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

Chen Min, Dawei Zhao, Liang Xiao et al.

CVPR 2024arXiv:2405.04390
71
citations
#452

Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion

Litu Rout, Yujia Chen, Abhishek Kumar et al.

CVPR 2024arXiv:2312.00852
71
citations
#453

Towards Generalizable Tumor Synthesis

Qi Chen, Xiaoxi Chen, Haorui Song et al.

CVPR 2024arXiv:2402.19470
71
citations
#454

Exploiting Style Latent Flows for Generalizing Deepfake Video Detection

Jongwook Choi, Taehoon Kim, Yonghyun Jeong et al.

CVPR 2024arXiv:2403.06592
71
citations
#455

FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering

Guofeng Feng, Siyan Chen, Rong Fu et al.

CVPR 2025arXiv:2408.07967
71
citations
#456

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Zhihao Yuan, Jinke Ren, Chun-Mei Feng et al.

CVPR 2024arXiv:2311.15383
70
citations
#457

Harnessing Large Language Models for Training-free Video Anomaly Detection

Luca Zanella, Willi Menapace, Massimiliano Mancini et al.

CVPR 2024arXiv:2404.01014
70
citations
#458

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Duo Zheng, Shijia Huang, Liwei Wang

CVPR 2025arXiv:2412.00493
70
citations
#459

MindBridge: A Cross-Subject Brain Decoding Framework

Shizun Wang, Songhua Liu, Zhenxiong Tan et al.

CVPR 2024highlightarXiv:2404.07850
70
citations
#460

Detector-Free Structure from Motion

Xingyi He, Jiaming Sun, Yifan Wang et al.

CVPR 2024arXiv:2306.15669
70
citations
#461

Task-Customized Mixture of Adapters for General Image Fusion

Pengfei Zhu, Yang Sun, Bing Cao et al.

CVPR 2024arXiv:2403.12494
69
citations
#462

Open-Vocabulary Video Anomaly Detection

Peng Wu, Xuerong Zhou, Guansong Pang et al.

CVPR 2024arXiv:2311.07042
69
citations
#463

Aligning and Prompting Everything All at Once for Universal Visual Perception

Yunhang Shen, Chaoyou Fu, Peixian Chen et al.

CVPR 2024arXiv:2312.02153
69
citations
#464

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Lukas Höllein, Aljaž Božič, Norman Müller et al.

CVPR 2024arXiv:2403.01807
69
citations
#465

Free3D: Consistent Novel View Synthesis without 3D Representation

Chuanxia Zheng, Andrea Vedaldi

CVPR 2024arXiv:2312.04551
69
citations
#466

LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs

Yunsheng Ma, Can Cui, Xu Cao et al.

CVPR 2024arXiv:2312.04372
69
citations
#467

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Jiamian Wang, Guohao Sun, Pichao Wang et al.

CVPR 2024highlightarXiv:2403.17998
69
citations
#468

DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing

Kaiwen Zhang, Yifan Zhou, Xudong Xu et al.

CVPR 2024arXiv:2312.07409
69
citations
#469

AVID: Any-Length Video Inpainting with Diffusion Model

Zhixing Zhang, Bichen Wu, Xiaoyan Wang et al.

CVPR 2024arXiv:2312.03816
69
citations
#470

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Lei Li, Wei Yuancheng, Zhihui Xie et al.

CVPR 2025highlightarXiv:2411.17451
69
citations
#471

FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions

Zhen Liu, Hao Zhu, Qi Zhang et al.

CVPR 2024arXiv:2312.02434
69
citations
#472

PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution

Honghao Chen, Xiangxiang Chu, Renyongjian et al.

CVPR 2024arXiv:2403.07589
68
citations
#473

Optimizing Diffusion Noise Can Serve As Universal Motion Priors

Korrawe Karunratanakul, Konpat Preechakul, Emre Aksan et al.

CVPR 2024arXiv:2312.11994
68
citations
#474

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Jiawang Bai, Kuofeng Gao, Shaobo Min et al.

CVPR 2024arXiv:2311.16194
68
citations
#475

Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

Sachin Goyal, Pratyush Maini, Zachary Lipton et al.

CVPR 2024arXiv:2404.07177
68
citations
#476

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Luo, Xue Yang, Wenhan Dou et al.

CVPR 2025arXiv:2410.08202
68
citations
#477

Towards Realistic Scene Generation with LiDAR Diffusion Models

Haoxi Ran, Vitor Guizilini, Yue Wang

CVPR 2024arXiv:2404.00815
68
citations
#478

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

Yongting Zhang, Lu Chen, Guodong Zheng et al.

CVPR 2025arXiv:2406.12030
68
citations
#479

OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning

Haiyang Ying, Yixuan Yin, Jinzhi Zhang et al.

CVPR 2024arXiv:2311.11666
68
citations
#480

Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

Chang Liu, Haoning Wu, Yujie Zhong et al.

CVPR 2024arXiv:2306.00973
67
citations
#481

An Aggregation-Free Federated Learning for Tackling Data Heterogeneity

Yuan Wang, Huazhu Fu, Renuga Kanagavelu et al.

CVPR 2024arXiv:2404.18962
67
citations
#482

One-Minute Video Generation with Test-Time Training

Jiarui Xu, Shihao Han, Karan Dalal et al.

CVPR 2025arXiv:2504.05298
67
citations
#483

Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation

Shanshan Zhong, Zhongzhan Huang, Shanghua Gao et al.

CVPR 2024arXiv:2312.02439
67
citations
#484

Gaussian Shell Maps for Efficient 3D Human Generation

Rameen Abdal, Wang Yifan, Zifan Shi et al.

CVPR 2024arXiv:2311.17857
67
citations
#485

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Feng Liang, Bichen Wu, Jialiang Wang et al.

CVPR 2024highlightarXiv:2312.17681
67
citations
#486

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Juhong Min, Shyamal Buch, Arsha Nagrani et al.

CVPR 2024arXiv:2404.06511
67
citations
#487

SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks

Xinyu Shi, Zecheng Hao, Zhaofei Yu

CVPR 2024arXiv:2403.14302
67
citations
#488

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

Zhixuan Liang, Yao Mu, Hengbo Ma et al.

CVPR 2024arXiv:2312.11598
67
citations
#489

CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation

Bo-Yuan Sun, Yuqi Yang, Le Zhang et al.

CVPR 2024arXiv:2306.04300
67
citations
#490

Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

Haipeng Liu, Yang Wang, Biao Qian et al.

CVPR 2024arXiv:2403.19898
66
citations
#491

MonoCD: Monocular 3D Object Detection with Complementary Depths

Longfei Yan, Pei Yan, Shengzhou Xiong et al.

CVPR 2024arXiv:2404.03181
66
citations
#492

Video Interpolation with Diffusion Models

Siddhant Jain, Daniel Watson, Aleksander Holynski et al.

CVPR 2024arXiv:2404.01203
66
citations
#493

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Zhenglin Huang, Jinwei Hu, Yiwei He et al.

CVPR 2025arXiv:2412.04292
66
citations
#494

Vlogger: Make Your Dream A Vlog

Shaobin Zhuang, Kunchang Li, Xinyuan Chen et al.

CVPR 2024arXiv:2401.09414
66
citations
#495

Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving

Jinlong Li, Baolu Li, Zhengzhong Tu et al.

CVPR 2024arXiv:2404.04804
66
citations
#496

NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild

Weining Ren, Zihan Zhu, Boyang Sun et al.

CVPR 2024arXiv:2405.18715
66
citations
#497

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

Hang Li, Chengzhi Shen, Philip H.S. Torr et al.

CVPR 2024arXiv:2311.17216
66
citations
#498

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

Yushi Huang, Ruihao Gong, Jing Liu et al.

CVPR 2024highlightarXiv:2311.16503
66
citations
#499

Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships

Sebastian Koch, Narunas Vaskevicius, Mirco Colosi et al.

CVPR 2024arXiv:2402.12259
65
citations
#500

Source-Free Domain Adaptation with Frozen Multimodal Foundation Model

Song Tang, Wenxin Su, Mao Ye et al.

CVPR 2024arXiv:2311.16510
65
citations
#501

EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars

Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas et al.

CVPR 2024arXiv:2404.19110
65
citations
#502

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

Oindrila Saha, Grant Van Horn, Subhransu Maji

CVPR 2024arXiv:2401.02460
65
citations
#503

A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed et al.

CVPR 2024arXiv:2312.12730
65
citations
#504

Task Singular Vectors: Reducing Task Interference in Model Merging

Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli et al.

CVPR 2025arXiv:2412.00081
64
citations
#505

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Shuyuan Tu, Zhen Xing, Xintong Han et al.

CVPR 2025arXiv:2411.17697
64
citations
#506

ZONE: Zero-Shot Instruction-Guided Local Editing

Shanglin Li, Bohan Zeng, Yutang Feng et al.

CVPR 2024arXiv:2312.16794
64
citations
#507

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Zhangyang Qi, Ye Fang, Zeyi Sun et al.

CVPR 2024highlightarXiv:2312.02980
64
citations
#508

UniScene: Unified Occupancy-centric Driving Scene Generation

Bohan Li, Jiazhe Guo, Hongsi Liu et al.

CVPR 2025arXiv:2412.05435
64
citations
#509

Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding

Zhiheng Cheng, Qingyue Wei, Hongru Zhu et al.

CVPR 2024arXiv:2403.18271
64
citations
#510

Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

Shuting He, Henghui Ding

CVPR 2024arXiv:2404.03645
64
citations
#511

IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

Yizhi Song, Zhifei Zhang, Zhe Lin et al.

CVPR 2024arXiv:2403.10701
64
citations
#512

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

guo, Tianwei Lin

CVPR 2024arXiv:2312.10113
64
citations
#513

Koala: Key Frame-Conditioned Long Video-LLM

Reuben Tan, Ximeng Sun, Ping Hu et al.

CVPR 2024highlightarXiv:2404.04346
64
citations
#514

Efficient Dataset Distillation via Minimax Diffusion

Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev et al.

CVPR 2024arXiv:2311.15529
64
citations
#515

HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention

Xiaolong Tang, Meina Kan, Shiguang Shan et al.

CVPR 2024arXiv:2404.06351
63
citations
#516

RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos

Hongchi Xia, Yang Fu, Sifei Liu et al.

CVPR 2024arXiv:2401.12592
63
citations
#517

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Mark Boss, Zixuan Huang, Aaryaman Vasishta et al.

CVPR 2025arXiv:2408.00653
63
citations
#518

ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models

Jeong-gi Kwak, Erqun Dong, Yuhe Jin et al.

CVPR 2024highlightarXiv:2312.01305
63
citations
#519

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed et al.

CVPR 2024arXiv:2404.07449
63
citations
#520

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Ziqi Pang, Tianyuan Zhang, Fujun Luan et al.

CVPR 2025arXiv:2412.01827
63
citations
#521

3D Facial Expressions through Analysis-by-Neural-Synthesis

George Retsinas, Panagiotis Filntisis, Radek Danecek et al.

CVPR 2024arXiv:2404.04104
63
citations
#522

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

Jisu Nam, Heesu Kim, DongJae Lee et al.

CVPR 2024arXiv:2402.09812
63
citations
#523

SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection

Gang Zhang, Chen Junnan, Guohuan Gao et al.

CVPR 2024arXiv:2403.05817
63
citations
#524

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong et al.

CVPR 2025highlightarXiv:2409.12957
63
citations
#525

C3: High-Performance and Low-Complexity Neural Compression from a Single Image or Video

Hyunjik Kim, Matthias Bauer, Lucas Theis et al.

CVPR 2024arXiv:2312.02753
63
citations
#526

Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

Ryan Burgert, Yuancheng Xu, Wenqi Xian et al.

CVPR 2025arXiv:2501.08331
62
citations
#527

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

Yuchao Gu, Yipin Zhou, Bichen Wu et al.

CVPR 2024arXiv:2312.02087
62
citations
#528

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

Junming Chen, Yunfei Liu, Jianan Wang et al.

CVPR 2024arXiv:2401.04747
62
citations
#529

Volumetric Environment Representation for Vision-Language Navigation

Liu, Wenguan Wang, Yi Yang

CVPR 2024highlightarXiv:2403.14158
62
citations
#530

pix2gestalt: Amodal Segmentation by Synthesizing Wholes

Ege Ozguroglu, Ruoshi Liu, Dídac Surís et al.

CVPR 2024highlightarXiv:2401.14398
62
citations
#531

SVGDreamer: Text Guided SVG Generation with Diffusion Model

XiMing Xing, Chuang Wang, Haitao Zhou et al.

CVPR 2024arXiv:2312.16476
62
citations
#532

SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting

Hoon Kim, Minje Jang, Wonjun Yoon et al.

CVPR 2024highlightarXiv:2402.18848
62
citations
#533

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Renjie Pi, Lewei Yao, Jiahui Gao et al.

CVPR 2024highlightarXiv:2311.06612
62
citations
#534

Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

Junuk Cha, Jihyeon Kim, Jae Shin Yoon et al.

CVPR 2024arXiv:2404.00562
62
citations
#535

DePT: Decoupled Prompt Tuning

Ji Zhang, Shihan Wu, Lianli Gao et al.

CVPR 2024arXiv:2309.07439
62
citations
#536

BoQ: A Place is Worth a Bag of Learnable Queries

Amar Ali-bey, Brahim Chaib-draa, Philippe Giguère

CVPR 2024arXiv:2405.07364
62
citations
#537

OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation

Bohao Peng, Xiaoyang Wu, Li Jiang et al.

CVPR 2024arXiv:2403.14418
61
citations
#538

The Neglected Tails in Vision-Language Models

Shubham Parashar, Tian Liu, Zhiqiu Lin et al.

CVPR 2024arXiv:2401.12425
61
citations
#539

DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction

Weiyi Lv, Yuhang Huang, Ning Zhang et al.

CVPR 2024arXiv:2403.02075
61
citations
#540

PEEKABOO: Interactive Video Generation via Masked-Diffusion

Yash Jain, Anshul Nasery, Vibhav Vineet et al.

CVPR 2024arXiv:2312.07509
61
citations
#541

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki et al.

CVPR 2025arXiv:2503.01774
61
citations
#542

ControlRoom3D: Room Generation using Semantic Proxy Rooms

Jonas Schult, Sam Tsai, Lukas Höllein et al.

CVPR 2024arXiv:2312.05208
61
citations
#543

MUSt3R: Multi-view Network for Stereo 3D Reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld et al.

CVPR 2025highlightarXiv:2503.01661
61
citations
#544

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

Oren Kraus, Kian Kenyon-Dean, Saber Saberian et al.

CVPR 2024highlightarXiv:2404.10242
61
citations
#545

Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation

Hyunwoo Ryu, Jiwoo Kim, Hyunseok An et al.

CVPR 2024highlightarXiv:2309.02685
61
citations
#546

MuRF: Multi-Baseline Radiance Fields

Haofei Xu, Anpei Chen, Yuedong Chen et al.

CVPR 2024arXiv:2312.04565
61
citations
#547

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang, Bhishma Dedhia, Niraj Jha

CVPR 2024arXiv:2305.17328
61
citations
#548

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Linyi Jin, Richard Tucker, Zhengqi Li et al.

CVPR 2025arXiv:2412.09621
60
citations
#549

Transcriptomics-guided Slide Representation Learning in Computational Pathology

Guillaume Jaume, Lukas Oldenburg, Anurag Vaidya et al.

CVPR 2024arXiv:2405.11618
60
citations
#550

Multi-Task Dense Prediction via Mixture of Low-Rank Experts

Yuqi Yang, Peng-Tao Jiang, Qibin Hou et al.

CVPR 2024arXiv:2403.17749
60
citations
#551

Loopy-SLAM: Dense Neural SLAM with Loop Closures

Lorenzo Liso, Erik Sandström, Vladimir Yugay et al.

CVPR 2024arXiv:2402.09944
60
citations
#552

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li et al.

CVPR 2025highlightarXiv:2405.17220
60
citations
#553

Stable Flow: Vital Layers for Training-Free Image Editing

Omri Avrahami, Or Patashnik, Ohad Fried et al.

CVPR 2025arXiv:2411.14430
60
citations
#554

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye, Yukang Gan, Xiaoke Huang et al.

CVPR 2025arXiv:2406.12275
60
citations
#555

Seamless Human Motion Composition with Blended Positional Encodings

German Barquero, Sergio Escalera, Cristina Palmero

CVPR 2024arXiv:2402.15509
60
citations
#556

OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion

Xinyu Zhan, Lixin Yang, Yifei Zhao et al.

CVPR 2024arXiv:2403.19417
60
citations
#557

DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting

Demin Yu, Xutao Li, Yunming Ye et al.

CVPR 2024arXiv:2312.06734
60
citations
#558

KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation

Jihua Peng, Yanghong Zhou, Tracy P Y Mok

CVPR 2024arXiv:2404.00658
60
citations
#559

RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

Yao Mu, Tianxing Chen, Zanxin Chen et al.

CVPR 2025highlightarXiv:2504.13059
60
citations
#560

MotionEditor: Editing Video Motion via Content-Aware Diffusion

Shuyuan Tu, Qi Dai, Zhi-Qi Cheng et al.

CVPR 2024arXiv:2311.18830
60
citations
#561

Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction

Devikalyan Das, Christopher Wewer, Raza Yunus et al.

CVPR 2024arXiv:2312.01196
60
citations
#562

Driving Everywhere with Large Language Model Policy Adaptation

Boyi Li, Yue Wang, Jiageng Mao et al.

CVPR 2024arXiv:2402.05932
59
citations
#563

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

Edward Loo, Tianyu Huang, Peng Li et al.

CVPR 2025highlightarXiv:2412.03079
59
citations
#564

Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction

Inhwan Bae, Junoh Lee, Hae-Gon Jeon

CVPR 2024arXiv:2403.18447
59
citations
#565

Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens

Zhangqi Jiang, Junkai Chen, Beier Zhu et al.

CVPR 2025arXiv:2411.16724
59
citations
#566

Orthogonal Adaptation for Modular Customization of Diffusion Models

Ryan Po, Guandao Yang, Kfir Aberman et al.

CVPR 2024highlightarXiv:2312.02432
59
citations
#567

NeRFiller: Completing Scenes via Generative 3D Inpainting

Ethan Weber, Aleksander Holynski, Varun Jampani et al.

CVPR 2024arXiv:2312.04560
59
citations
#568

Point Cloud Pre-training with Diffusion Models

Xiao Zheng, Xiaoshui Huang, Guofeng Mei et al.

CVPR 2024arXiv:2311.14960
59
citations
#569

Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention

Ju-Hyeon Nam, Nur Suriza Syazwany, Su Jung Kim et al.

CVPR 2024arXiv:2405.06284
59
citations
#570

CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Shuyang Sun, Runjia Li, Philip H.S. Torr et al.

CVPR 2024arXiv:2312.07661
59
citations
#571

Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution

Guangyuan Li, Chen Rao, Juncheng Mo et al.

CVPR 2024arXiv:2404.04785
59
citations
#572

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.

CVPR 2024arXiv:2306.12041
59
citations
#573

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

Yanzuo Lu, Manlin Zhang, Jinhua Ma et al.

CVPR 2024highlightarXiv:2402.18078
58
citations
#574

Bilateral Propagation Network for Depth Completion

Jie Tang, Fei-Peng Tian, Boshi An et al.

CVPR 2024arXiv:2403.11270
58
citations
#575

Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation

Hanyang Chi, Jian Pang, Bingfeng Zhang et al.

CVPR 2024arXiv:2405.00378
58
citations
#576

ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration

Chaojun Ni, Guosheng Zhao, Xiaofeng Wang et al.

CVPR 2025arXiv:2411.19548
58
citations
#577

RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception

Ruiyang Hao, Siqi Fan, Yingru Dai et al.

CVPR 2024arXiv:2403.10145
58
citations
#578

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

Hao Li, Xue Yang, Zhaokai Wang et al.

CVPR 2024arXiv:2312.09238
58
citations
#579

WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild

Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng et al.

CVPR 2025arXiv:2409.12259
58
citations
#580

Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance

Tomer Garber, Tom Tirer

CVPR 2024arXiv:2312.16519
58
citations
#581

Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

Wentao Tan, Changxing Ding, Jiayu Jiang et al.

CVPR 2024arXiv:2405.04940
58
citations
#582

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Zhiwu Qing, Shiwei Zhang, Jiayu Wang et al.

CVPR 2024arXiv:2312.04483
57
citations
#583

Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers

Lei Chen, Yuan Meng, Chen Tang et al.

CVPR 2025arXiv:2406.17343
57
citations
#584

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi et al.

CVPR 2025arXiv:2411.14794
57
citations
#585

Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching

Shitong Shao, Zeyuan Yin, Muxin Zhou et al.

CVPR 2024highlightarXiv:2311.17950
57
citations
#586

FedAS: Bridging Inconsistency in Personalized Federated Learning

Xiyuan Yang, Wenke Huang, Mang Ye

CVPR 2024
57
citations
#587

MemFlow: Optical Flow Estimation and Prediction with Memory

Qiaole Dong, Yanwei Fu

CVPR 2024arXiv:2404.04808
57
citations
#588

SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang et al.

CVPR 2025highlightarXiv:2412.09401
57
citations
#589

DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari et al.

CVPR 2025arXiv:2503.02175
57
citations
#590

TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding

Yun Liu, Haolin Yang, Xu Si et al.

CVPR 2024arXiv:2401.08399
56
citations
#591

Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

Jia Guo, Shuai Lu, Weihang Zhang et al.

CVPR 2025arXiv:2405.14325
56
citations
#592

Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives

Ronghui Li, Yuxiang Zhang, Yachao Zhang et al.

CVPR 2024arXiv:2403.10518
56
citations
#593

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

Zicong Fan, Maria Parelli, Maria Kadoglou et al.

CVPR 2024highlightarXiv:2311.18448
56
citations
#594

CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model

Jianhao Zeng, Dan Song, Weizhi Nie et al.

CVPR 2024arXiv:2311.18405
56
citations
#595

Intrinsic Image Diffusion for Indoor Single-view Material Estimation

Peter Kocsis, Vincent Sitzmann, Matthias Nießner

CVPR 2024arXiv:2312.12274
56
citations
#596

HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos

Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon et al.

CVPR 2025highlightarXiv:2411.19167
56
citations
#597

Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D

Karran Pandey, Paul Guerrero, Matheus Gadelha et al.

CVPR 2024highlightarXiv:2312.02190
56
citations
#598

3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei et al.

CVPR 2025arXiv:2412.12507
56
citations
#599

Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification

Kaijie Ren, Lei Zhang

CVPR 2024arXiv:2403.11708
56
citations
#600

Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Zifeng Ding et al.

CVPR 2024arXiv:2311.15977
56
citations