Most Cited CVPR "video caption understanding" Papers

5,589 papers found • Page 11 of 28

#2001

Diversified and Personalized Multi-rater Medical Image Segmentation

Yicheng Wu, Xiangde Luo, Zhe Xu et al.

CVPR 2024highlightarXiv:2403.13417
16
citations
#2002

Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

Junyan Lin, Haoran Chen, Yue Fan et al.

CVPR 2025arXiv:2503.06063
16
citations
#2003

Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models

Gianni Franchi, Olivier Laurent, Maxence Leguéry et al.

CVPR 2024arXiv:2312.15297
16
citations
#2004

Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

Tariq Berrada, Jakob Verbeek, camille couprie et al.

CVPR 2024arXiv:2312.13314
16
citations
#2005

DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model

Zhenghao Pan, Haijin Zeng, Jiezhang Cao et al.

CVPR 2024arXiv:2311.11417
16
citations
#2006

MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes

Bor Shiun Wang, Chien-Yi Wang, Wei-Chen Chiu

CVPR 2024arXiv:2404.08968
16
citations
#2007

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Yunlong Tang, JunJia Guo, Hang Hua et al.

CVPR 2025arXiv:2411.10979
16
citations
#2008

KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

Yu Wang, Xin Li, Shengzhao Wen et al.

CVPR 2024arXiv:2211.08071
16
citations
#2009

InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields

Dongqing Wang, Tong Zhang, Alaa Abboud et al.

CVPR 2024arXiv:2305.15094
16
citations
#2010

Generating Content for HDR Deghosting from Frequency View

Tao Hu, Qingsen Yan, Yuankai Qi et al.

CVPR 2024arXiv:2404.00849
16
citations
#2011

Scaling Vision Pre-Training to 4K Resolution

Baifeng Shi, Boyi Li, Han Cai et al.

CVPR 2025highlightarXiv:2503.19903
16
citations
#2012

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Jeongsoo Choi, Se Jin Park, Minsu Kim et al.

CVPR 2024highlightarXiv:2312.02512
16
citations
#2013

Binarized Low-light Raw Video Enhancement

Gengchen Zhang, Yulun Zhang, Xin Yuan et al.

CVPR 2024arXiv:2403.19944
16
citations
#2014

How to Merge Your Multimodal Models Over Time?

Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth et al.

CVPR 2025arXiv:2412.06712
16
citations
#2015

ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Ronghao Dang, Yuqian Yuan, Wenqi Zhang et al.

CVPR 2025arXiv:2501.05031
16
citations
#2016

SkillMimic: Learning Basketball Interaction Skills from Demonstrations

Yinhuai Wang, Qihan Zhao, Runyi Yu et al.

CVPR 2025highlightarXiv:2408.15270
16
citations
#2017

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

Qing Yu, Mikihiro Tanaka, Kent Fujiwara

CVPR 2024arXiv:2405.04771
16
citations
#2018

Enhancing Vision-Language Pre-training with Rich Supervisions

Yuan Gao, Kunyu Shi, Pengkai Zhu et al.

CVPR 2024highlightarXiv:2403.03346
16
citations
#2019

Adapters Strike Back

Jan-Martin Steitz, Stefan Roth

CVPR 2024arXiv:2406.06820
16
citations
#2020

ScanFormer: Referring Expression Comprehension by Iteratively Scanning

Wei Su, Peihan Miao, Huanzhang Dou et al.

CVPR 2024arXiv:2406.18048
16
citations
#2021

Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields

Haoyuan Wang, Wenbo Hu, Lei Zhu et al.

CVPR 2024arXiv:2403.16224
16
citations
#2022

FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

Jun Zhou, Jiahao Li, Zunnan Xu et al.

CVPR 2025arXiv:2503.19839
16
citations
#2023

DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Yixuan Zhu, Ao Li, Yansong Tang et al.

CVPR 2024arXiv:2404.01424
16
citations
#2024

Plug-and-Play Diffusion Distillation

Yi-Ting Hsiao, Siavash Khodadadeh, Kevin Duarte et al.

CVPR 2024arXiv:2406.01954
16
citations
#2025

Multiview Aerial Visual RECognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?

Aritra Dutta, Srijan Das, Jacob Nielsen et al.

CVPR 2024arXiv:2312.04548
16
citations
#2026

How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.

CVPR 2024arXiv:2403.07203
16
citations
#2027

Multiple View Geometry Transformers for 3D Human Pose Estimation

Ziwei Liao, jialiang zhu, Chunyu Wang et al.

CVPR 2024arXiv:2311.10983
16
citations
#2028

NeRF Director: Revisiting View Selection in Neural Volume Rendering

Wenhui Xiao, Rodrigo Santa Cruz, David Ahmedt-Aristizabal et al.

CVPR 2024arXiv:2406.08839
16
citations
#2029

ViewFusion: Towards Multi-View Consistency via Interpolated Denoising

Xianghui Yang, Gil Avraham, Yan Zuo et al.

CVPR 2024arXiv:2402.18842
16
citations
#2030

Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

Wei Suo, Lijun Zhang, Mengyang Sun et al.

CVPR 2025highlightarXiv:2503.00361
16
citations
#2031

Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

Taeheon Kim, Sebin Shin, Youngjoon Yu et al.

CVPR 2024arXiv:2403.01300
16
citations
#2032

Degradation-Aware Feature Perturbation for All-in-One Image Restoration

Xiangpeng Tian, Xiangyu Liao, Xiao Liu et al.

CVPR 2025arXiv:2505.12630
16
citations
#2033

SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild

Andreas Engelhardt, Amit Raj, Mark Boss et al.

CVPR 2024arXiv:2401.10171
16
citations
#2034

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang et al.

CVPR 2024arXiv:2311.17123
16
citations
#2035

Wired Perspectives: Multi-View Wire Art Embraces Generative AI

Zhiyu Qu, LAN YANG, Honggang Zhang et al.

CVPR 2024arXiv:2311.15421
16
citations
#2036

Adversarial Score Distillation: When score distillation meets GAN

Min Wei, Jingkai Zhou, Junyao Sun et al.

CVPR 2024arXiv:2312.00739
16
citations
#2037

UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

Wenbo Wang, Fangyun Wei, Lei Zhou et al.

CVPR 2025arXiv:2412.02699
16
citations
#2038

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Gaoxiang Cong, Jiadong Pan, Liang Li et al.

CVPR 2025highlightarXiv:2412.08988
16
citations
#2039

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Jiuhai Chen, Jianwei Yang, Haiping Wu et al.

CVPR 2025arXiv:2412.04424
16
citations
#2040

Programmable Motion Generation for Open-Set Motion Control Tasks

Hanchao Liu, Xiaohang Zhan, Shaoli Huang et al.

CVPR 2024highlightarXiv:2405.19283
16
citations
#2041

MaGGIe: Masked Guided Gradual Human Instance Matting

Chuong Huynh, Seoung Wug Oh, Abhinav Shrivastava et al.

CVPR 2024arXiv:2404.16035
16
citations
#2042

TULIP: Transformer for Upsampling of LiDAR Point Clouds

Bin Yang, Patrick Pfreundschuh, Roland Siegwart et al.

CVPR 2024arXiv:2312.06733
16
citations
#2043

Generative Unlearning for Any Identity

Juwon Seo, Sung-Hoon Lee, Tae-Young Lee et al.

CVPR 2024arXiv:2405.09879
15
citations
#2044

Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

Mingqi Jiang, Saeed Khorram, Li Fuxin

CVPR 2024arXiv:2212.06872
15
citations
#2045

Continual Segmentation with Disentangled Objectness Learning and Class Recognition

Yizheng Gong, Siyue Yu, Xiaoyang Wang et al.

CVPR 2024arXiv:2403.03477
15
citations
#2046

A Noisy Elephant in the Room: Is Your Out-of-Distribution Detector Robust to Label Noise?

Galadrielle Humblot-Renaux, Sergio Escalera, Thomas B. Moeslund

CVPR 2024arXiv:2404.01775
15
citations
#2047

Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

Kangning Yin, Shihao Zou, Yuxuan Ge et al.

CVPR 2024highlightarXiv:2403.00691
15
citations
#2048

Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms

Joren Brunekreef, Eric Marcus, Ray Sheombarsing et al.

CVPR 2024arXiv:2311.11837
15
citations
#2049

Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Tian Liu, Huixin Zhang, Shubham Parashar et al.

CVPR 2025arXiv:2406.11148
15
citations
#2050

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Runhui Huang, Xinpeng Ding, Chunwei Wang et al.

CVPR 2025arXiv:2407.08706
15
citations
#2051

Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models

Weiwei Cao, Jianpeng Zhang, Yingda Xia et al.

CVPR 2024arXiv:2404.04936
15
citations
#2052

VisionArena: 230k Real World User-VLM Conversations with Preference Labels

Christopher Chou, Lisa Dunlap, Wei-Lin Chiang et al.

CVPR 2025arXiv:2412.08687
15
citations
#2053

Scaling Properties of Diffusion Models For Perceptual Tasks

Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran et al.

CVPR 2025arXiv:2411.08034
15
citations
#2054

MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation

Yuelong Li, Yafei Mao, Raja Bala et al.

CVPR 2024arXiv:2403.08019
15
citations
#2055

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee et al.

CVPR 2024arXiv:2403.17420
15
citations
#2056

OmniMotionGPT: Animal Motion Generation with Limited Data

Zhangsihao Yang, Mingyuan Zhou, Mengyi Shan et al.

CVPR 2024arXiv:2311.18303
15
citations
#2057

MVSAnywhere: Zero-Shot Multi-View Stereo

Sergio Izquierdo, Mohamed Sayed, Michael Firman et al.

CVPR 2025arXiv:2503.22430
15
citations
#2058

PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek et al.

CVPR 2024arXiv:2402.08657
15
citations
#2059

Deformable One-shot Face Stylization via DINO Semantic Guidance

Yang Zhou, Zichong Chen, Hui Huang

CVPR 2024arXiv:2403.00459
15
citations
#2060

Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration

Hong Chen, Pei Yan, sihe xiang et al.

CVPR 2024highlight
15
citations
#2061

R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

Lijun Sheng, Jian Liang, Zilei Wang et al.

CVPR 2025arXiv:2504.11195
15
citations
#2062

Partial-to-Partial Shape Matching with Geometric Consistency

Viktoria Ehm, Maolin Gao, Paul Roetzer et al.

CVPR 2024arXiv:2404.12209
15
citations
#2063

Multiway Point Cloud Mosaicking with Diffusion and Global Optimization

Shengze Jin, Iro Armeni, Marc Pollefeys et al.

CVPR 2024arXiv:2404.00429
15
citations
#2064

Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

Yuanmin Tang, Jing Yu, Keke Gai et al.

CVPR 2025arXiv:2503.17109
15
citations
#2065

A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning

Yuelin Zhang, Pengyu Zheng, Wanquan Yan et al.

CVPR 2024arXiv:2403.02611
15
citations
#2066

AniDoc: Animation Creation Made Easier

Yihao Meng, Hao Ouyang, Hanlin Wang et al.

CVPR 2025arXiv:2412.14173
15
citations
#2067

ManiFPT: Defining and Analyzing Fingerprints of Generative Models

Hae Jin Song, Mahyar Khayatkhoei, Wael AbdAlmageed

CVPR 2024arXiv:2402.10401
15
citations
#2068

Efficient Vision-Language Pre-training by Cluster Masking

Zihao Wei, Zixuan Pan, Andrew Owens

CVPR 2024arXiv:2405.08815
15
citations
#2069

HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High-and Low-Frequency Information of Parametric Models

Yifan Yang, Dong Liu, Shuhai Zhang et al.

CVPR 2024arXiv:2404.04876
15
citations
#2070

CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization

Junhao Xu, Yanan Zhang, Zhi Cai et al.

CVPR 2025arXiv:2503.03430
15
citations
#2071

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Xiangyuan Xue, Zeyu Lu, Di Huang et al.

CVPR 2025arXiv:2409.01392
15
citations
#2072

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Qirui Jiao, Daoyuan Chen, Yilun Huang et al.

CVPR 2025arXiv:2408.04594
15
citations
#2073

MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints

Pengfei Xie, Wenqiang Xu, Tutian Tang et al.

CVPR 2024arXiv:2404.10227
15
citations
#2074

MBQ: Modality-Balanced Quantization for Large Vision-Language Models

Shiyao Li, Yingchun Hu, Xuefei Ning et al.

CVPR 2025arXiv:2412.19509
15
citations
#2075

Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter

Zhengyi Zhong, Weidong Bao, Ji Wang et al.

CVPR 2025arXiv:2502.20709
15
citations
#2076

Personalized Preference Fine-tuning of Diffusion Models

Meihua Dang, Anikait Singh, Linqi Zhou et al.

CVPR 2025arXiv:2501.06655
15
citations
#2077

Active Data Curation Effectively Distills Large-Scale Multimodal Models

Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem et al.

CVPR 2025arXiv:2411.18674
15
citations
#2078

NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation

Vikas Thamizharasan, Difan Liu, Matthew Fisher et al.

CVPR 2024arXiv:2405.15217
15
citations
#2079

Bidirectional Autoregessive Diffusion Model for Dance Generation

Canyu Zhang, Youbao Tang, NING Zhang et al.

CVPR 2024
15
citations
#2080

FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs

Mothilal Asokan, Kebin wu, Fatima Albreiki

CVPR 2025arXiv:2504.01916
15
citations
#2081

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Yiren Song, Pei Yang, Hai Ci et al.

CVPR 2025arXiv:2412.11638
15
citations
#2082

One-Shot Structure-Aware Stylized Image Synthesis

Hansam Cho, Jonghyun Lee, Seunggyu Chang et al.

CVPR 2024arXiv:2402.17275
15
citations
#2083

Assessing and Learning Alignment of Unimodal Vision and Language Models

Le Zhang, Qian Yang, Aishwarya Agrawal

CVPR 2025highlightarXiv:2412.04616
15
citations
#2084

X-Dyna: Expressive Dynamic Human Image Animation

Di Chang, Hongyi Xu, You Xie et al.

CVPR 2025highlightarXiv:2501.10021
15
citations
#2085

DRAWER: Digital Reconstruction and Articulation With Environment Realism

Hongchi Xia, Entong Su, Marius Memmel et al.

CVPR 2025arXiv:2504.15278
15
citations
#2086

SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields

Quentin HERAU, Nathan Piasco, Moussab Bennehar et al.

CVPR 2024arXiv:2311.15803
15
citations
#2087

DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

Jinxiu Liu, Shaoheng Lin, Yinxiao Li et al.

CVPR 2025arXiv:2412.11100
15
citations
#2088

ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression

Wei Jiang, Junru Li, Kai Zhang et al.

CVPR 2025arXiv:2410.09706
15
citations
#2089

LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes

Shanlin Sun, Bingbing Zhuang, Ziyu Jiang et al.

CVPR 2024highlightarXiv:2405.00900
15
citations
#2090

Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models

Jiaming Zhang, Junhong Ye, Xingjun Ma et al.

CVPR 2025arXiv:2410.05346
15
citations
#2091

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Ziyang Luo, Haoning Wu, Dongxu Li et al.

CVPR 2025arXiv:2411.13281
15
citations
#2092

Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation

Xiaoyang Chen, Hao Zheng, Yuemeng LI et al.

CVPR 2024arXiv:2311.10696
15
citations
#2093

S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting

Yecong Wan, Mingwen Shao, Yuanshuo Cheng et al.

CVPR 2025arXiv:2503.04314
15
citations
#2094

On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving

Kaituo Feng, Changsheng Li, Dongchun Ren et al.

CVPR 2024arXiv:2403.01238
15
citations
#2095

TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

Xin Wang, Kai Chen, Jiaming Zhang et al.

CVPR 2025arXiv:2411.13136
15
citations
#2096

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez et al.

CVPR 2025arXiv:2503.13399
15
citations
#2097

Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Yingdong Shi, Changming Li, Yifan Wang et al.

CVPR 2025arXiv:2503.20483
15
citations
#2098

Docopilot: Improving Multimodal Models for Document-Level Understanding

Yuchen Duan, Zhe Chen, Yusong Hu et al.

CVPR 2025arXiv:2507.14675
15
citations
#2099

Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian et al.

CVPR 2025arXiv:2503.13646
15
citations
#2100

GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery

Enguang Wang, Zhimao Peng, Zhengyuan Xie et al.

CVPR 2025arXiv:2403.09974
15
citations
#2101

Customization Assistant for Text-to-Image Generation

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu et al.

CVPR 2024arXiv:2312.03045
15
citations
#2102

GenN2N: Generative NeRF2NeRF Translation

Xiangyue Liu, Han Xue, Kunming Luo et al.

CVPR 2024arXiv:2404.02788
15
citations
#2103

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng et al.

CVPR 2025arXiv:2309.03904
15
citations
#2104

DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching

Shuzhe Wang, Juho Kannala, Daniel Barath

CVPR 2024arXiv:2306.12547
15
citations
#2105

Instance-Aware Group Quantization for Vision Transformers

Jaehyeon Moon, Dohyung Kim, Jun Yong Cheon et al.

CVPR 2024arXiv:2404.00928
15
citations
#2106

Leveraging Predicate and Triplet Learning for Scene Graph Generation

Jiankai Li, Yunhong Wang, Xiefan Guo et al.

CVPR 2024arXiv:2406.02038
15
citations
#2107

Neural Clustering based Visual Representation Learning

Guikun Chen, Xia Li, Yi Yang et al.

CVPR 2024arXiv:2403.17409
15
citations
#2108

TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes

Xuying Zhang, Bo-Wen Yin, yuming chen et al.

CVPR 2024arXiv:2312.04248
15
citations
#2109

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Yunhao Ge, Yihe Tang, Jiashu Xu et al.

CVPR 2024highlightarXiv:2405.09546
15
citations
#2110

DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

Akash Sengupta, Thiemo Alldieck, NIKOS KOLOTOUROS et al.

CVPR 2024arXiv:2404.00485
15
citations
#2111

BiPer: Binary Neural Networks using a Periodic Function

Edwin Vargas, Claudia Correa, Carlos Hinojosa et al.

CVPR 2024arXiv:2404.01278
15
citations
#2112

En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data

Yifang Men, Biwen Lei, Yuan Yao et al.

CVPR 2024arXiv:2401.01173
15
citations
#2113

Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction

Jianping Jiang, xinyu zhou, Bingxuan Wang et al.

CVPR 2024arXiv:2403.07346
15
citations
#2114

GenesisTex: Adapting Image Denoising Diffusion to Texture Space

Chenjian Gao, Boyan Jiang, Xinghui Li et al.

CVPR 2024arXiv:2403.17782
15
citations
#2115

Learned Image Compression with Dictionary-based Entropy Model

Jingbo Lu, Leheng Zhang, Xingyu Zhou et al.

CVPR 2025arXiv:2504.00496
15
citations
#2116

UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and Unfavorable Sets

Youngju Na, Woo Jae Kim, Kyu Han et al.

CVPR 2024arXiv:2403.05086
15
citations
#2117

DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data

Chengxiang Fan, Muzhi Zhu, Hao Chen et al.

CVPR 2024arXiv:2405.10185
15
citations
#2118

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Zhihang Liu, Chen-Wei Xie, Pandeng Li et al.

CVPR 2025arXiv:2503.16036
15
citations
#2119

ILIAS: Instance-Level Image retrieval At Scale

Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko et al.

CVPR 2025arXiv:2502.11748
15
citations
#2120

SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models

Zilan Wang, Junfeng Guo, Jiacheng Zhu et al.

CVPR 2025arXiv:2412.04852
15
citations
#2121

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Xuesong Chen, Linjiang Huang, Tao Ma et al.

CVPR 2025arXiv:2505.16805
15
citations
#2122

4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video

Qiang Hu, Zihan Zheng, Houqiang Zhong et al.

CVPR 2025arXiv:2503.18421
15
citations
#2123

PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

Kun Yuan, Hongbo Liu, Mading Li et al.

CVPR 2024arXiv:2405.17765
15
citations
#2124

Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments

Liyuan Zhu, Shengyu Huang, Konrad Schindler et al.

CVPR 2024highlightarXiv:2312.09138
15
citations
#2125

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

Youngjoon Jang, Haran Raajesh, Liliane Momeni et al.

CVPR 2025arXiv:2501.09754
15
citations
#2126

Towards Universal Soccer Video Understanding

Jiayuan Rao, Haoning Wu, Hao Jiang et al.

CVPR 2025arXiv:2412.01820
15
citations
#2127

MangaNinja: Line Art Colorization with Precise Reference Following

Zhiheng Liu, Ka Leong Cheng, Xi Chen et al.

CVPR 2025highlightarXiv:2501.08332
15
citations
#2128

Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.

CVPR 2024arXiv:2403.07214
15
citations
#2129

FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models

Haokun Chen, Hang Li, Yao Zhang et al.

CVPR 2025arXiv:2410.04810
15
citations
#2130

Dynamic Camera Poses and Where to Find Them

Chris Rockwell, Joseph Tung, Tsung-Yi Lin et al.

CVPR 2025arXiv:2504.17788
15
citations
#2131

Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes

Chi-Hsi Kung, 書緯 呂, Yi-Hsuan Tsai et al.

CVPR 2024arXiv:2311.17948
15
citations
#2132

OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation

Ganlong Zhao, Guanbin Li, Weikai Chen et al.

CVPR 2024arXiv:2403.17334
15
citations
#2133

The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes

Myeongseob Ko, Feiyang Kang, Weiyan Shi et al.

CVPR 2024arXiv:2402.08922
15
citations
#2134

Cyclic Learning for Binaural Audio Generation and Localization

Zhaojian Li, Bin Zhao, Yuan Yuan

CVPR 2024
15
citations
#2135

Synthetic Data is an Elegant GIFT for Continual Vision-Language Models

Bin Wu, Wuxuan Shi, Jinqiao Wang et al.

CVPR 2025arXiv:2503.04229
15
citations
#2136

FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction

Yifan Wang, Peishan Yang, Zhen Xu et al.

CVPR 2025
15
citations
#2137

STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Haiyi Qiu, Minghe Gao, Long Qian et al.

CVPR 2025arXiv:2412.00161
15
citations
#2138

A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing

Li Maomao, Yu Li, Tianyu Yang et al.

CVPR 2024arXiv:2312.05856
15
citations
#2139

RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds

Kang You, Tong Chen, Dandan Ding et al.

CVPR 2025arXiv:2503.12382
15
citations
#2140

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups

Simindokht Jahangard, Zhixi Cai, Shiki Wen et al.

CVPR 2024arXiv:2404.04458
15
citations
#2141

Improving Spectral Snapshot Reconstruction with Spectral-Spatial Rectification

Jiancheng Zhang, Haijin Zeng, Yongyong Chen et al.

CVPR 2024
15
citations
#2142

NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs

Michael Fischer, Zhengqin Li, Thu Nguyen-Phuoc et al.

CVPR 2024arXiv:2402.08622
15
citations
#2143

What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs

Alex Trevithick, Matthew Chan, Towaki Takikawa et al.

CVPR 2024arXiv:2401.02411
15
citations
#2144

JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba

Xiaoyong Lu, Songlin Du

CVPR 2025arXiv:2503.03437
15
citations
#2145

IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera

Jian Huang, Chengrui Dong, Xuanhua Chen et al.

CVPR 2025highlightarXiv:2410.08107
15
citations
#2146

CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images

Chen Cheng, Jiacheng Wei, Tianrun Chen et al.

CVPR 2025arXiv:2504.04753
15
citations
#2147

NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images

Yufei Han, Heng Guo, Koki Fukai et al.

CVPR 2024arXiv:2406.07111
15
citations
#2148

ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

Jun-Kun Chen, Samuel Rota Bulò, Norman Müller et al.

CVPR 2024arXiv:2406.09404
15
citations
#2149

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Sangmin Lee, Bolin Lai, Fiona Ryan et al.

CVPR 2024arXiv:2403.02090
15
citations
#2150

SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes

Cheng-De Fan, Chen-Wei Chang, Yi-Ruei Liu et al.

CVPR 2025arXiv:2410.17249
15
citations
#2151

Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective

Jinjing Zhao, Fangyun Wei, Chang Xu

CVPR 2024
15
citations
#2152

Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Baoquan Zhang, Huaibin Wang, Luo Chuyao et al.

CVPR 2024arXiv:2403.10071
15
citations
#2153

HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions

Hao Xu, Li Haipeng, Yinqiao Wang et al.

CVPR 2024arXiv:2403.18575
14
citations
#2154

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

Guy Yariv, Yuval Kirstain, Amit Zohar et al.

CVPR 2025arXiv:2501.03059
14
citations
#2155

Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition

Zihan Wang, Siyang Song, Cheng Luo et al.

CVPR 2024arXiv:2404.06443
14
citations
#2156

Unifying Automatic and Interactive Matting with Pretrained ViTs

Zixuan Ye, Wenze Liu, He Guo et al.

CVPR 2024
14
citations
#2157

Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments

Luke Rowe, Roger Girgis, Anthony Gosselin et al.

CVPR 2025arXiv:2503.22496
14
citations
#2158

SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects

Abhinav Kumar, Yuliang Guo, Xinyu Huang et al.

CVPR 2024arXiv:2403.20318
14
citations
#2159

ProTeCt: Prompt Tuning for Taxonomic Open Set Classification

Tz-Ying Wu, Chih-Hui Ho, Nuno Vasconcelos

CVPR 2024arXiv:2306.02240
14
citations
#2160

ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Junzhe Chen, Tianshu Zhang, Shiyu Huang et al.

CVPR 2025arXiv:2411.15268
14
citations
#2161

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Zeren Jiang, Chen Guo, Manuel Kaufmann et al.

CVPR 2024arXiv:2406.01595
14
citations
#2162

Unified Entropy Optimization for Open-Set Test-Time Adaptation

Zhengqing Gao, Xu-Yao Zhang, Cheng-Lin Liu

CVPR 2024arXiv:2404.06065
14
citations
#2163

STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer et al.

CVPR 2025highlightarXiv:2408.16807
14
citations
#2164

Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera

Zhengdi Yu, Stefanos Zafeiriou, Tolga Birdal

CVPR 2025highlightarXiv:2412.12861
14
citations
#2165

Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Jingyun Wang, Guoliang Kang

CVPR 2024arXiv:2408.06747
14
citations
#2166

Region-Based Representations Revisited

Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao et al.

CVPR 2024arXiv:2402.02352
14
citations
#2167

SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective

Yu-Bang Zheng, Xile Zhao, Junhua Zeng et al.

CVPR 2024highlightarXiv:2305.14912
14
citations
#2168

Misalignment-Robust Frequency Distribution Loss for Image Transformation

Zhangkai Ni, Juncheng Wu, Zian Wang et al.

CVPR 2024arXiv:2402.18192
14
citations
#2169

BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz et al.

CVPR 2025arXiv:2411.15232
14
citations
#2170

Functional Diffusion

Biao Zhang, Peter Wonka

CVPR 2024arXiv:2311.15435
14
citations
#2171

GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding

Yawen Shao, Wei Zhai, Yuhang Yang et al.

CVPR 2025arXiv:2411.19626
14
citations
#2172

Novel Class Discovery for Ultra-Fine-Grained Visual Categorization

Qi Jia, Yaqi Cai, Qi Jia et al.

CVPR 2024highlightarXiv:2405.06283
14
citations
#2173

Human Motion Instruction Tuning

Lei Li, Sen Jia, Jianhao Wang et al.

CVPR 2025arXiv:2411.16805
14
citations
#2174

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Khiem Vuong, Anurag Ghosh, Deva Ramanan et al.

CVPR 2025arXiv:2504.13157
14
citations
#2175

EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Yumeng Liu, Xiaoxiao Long, Zemin Yang et al.

CVPR 2025arXiv:2411.14280
14
citations
#2176

DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar Jr et al.

CVPR 2025arXiv:2503.14880
14
citations
#2177

Pippo: High-Resolution Multi-View Humans from a Single Image

Yash Kant, Ethan Weber, Jin Kyu Kim et al.

CVPR 2025highlightarXiv:2502.07785
14
citations
#2178

Memory-based Adapters for Online 3D Scene Perception

Xiuwei Xu, Chong Xia, Ziwei Wang et al.

CVPR 2024arXiv:2403.06974
14
citations
#2179

Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification

Tingting Zheng, Kui Jiang, Hongxun Yao

CVPR 2024highlightarXiv:2403.07939
14
citations
#2180

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Weijia Wu, Mingyu Liu, Zeyu Zhu et al.

CVPR 2025arXiv:2411.15262
14
citations
#2181

Navigate Beyond Shortcuts: Debiased Learning Through the Lens of Neural Collapse

Yining Wang, Junjie Sun, Chenyue Wang et al.

CVPR 2024highlightarXiv:2405.05587
14
citations
#2182

MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Kyungho Bae, Jinhyung Kim, Sihaeng Lee et al.

CVPR 2025highlightarXiv:2503.15871
14
citations
#2183

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Jian Yang, Dacheng Yin, Yizhou Zhou et al.

CVPR 2025arXiv:2410.10798
14
citations
#2184

Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving

Mozhgan Pourkeshavarz, Mohammad Sabokrou, Amir Rasouli

CVPR 2024arXiv:2306.15755
14
citations
#2185

Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models

Xin Zhang, Yanzhao Zhang, Wen Xie et al.

CVPR 2025
14
citations
#2186

NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

Dar-Yen Chen, Hmrishav Bandyopadhyay, Kai Zou et al.

CVPR 2025arXiv:2412.02030
14
citations
#2187

SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing

Seokhyeon Hong, Chaelin Kim, Serin Yoon et al.

CVPR 2025arXiv:2503.13836
14
citations
#2188

3D Neural Edge Reconstruction

Lei Li, Songyou Peng, Zehao Yu et al.

CVPR 2024arXiv:2405.19295
14
citations
#2189

MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM

Vladimir Yugay, Theo Gevers, Martin R. Oswald

CVPR 2025arXiv:2411.16785
14
citations
#2190

Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting

Runsong Zhu, Shi Qiu, ZHENGZHE LIU et al.

CVPR 2025arXiv:2503.14029
14
citations
#2191

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

Junwen Xiong, Peng Zhang, Tao You et al.

CVPR 2024arXiv:2403.01226
14
citations
#2192

Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space

Yifan Zhou, Zeqi Xiao, Shuai Yang et al.

CVPR 2025arXiv:2503.09419
14
citations
#2193

SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining

Mingjin Zhang, Xiaolong Li, Fei Gao et al.

CVPR 2025
14
citations
#2194

MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading

Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got et al.

CVPR 2024arXiv:2312.13091
14
citations
#2195

From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing

Jingxuan Wei, Cheng Tan, Qi Chen et al.

CVPR 2025highlightarXiv:2411.11916
14
citations
#2196

Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching

Rui Gong, Weide Liu, ZAIWANG GU et al.

CVPR 2024arXiv:2402.19270
14
citations
#2197

HEAL-SWIN: A Vision Transformer On The Sphere

Oscar Carlsson, Jan E. Gerken, Hampus Linander et al.

CVPR 2024arXiv:2307.07313
14
citations
#2198

DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data

Hanrong Ye, Dan Xu

CVPR 2024arXiv:2403.15389
14
citations
#2199

Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples

Junhao Dong, Piotr Koniusz, Junxi Chen et al.

CVPR 2024
14
citations
#2200

Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users

Daniela Massiceti, Camilla Longden, Agnieszka Słowik et al.

CVPR 2024arXiv:2311.17315
14
citations