Most Cited ICCV "single-image 3d generation" Papers

2,701 papers found • Page 2 of 14

#201

TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

Xiaowen Ma, Zhen-Liang Ni, Xinghao Chen

ICCV 2025 • arXiv:2411.17473
19
citations
#202

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Haiwen Diao, Xiaotong Li, Yufeng Cui et al.

ICCV 2025 • Highlight • arXiv:2502.06788
19
citations
#203

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Yuanhao Cai, He Zhang, Kai Zhang et al.

ICCV 2025 • arXiv:2411.14384
18
citations
#204

FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

Hang Guo, Yawei Li, Taolin Zhang et al.

ICCV 2025 • arXiv:2503.23367
18
citations
#205

DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

Zheng-Peng Duan, Jiawei Zhang, Xin Jin et al.

ICCV 2025 • arXiv:2503.23580
18
citations
#206

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

Shoubin Yu, Difan Liu, Ziqiao Ma et al.

ICCV 2025 • arXiv:2503.14350
18
citations
#207

LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement

Jieming Bian, Lei Wang, Letian Zhang et al.

ICCV 2025 • arXiv:2411.14961
18
citations
#208

Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

Hongcheng Gao, Tianyu Pang, Chao Du et al.

ICCV 2025 • arXiv:2410.12777
18
citations
#209

3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

Xiaobiao Du, Yida Wang, Haiyang Sun et al.

ICCV 2025 • arXiv:2406.04875
18
citations
#210

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Yuqing Wang, Zhijie Lin, Yao Teng et al.

ICCV 2025 • arXiv:2503.16430
18
citations
#211

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Dewei Zhou, Mingwei Li, Zongxin Yang et al.

ICCV 2025 • arXiv:2503.12885
18
citations
#212

FlowTok: Flowing Seamlessly Across Text and Image Tokens

Ju He, Qihang Yu, Qihao Liu et al.

ICCV 2025 • arXiv:2503.10772
18
citations
#213

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

Qihan Huang, Weilong Dai, Jinlong Liu et al.

ICCV 2025 • arXiv:2503.23905
18
citations
#214

VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs

Qiucheng Wu, Handong Zhao, Michael Saxon et al.

ICCV 2025
18
citations
#215

Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs

Soonbin Lee, Fangwen Shu, Yago Sanchez de la Fuente et al.

ICCV 2025 • arXiv:2501.03399
18
citations
#216

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Junyuan Zhang, Qintong Zhang, Bin Wang et al.

ICCV 2025 • arXiv:2412.02592
17
citations
#217

IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

Yinwei Wu, Xianpan Zhou, Bing Ma et al.

ICCV 2025 • arXiv:2409.08240
17
citations
#218

Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

Li, Nikolaos Tsagkas, Jifei Song et al.

ICCV 2025 • arXiv:2408.10123
17
citations
#219

ViLLa: Video Reasoning Segmentation with Large Language Model

Rongkun Zheng, Lu Qi, Xi Chen et al.

ICCV 2025 • arXiv:2407.14500
17
citations
#220

Magic Insert: Style-Aware Drag-and-Drop

Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa et al.

ICCV 2025 • Highlight • arXiv:2407.02489
17
citations
#221

USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Xiangxiang Chu, Renda Li, Yong Wang

ICCV 2025 • arXiv:2503.06132
17
citations
#222

AllTracker: Efficient Dense Point Tracking at High Resolution

Adam Harley, Yang You, Yang Zheng et al.

ICCV 2025 • arXiv:2506.07310
17
citations
#223

CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers

Dimitrios Mallis, Ahmet Karadeniz, Sebastian Cavada et al.

ICCV 2025 • arXiv:2412.13810
17
citations
#224

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

Haoxuan Wang, Jinlong Peng, Qingdong He et al.

ICCV 2025 • arXiv:2503.09277
17
citations
#225

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

Yuqi Wu, Wenzhao Zheng, Sicheng Zuo et al.

ICCV 2025 • arXiv:2412.04380
17
citations
#226

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge, Ziyi Chen, Jintao Lin et al.

ICCV 2025 • arXiv:2412.09616
17
citations
#227

FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance

Haicheng Wang, Zhemeng Yu, Gabriele Spadaro et al.

ICCV 2025 • arXiv:2501.02430
17
citations
#228

DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Jiawei He, Danshi Li, Xinqiang Yu et al.

ICCV 2025 • Highlight • arXiv:2507.02747
17
citations
#229

UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

Yuping Wang, Xiangyu Huang, Xiaokang Sun et al.

ICCV 2025 • arXiv:2503.24381
17
citations
#230

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Tianqi Liu, Zihao Huang, Zhaoxi Chen et al.

ICCV 2025 • arXiv:2503.20785
17
citations
#231

VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

Xindi Yang, Baolu Li, Yiming Zhang et al.

ICCV 2025 • arXiv:2503.23368
17
citations
#232

DreamRelation: Relation-Centric Video Customization

Yujie Wei, Shiwei Zhang, Hangjie Yuan et al.

ICCV 2025 • arXiv:2503.07602
17
citations
#233

WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

Zizhang Li, Hong-Xing Yu, Wei Liu et al.

ICCV 2025 • Highlight • arXiv:2505.18151
16
citations
#234

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Shijie Zhou, Alexander Vilesov, Xuehai He et al.

ICCV 2025 • arXiv:2508.02095
16
citations
#235

AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance

Yilin Wei, Mu Lin, Yuhao Lin et al.

ICCV 2025 • arXiv:2503.07360
16
citations
#236

An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar et al.

ICCV 2025 • arXiv:2501.05453
16
citations
#237

GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

Pinxin Liu, Luchuan Song, Junhua Huang et al.

ICCV 2025 • arXiv:2501.18898
16
citations
#238

OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Yongsheng Yu, Ziyun Zeng, Haitian Zheng et al.

ICCV 2025 • arXiv:2503.08677
16
citations
#239

FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Hao Li, Xiang Chen, Jiangxin Dong et al.

ICCV 2025 • arXiv:2412.01427
16
citations
#240

POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction

Songyan Zhang, Yongtao Ge, Jinyuan Tian et al.

ICCV 2025 • arXiv:2504.05692
16
citations
#241

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye, Honglin Lin, Leyan Ou et al.

ICCV 2025 • arXiv:2412.17007
16
citations
#242

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

Jiarui Wang, Huiyu Duan, Yu Zhao et al.

ICCV 2025 • Highlight • arXiv:2504.08358
16
citations
#243

Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering

Kaixuan Jiang, Yang Liu, Weixing Chen et al.

ICCV 2025 • arXiv:2503.11117
15
citations
#244

UnZipLoRA: Separating Content and Style from a Single Image

Chang Liu, Viraj Shah, Aiyu Cui et al.

ICCV 2025 • Highlight • arXiv:2412.04465
15
citations
#245

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Han Wang, Yuxiang Nie, Yongjie Ye et al.

ICCV 2025 • arXiv:2412.09530
15
citations
#246

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow

Simon Boeder, Fabian Gigengack, Benjamin Risse

ICCV 2025 • arXiv:2502.17288
15
citations
#247

Efficient Track Anything

Yunyang Xiong, Chong Zhou, Xiaoyu Xiang et al.

ICCV 2025 • arXiv:2411.18933
15
citations
#248

Generating Multi-Image Synthetic Data for Text-to-Image Customization

Nupur Kumari, Xi Yin, Jun-Yan Zhu et al.

ICCV 2025 • arXiv:2502.01720
15
citations
#249

I2VControl: Disentangled and Unified Video Motion Synthesis Control

Wanquan Feng, Tianhao Qi, Jiawei Liu et al.

ICCV 2025 • arXiv:2411.17765
15
citations
#250

Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models

Hongyang Wei, Shuaizheng Liu, Chun Yuan et al.

ICCV 2025 • arXiv:2503.11073
15
citations
#251

LONG3R: Long Sequence Streaming 3D Reconstruction

Zhuoguang Chen, Minghui Qin, Tianyuan Yuan et al.

ICCV 2025 • arXiv:2507.18255
15
citations
#252

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

Jiacheng Ruan, Wenzhen Yuan, Xian Gao et al.

ICCV 2025 • arXiv:2503.07478
15
citations
#253

V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video

Jianqi Chen, Biao Zhang, Xiangjun Tang et al.

ICCV 2025 • arXiv:2503.09631
15
citations
#254

BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis

David Svitov, Pietro Morerio, Lourdes Agapito et al.

ICCV 2025 • arXiv:2411.08508
15
citations
#255

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Yuheng Shi, Minjing Dong, Chang Xu

ICCV 2025 • arXiv:2411.09219
15
citations
#256

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Jeongseok Hyun, Sukjun Hwang, Su Ho Han et al.

ICCV 2025 • arXiv:2507.07990
14
citations
#257

MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

Hui Sun, Shiyin Lu, Huanyu Wang et al.

ICCV 2025 • arXiv:2501.02885
14
citations
#258

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen, Bohan Liu, Chenjia Li et al.

ICCV 2025 • arXiv:2503.21056
14
citations
#259

Referring to Any Person

Qing Jiang, Lin Wu, Zhaoyang Zeng et al.

ICCV 2025 • arXiv:2503.08507
14
citations
#260

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Tianyi Wei, Yifan Zhou, Dongdong Chen et al.

ICCV 2025 • arXiv:2503.16153
14
citations
#261

Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Jiawen Zhu, Yew-Soon Ong, Chunhua Shen et al.

ICCV 2025 • arXiv:2410.10289
14
citations
#262

UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI

Fangwei Zhong, Kui Wu, Churan Wang et al.

ICCV 2025 • Highlight • arXiv:2412.20977
14
citations
#263

Unleashing Vecset Diffusion Model for Fast Shape Generation

Zeqiang Lai, Zhao Yunfei, Zibo Zhao et al.

ICCV 2025 • Highlight • arXiv:2503.16302
14
citations
#264

LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

Chin-Yang Lin, Cheng Sun, Fu-En Yang et al.

ICCV 2025 • arXiv:2508.14041
14
citations
#265

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Junsung Park, Jungbeom Lee, Jongyoon Song et al.

ICCV 2025 • arXiv:2501.10913
14
citations
#266

Find Any Part in 3D

Ziqi Ma, Yisong Yue, Georgia Gkioxari

ICCV 2025 • Highlight • arXiv:2411.13550
14
citations
#267

DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Jiangran Lyu, Ziming Li, Xuesong Shi et al.

ICCV 2025 • arXiv:2503.16806
14
citations
#268

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

Ruining Li, Chuanxia Zheng, Christian Rupprecht et al.

ICCV 2025 • Highlight • arXiv:2503.22677
14
citations
#269

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Yue Li, Meng Tian, Zhenyu Lin et al.

ICCV 2025 • arXiv:2503.21505
14
citations
#270

Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Kesen Zhao, Beier Zhu, Qianru Sun et al.

ICCV 2025 • arXiv:2504.18397
14
citations
#271

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Junwei Luo, Yingying Zhang, Xue Yang et al.

ICCV 2025 • arXiv:2503.07588
14
citations
#272

Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis

Yanzuo Lu, Yuxi Ren, Xin Xia et al.

ICCV 2025 • Highlight • arXiv:2507.18569
14
citations
#273

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Rui Chen, Zehuan Wu, Yichen Liu et al.

ICCV 2025 • arXiv:2412.04842
13
citations
#274

Semi-supervised Concept Bottleneck Models

Lijie Hu, Tianhao Huang, Huanyi Xie et al.

ICCV 2025 • arXiv:2406.18992
13
citations
#275

TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

Felix Krause, Timy Phan, Ming Gui et al.

ICCV 2025 • arXiv:2501.04765
13
citations
#276

Growing a Twig to Accelerate Large Vision-Language Models

Zhenwei Shao, Mingyang Wang, Zhou Yu et al.

ICCV 2025 • arXiv:2503.14075
13
citations
#277

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.

ICCV 2025 • Highlight • arXiv:2503.07535
13
citations
#278

Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Weirong Chen, Ganlin Zhang, Felix Wimbauer et al.

ICCV 2025 • arXiv:2504.14516
13
citations
#279

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Sung-Yeon Park, Can Cui, Yunsheng Ma et al.

ICCV 2025 • arXiv:2503.12772
13
citations
#280

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.

ICCV 2025 • arXiv:2507.02664
13
citations
#281

RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Yijing Lin, Mengqi Huang, Shuhan Zhuang et al.

ICCV 2025 • arXiv:2503.10406
13
citations
#282

BVINet: Unlocking Blind Video Inpainting with Zero Annotations

Zhiliang Wu, Kerui Chen, Kun Li et al.

ICCV 2025 • arXiv:2502.01181
13
citations
#283

LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

Juliette Marrie, Romain Menegaux, Michael Arbel et al.

ICCV 2025 • arXiv:2410.14462
13
citations
#284

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

Cong Wei, Yujie Zhong, Yingsen Zeng et al.

ICCV 2025 • arXiv:2412.14006
13
citations
#285

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

Hengjia Li, Lifan Jiang, Xi Xiao et al.

ICCV 2025 • arXiv:2503.12689
13
citations
#286

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Jiale Cheng, Ruiliang Lyu, Xiaotao Gu et al.

ICCV 2025 • arXiv:2503.20491
13
citations
#287

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen, Yifang Men, Yuan Yao et al.

ICCV 2025 • arXiv:2501.05020
13
citations
#288

Detect Anything 3D in the Wild

Hanxue Zhang, Haoran Jiang, Qingsong Yao et al.

ICCV 2025 • arXiv:2504.07958
13
citations
#289

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

Yiran Qin, Li Kang, Xiufeng Song et al.

ICCV 2025 • arXiv:2503.16408
13
citations
#290

MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

Zebin He, Mx Yang, Shuhui Yang et al.

ICCV 2025 • Highlight • arXiv:2503.10289
13
citations
#291

GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

Guanxing Lu, Baoxiong Jia, Puhao Li et al.

ICCV 2025 • arXiv:2508.17600
13
citations
#292

Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Siyuan Yan, Ming Hu, Yiwen Jiang et al.

ICCV 2025 • Highlight • arXiv:2503.14911
13
citations
#293

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Xinyu Fang, Zhijian Chen, Kai Lan et al.

ICCV 2025 • arXiv:2503.14478
13
citations
#294

Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective

Hoang Phan, Tung Lam Tran, Quyen Tran et al.

ICCV 2025 • Highlight • arXiv:2211.13723
13
citations
#295

Learning Few-Step Diffusion Models by Trajectory Distribution Matching

Yihong Luo, Tianyang Hu, Jiacheng Sun et al.

ICCV 2025 • arXiv:2503.06674
13
citations
#296

CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance

Jinming Li, Yichen Zhu, Zhibin Tang et al.

ICCV 2025
13
citations
#297

MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation

Pingrui Zhang, Xianqiang Gao, Yuhan Wu et al.

ICCV 2025 • arXiv:2503.11081
13
citations
#298

RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation

Feng Yan, Fanfan Liu, Yiyang Huang et al.

ICCV 2025 • arXiv:2412.07215
13
citations
#299

XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

Yuedong Tan, Zongwei Wu, Yuqian Fu et al.

ICCV 2025 • arXiv:2405.17773
13
citations
#300

CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction

Yuanyuan Gao, Hao Li, Jiaqi Chen et al.

ICCV 2025 • arXiv:2503.23044
12
citations
#301

Mobile Video Diffusion

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas et al.

ICCV 2025 • arXiv:2412.07583
12
citations
#302

CharaConsist: Fine-Grained Consistent Character Generation

Mengyu Wang, Henghui Ding, Jianing Peng et al.

ICCV 2025 • arXiv:2507.11533
12
citations
#303

An OpenMind for 3D Medical Vision Self-supervised Learning

Tassilo Wald, Constantin Ulrich, Jonathan Suprijadi et al.

ICCV 2025 • arXiv:2412.17041
12
citations
#304

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Xi Fang, Jiankun Wang, Xiaochen Cai et al.

ICCV 2025 • arXiv:2411.11098
12
citations
#305

LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Haiwen Huang, Anpei Chen, Volodymyr Havrylov et al.

ICCV 2025 • arXiv:2504.14032
12
citations
#306

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

Daniel Winter, Asaf Shul, Matan Cohen et al.

ICCV 2025 • Highlight • arXiv:2412.08645
12
citations
#307

MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul et al.

ICCV 2025 • arXiv:2410.10780
12
citations
#308

SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning

Zhewei Dai, Shilei Zeng, Haotian Liu et al.

ICCV 2025 • arXiv:2410.14987
12
citations
#309

Effective Training Data Synthesis for Improving MLLM Chart Understanding

Yuwei Yang, Zeyu Zhang, Yunzhong Hou et al.

ICCV 2025 • arXiv:2508.06492
12
citations
#310

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

Zhijian Huang, Chengjian Feng, Baihui Xiao et al.

ICCV 2025 • arXiv:2412.07689
12
citations
#311

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Lin Sun, Jiale Cao, Jin Xie et al.

ICCV 2025 • arXiv:2411.13836
12
citations
#312

MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Ziyan Guo, Zeyu Hu, Na Zhao et al.

ICCV 2025 • arXiv:2502.02358
12
citations
#313

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Ming Li, Xin Gu, Fan Chen et al.

ICCV 2025 • arXiv:2505.02370
12
citations
#314

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

Tianyuan Qu, Longxiang Tang, Bohao Peng et al.

ICCV 2025 • arXiv:2503.12496
12
citations
#315

FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Yabo Zhang, Xinpeng Zhou, Yihan Zeng et al.

ICCV 2025 • arXiv:2501.08225
12
citations
#316

DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Yatian Pang, Bin Zhu, Bin Lin et al.

ICCV 2025 • arXiv:2412.00397
12
citations
#317

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Tsu-Jui Fu, Yusu Qian, Chen Chen et al.

ICCV 2025 • arXiv:2503.12652
12
citations
#318

EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

Shengqi Dang, Yi He, Long Ling et al.

ICCV 2025 • arXiv:2501.05710
12
citations
#319

How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

Chirui Chang, Jiahui Liu, Zhengzhe Liu et al.

ICCV 2025 • arXiv:2406.19568
12
citations
#320

HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder

Yingqi Tang, Zhuoran Xu, Zhaotie Meng et al.

ICCV 2025 • arXiv:2503.08612
12
citations
#321

3D Mesh Editing using Masked LRMs

William Gao, Dilin Wang, Yuchen Fan et al.

ICCV 2025 • arXiv:2412.08641
12
citations
#322

Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

Jixuan Fan, Wanhua Li, Yifei Han et al.

ICCV 2025 • arXiv:2412.04887
12
citations
#323

Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Bowen Zhang, Sicheng Xu, Chuxin Wang et al.

ICCV 2025 • arXiv:2507.23785
12
citations
#324

ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

Yulin Pan, Xiangteng He, Chaojie Mao et al.

ICCV 2025 • arXiv:2503.14482
12
citations
#325

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

Yue Fan, Xiaojian Ma, Rongpeng Su et al.

ICCV 2025 • Highlight • arXiv:2501.00358
12
citations
#326

SplatTalk: 3D VQA with Gaussian Splatting

Anh Thai, Kyle Genova, Songyou Peng et al.

ICCV 2025 • arXiv:2503.06271
12
citations
#327

Rectifying Magnitude Neglect in Linear Attention

Qihang Fan, Huaibo Huang, Yuang Ai et al.

ICCV 2025 • Highlight • arXiv:2507.00698
11
citations
#328

LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

Alessio Spagnoletti, Jean Prost, Andres Almansa et al.

ICCV 2025 • arXiv:2503.12615
11
citations
#329

Synthetic Video Enhances Physical Fidelity in Video Synthesis

Qi Zhao, Xingyu Ni, Ziyu Wang et al.

ICCV 2025 • arXiv:2503.20822
11
citations
#330

DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

Hanling Zhang, Rundong Su, Zhihang Yuan et al.

ICCV 2025 • arXiv:2503.22796
11
citations
#331

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin et al.

ICCV 2025 • arXiv:2412.15191
11
citations
#332

SuperDec: 3D Scene Decomposition with Superquadrics Primitives

Elisabetta Fedele, Boyang Sun, Francis Engelmann et al.

ICCV 2025 • arXiv:2504.00992
11
citations
#333

Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

Yun Wang, Longguang Wang, Chenghao Zhang et al.

ICCV 2025 • Highlight • arXiv:2507.04631
11
citations
#334

Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Hyeonho Jeong, Suhyeon Lee, Jong Ye

ICCV 2025 • arXiv:2503.09151
11
citations
#335

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

Haoji Zhang, Yiqin Wang, Yansong Tang et al.

ICCV 2025 • arXiv:2506.23825
11
citations
#336

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

Renshan Zhang, Rui Shao, Gongwei Chen et al.

ICCV 2025 • arXiv:2501.16297
11
citations
#337

Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Katja Schwarz, Norman Müller, Peter Kontschieder

ICCV 2025 • arXiv:2503.13272
11
citations
#338

X-Dancer: Expressive Music to Human Dance Video Generation

Zeyuan Chen, Hongyi Xu, Guoxian Song et al.

ICCV 2025 • Highlight • arXiv:2502.17414
11
citations
#339

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Tatiana Zemskova, Dmitry Yudin

ICCV 2025 • arXiv:2412.18450
11
citations
#340

Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.

ICCV 2025 • arXiv:2411.14401
11
citations
#341

ViSpeak: Visual Instruction Feedback in Streaming Videos

Shenghao Fu, Qize Yang, Yuan-Ming Li et al.

ICCV 2025 • arXiv:2503.12769
11
citations
#342

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence

Jie Feng, Shengyuan Wang, Tianhui Liu et al.

ICCV 2025 • arXiv:2506.23219
11
citations
#343

Dense Policy: Bidirectional Autoregressive Learning of Actions

Yue Su, Xinyu Zhan, Hongjie Fang et al.

ICCV 2025 • arXiv:2503.13217
11
citations
#344

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

Fating Hong, Zunnan Xu, Zixiang Zhou et al.

ICCV 2025 • arXiv:2504.02542
11
citations
#345

ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones

Anurag Ghosh, Shen Zheng, Robert Tamburo et al.

ICCV 2025 • arXiv:2406.07661
11
citations
#346

Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling

Fengxiang Wang, Hongzhen Wang, Di Wang et al.

ICCV 2025 • arXiv:2406.11933
11
citations
#347

SILO: Solving Inverse Problems with Latent Operators

Ron Raphaeli, Sean Man, Michael Elad

ICCV 2025 • arXiv:2501.11746
11
citations
#348

Contrastive Flow Matching

George Stoica, Vivek Ramanujan, Xiang Fan et al.

ICCV 2025 • arXiv:2506.05350
11
citations
#349

GENMO: A GENeralist Model for Human MOtion

Jiefeng Li, Jinkun Cao, Haotian Zhang et al.

ICCV 2025 • Highlight • arXiv:2505.01425
10
citations
#350

MINERVA: Evaluating Complex Video Reasoning

Arsha Nagrani, Sachit Menon, Ahmet Iscen et al.

ICCV 2025 • arXiv:2505.00681
10
citations
#351

RANKCLIP: Ranking-Consistent Language-Image Pretraining

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.

ICCV 2025 • arXiv:2404.09387
10
citations
#352

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Kaichen Zhang, Yifei Shen, Bo Li et al.

ICCV 2025 • arXiv:2411.14982
10
citations
#353

Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives

Ziyu Zhang, Binbin Huang, Hanqing Jiang et al.

ICCV 2025 • arXiv:2411.16392
10
citations
#354

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Yatai Ji, Jiacheng Zhang, Jie Wu et al.

ICCV 2025 • arXiv:2412.15156
10
citations
#355

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Yuci Liang, Xinheng Lyu, Meidan Ding et al.

ICCV 2025 • arXiv:2412.02141
10
citations
#356

Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

Shuyu Yang, Yaxiong Wang, Li Zhu et al.

ICCV 2025 • Highlight • arXiv:2411.17776
10
citations
#357

MagicColor: Multi-instance Sketch Colorization

Yinhan Zhang, Yue Ma, Bingyuan Wang et al.

ICCV 2025
10
citations
#358

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al.

ICCV 2025 • arXiv:2501.02135
10
citations
#359

Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation

Yuxuan Wang, Xuanyu Yi, Haohan Weng et al.

ICCV 2025 • arXiv:2501.14317
10
citations
#360

Decoupled Diffusion Sparks Adaptive Scene Generation

Yunsong Zhou, Naisheng Ye, William Ljungbergh et al.

ICCV 2025 • arXiv:2504.10485
10
citations
#361

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Minghe Gao, Xuqi Liu, Zhongqi Yue et al.

ICCV 2025 • arXiv:2504.06606
10
citations
#362

Advancing Textual Prompt Learning with Anchored Attributes

Zheng Li, Yibing Song, Ming-Ming Cheng et al.

ICCV 2025 • arXiv:2412.09442
10
citations
#363

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

Xuying Zhang, Yutong Liu, Yangguang Li et al.

ICCV 2025 • arXiv:2412.16919
10
citations
#364

A Recipe for Generating 3D Worlds from a Single Image

Katja Schwarz, Denis Rozumny, Samuel Rota Bulò et al.

ICCV 2025 • arXiv:2503.16611
10
citations
#365

Spectral Image Tokenizer

Carlos Esteves, Mohammed Suhail, Ameesh Makadia

ICCV 2025 • arXiv:2412.09607
10
citations
#366

Make Me Happier: Evoking Emotions Through Image Diffusion Models

Qing Lin, Jingfeng Zhang, Yew-Soon Ong et al.

ICCV 2025 • arXiv:2403.08255
10
citations
#367

Visual Test-time Scaling for GUI Agent Grounding

Tiange Luo, Lajanugen Logeswaran, Justin Johnson et al.

ICCV 2025 • Highlight • arXiv:2505.00684
10
citations
#368

RoMo: Robust Motion Segmentation Improves Structure from Motion

Lily Goli, Sara Sabour, Mark Matthews et al.

ICCV 2025 • arXiv:2411.18650
10
citations
#369

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Yikang Zhou, Tao Zhang, Shilin Xu et al.

ICCV 2025 • arXiv:2501.04670
10
citations
#370

WildSAT: Learning Satellite Image Representations from Wildlife Observations

Rangel Daroya, Elijah Cole, Oisin Mac Aodha et al.

ICCV 2025 • arXiv:2412.14428
10
citations
#371

PanSt3R: Multi-view Consistent Panoptic Segmentation

Lojze Zust, Yohann Cabon, Juliette Marrie et al.

ICCV 2025 • arXiv:2506.21348
10
citations
#372

OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration

Yiming Zuo, Willow Yang, Zeyu Ma et al.

ICCV 2025 • arXiv:2411.19278
10
citations
#373

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Shijie Ma, Yuying Ge, Teng Wang et al.

ICCV 2025 • arXiv:2503.19480
10
citations
#374

No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Ranran Huang, Krystian Mikolajczyk

ICCV 2025 • Highlight • arXiv:2508.01171
10
citations
#375

Di[M]O: Distilling Masked Diffusion Models into One-step Generator

Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière et al.

ICCV 2025
10
citations
#376

TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Ruijie Lu, Yixin Chen, Yu Liu et al.

ICCV 2025 • arXiv:2503.12049
10
citations
#377

Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

Du Chen, Liyi Chen, Zhengqiang Zhang et al.

ICCV 2025 • arXiv:2501.06838
10
citations
#378

RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

Yifei Feng, Mx Yang, Shuhui Yang et al.

ICCV 2025 • arXiv:2503.19011
10
citations
#379

SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

Rohit Gandikota, Zongze Wu, Richard Zhang et al.

ICCV 2025 • arXiv:2502.01639
10
citations
#380

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

Zijie Wu, Chaohui Yu, Fan Wang et al.

ICCV 2025 • arXiv:2506.09982
10
citations
#381

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

Donald Shenaj, Ondrej Bohdal, Mete Ozay et al.

ICCV 2025 • arXiv:2412.05148
9
citations
#382

Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

Qiaosi Yi, Shuai Li, Rongyuan Wu et al.

ICCV 2025 • Highlight • arXiv:2507.20291
9
citations
#383

DIVE: Taming DINO for Subject-Driven Video Editing

Yi Huang, Wei Xiong, He Zhang et al.

ICCV 2025 • arXiv:2412.03347
9
citations
#384

NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

Zhixi Cai, Fucai Ke, Simindokht Jahangard et al.

ICCV 2025 • arXiv:2502.00372
9
citations
#385

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Jiashuo Yu, Yue Wu, Meng Chu et al.

ICCV 2025 • arXiv:2506.10857
9
citations
#386

HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model

Tao Wang, Changxu Cheng, Lingfeng Wang et al.

ICCV 2025 • arXiv:2503.13026
9
citations
#387

Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

Jiwoo Chung, Sangeek Hyun, Hyunjun Kim et al.

ICCV 2025
9
citations
#388

FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

Shuai Tan, Bill Gong, Bin Ji et al.

ICCV 2025 • arXiv:2507.01390
9
citations
#389

GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering

Kai Ye, Chong Gao, Guanbin Li et al.

ICCV 2025 • arXiv:2410.24204
9
citations
#390

EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting

Xiaobao Wei, Qingpo Wuwu, Zhongyu Zhao et al.

ICCV 2025 • arXiv:2411.15582
9
citations
#391

VideoOrion: Tokenizing Object Dynamics in Videos

Yicheng Feng, Yijiang Li, Wanpeng Zhang et al.

ICCV 2025 • arXiv:2411.16156
9
citations
#392

NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Yanrui Bin, Wenbo Hu, Haoyuan Wang et al.

ICCV 2025 • arXiv:2504.11427
9
citations
#393

QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

Junyi Wu, Zhiteng Li, Zheng Hui et al.

ICCV 2025 • arXiv:2503.06545
9
citations
#394

Chimera: Improving Generalist Model with Domain-Specific Experts

Tianshuo Peng, Mingsheng Li, Jiakang Yuan et al.

ICCV 2025 • arXiv:2412.05983
9
citations
#395

MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning

Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj et al.

ICCV 2025 • arXiv:2504.06740
9
citations
#396

Knowledge Distillation with Refined Logits

Wujie Sun, Defang Chen, Siwei Lyu et al.

ICCV 2025 • arXiv:2408.07703
9
citations
#397

NeuralSVG: An Implicit Representation for Text-to-Vector Generation

Sagi Polaczek, Yuval Alaluf, Elad Richardson et al.

ICCV 2025 • arXiv:2501.03992
9
citations
#398

CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

Peng Chen, Pi Bu, Yingyao Wang et al.

ICCV 2025 • arXiv:2503.09527
9
citations
#399

HUMOTO: A 4D Dataset of Mocap Human Object Interactions

Jiaxin Lu, Chun-Hao Huang, Uttaran Bhattacharya et al.

ICCV 2025 • arXiv:2504.10414
9
citations
#400

Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Jingjing Ren, Wenbo Li, Zhongdao Wang et al.

ICCV 2025 • arXiv:2504.14470
9
citations