Most Cited ICCV Papers: "spatial click propagation"

2,701 papers found • Page 1 of 14

#1

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu et al.

ICCV 2025 • arXiv:2411.10440
360
citations
#2

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025 • arXiv:2503.01785
357
citations
#3

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan et al.

ICCV 2025 • arXiv:2503.10615
265
citations
#4

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Yuzhang Shang, Mu Cai, Bingxin Xu et al.

ICCV 2025 • arXiv:2403.15388
234
citations
#5

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong et al.

ICCV 2025 • highlight • arXiv:2406.08035
229
citations
#6

OminiControl: Minimal and Universal Control for Diffusion Transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.

ICCV 2025 • highlight • arXiv:2411.15098
225
citations
#7

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.

ICCV 2025 • highlight • arXiv:2410.11831
223
citations
#8

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.

ICCV 2025 • arXiv:2503.12937
220
citations
#9

Shape of Motion: 4D Reconstruction from a Single Video

Qianqian Wang, Vickie Ye, Hang Gao et al.

ICCV 2025 • highlight • arXiv:2407.13764
186
citations
#10

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.

ICCV 2025 • arXiv:2503.07598
181
citations
#11

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Shengbang Tong, David Fan, Jiachen Zhu et al.

ICCV 2025 • arXiv:2412.14164
150
citations
#12

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

Chenming Zhu, Tai Wang, Wenwei Zhang et al.

ICCV 2025
127
citations
#13

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Quanfeng Lu, Wenqi Shao, Zitao Liu et al.

ICCV 2025 • arXiv:2406.08451
113
citations
#14

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Jianhong Bai, Menghan Xia, Xiao Fu et al.

ICCV 2025 • arXiv:2503.11647
110
citations
#15

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas et al.

ICCV 2025 • arXiv:2412.08629
101
citations
#16

Randomized Autoregressive Visual Generation

Qihang Yu, Ju He, Xueqing Deng et al.

ICCV 2025 • arXiv:2411.00776
93
citations
#17

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Gaojie Lin, Jianwen Jiang, Jiaqi Yang et al.

ICCV 2025 • highlight • arXiv:2502.01061
91
citations
#18

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Shaojin Wu, Mengqi Huang, Wenxu Wu et al.

ICCV 2025 • arXiv:2504.02160
90
citations
#19

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun, Shuo Chen, Fangfu Liu et al.

ICCV 2025
89
citations
#20

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Jensen Zhou, Hang Gao, Vikram Voleti et al.

ICCV 2025 • arXiv:2503.14489
87
citations
#21

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou et al.

ICCV 2025 • arXiv:2504.10483
85
citations
#22

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

ICCV 2025 • arXiv:2412.03632
76
citations
#23

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025 • arXiv:2503.07027
75
citations
#24

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Shiduo Zhang, Zhe Xu, Peiju Liu et al.

ICCV 2025 • arXiv:2412.18194
74
citations
#25

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.

ICCV 2025 • arXiv:2501.04003
74
citations
#26

MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization

Yiwen Chen, Yikai Wang, Yihao Luo et al.

ICCV 2025 • arXiv:2408.02555
72
citations
#27

GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang et al.

ICCV 2025 • highlight • arXiv:2501.08325
70
citations
#28

HPSv3: Towards Wide-Spectrum Human Preference Score

Yuhang Ma, Keqiang Sun, Xiaoshi Wu et al.

ICCV 2025 • arXiv:2508.03789
69
citations
#29

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.

ICCV 2025 • arXiv:2503.19755
66
citations
#30

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang et al.

ICCV 2025 • arXiv:2412.07825
65
citations
#31

StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.

ICCV 2025 • arXiv:2312.12491
64
citations
#32

Long Context Tuning for Video Generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.

ICCV 2025 • arXiv:2503.10589
60
citations
#33

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Xuemeng Yang, Licheng Wen, Tiantian Wei et al.

ICCV 2025 • arXiv:2408.00415
60
citations
#34

Golden Noise for Diffusion Models: A Learning Framework

Zikai Zhou, Shitong Shao, Lichen Bai et al.

ICCV 2025 • arXiv:2411.09502
59
citations
#35

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li et al.

ICCV 2025 • highlight • arXiv:2502.11079
59
citations
#36

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Qi Qin, Le Zhuo, Yi Xin et al.

ICCV 2025 • arXiv:2503.21758
58
citations
#37

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Chen Ziwen, Hao Tan, Kai Zhang et al.

ICCV 2025 • highlight • arXiv:2410.12781
58
citations
#38

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Shuangrui Ding, Rui Qian, Xiaoyi Dong et al.

ICCV 2025 • arXiv:2410.16268
56
citations
#39

End-to-End Driving with Online Trajectory Evaluation via BEV World Model

Yingyan Li, Yuqi Wang, Yang Liu et al.

ICCV 2025 • arXiv:2504.01941
55
citations
#40

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

Wangbo Yu, Chaoran Feng, Jianing Li et al.

ICCV 2025 • arXiv:2405.20224
54
citations
#41

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Sucheng Ren, Qihang Yu, Ju He et al.

ICCV 2025 • arXiv:2502.20388
53
citations
#42

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025 • arXiv:2504.16072
53
citations
#43

MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Ruiyuan Gao, Kai Chen, Bo Xiao et al.

ICCV 2025 • arXiv:2411.13807
52
citations
#44

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang

ICCV 2025 • arXiv:2412.18607
51
citations
#45

Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

Chongjie Ye, Yushuang Wu, Ziteng Lu et al.

ICCV 2025 • arXiv:2503.22236
50
citations
#46

Aether: Geometric-Aware Unified World Modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou et al.

ICCV 2025 • arXiv:2503.18945
50
citations
#47

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel et al.

ICCV 2025 • arXiv:2504.11171
50
citations
#48

Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Yongxin Zhu, Bocheng Li, Yifei Xin et al.

ICCV 2025 • arXiv:2411.02038
49
citations
#49

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Xingyu Chen, Yue Chen, Yuliang Xiu et al.

ICCV 2025 • arXiv:2503.24391
48
citations
#50

WorldScore: Unified Evaluation Benchmark for World Generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen et al.

ICCV 2025
46
citations
#51

St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

Haiwen Feng, Junyi Zhang, Qianqian Wang et al.

ICCV 2025 • arXiv:2504.13152
46
citations
#52

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Qizhe Zhang, Aosong Cheng, Ming Lu et al.

ICCV 2025 • arXiv:2412.01818
45
citations
#53

Learning 4D Embodied World Models

Haoyu Zhen, Qiao Sun, Hongxin Zhang et al.

ICCV 2025 • arXiv:2504.20995
45
citations
#54

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Kumara Kahatapitiya, Haozhe Liu, Sen He et al.

ICCV 2025 • arXiv:2411.02397
45
citations
#55

UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Junjie He, Yifeng Geng, Liefeng Bo

ICCV 2025 • arXiv:2408.05939
44
citations
#56

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

Hao He, Ceyuan Yang, Shanchuan Lin et al.

ICCV 2025 • arXiv:2503.10592
44
citations
#57

QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning

Haoxuan Wang, Yuzhang Shang, Zhihang Yuan et al.

ICCV 2025 • arXiv:2402.03666
44
citations
#58

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Chunwei Wang, Guansong Lu, Junwei Yang et al.

ICCV 2025 • arXiv:2412.06673
44
citations
#59

PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang et al.

ICCV 2025 • arXiv:2503.17973
43
citations
#60

EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

Alexander Mai, Peter Hedman, George Kopanas et al.

ICCV 2025 • arXiv:2410.01804
43
citations
#61

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Zhibo Yang, Jun Tang, Zhaohai Li et al.

ICCV 2025 • arXiv:2412.02210
43
citations
#62

YOLOE: Real-Time Seeing Anything

Ao Wang, Lihao Liu, Hui Chen et al.

ICCV 2025 • arXiv:2503.07465
43
citations
#63

Scaling Language-Free Visual Representation Learning

David Fan, Shengbang Tong, Jiachen Zhu et al.

ICCV 2025 • highlight • arXiv:2504.01017
41
citations
#64

SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Junsong Chen, Shuchen Xue, Yuyang Zhao et al.

ICCV 2025 • highlight • arXiv:2503.09641
41
citations
#65

PartField: Learning 3D Feature Fields for Part Segmentation and Beyond

Minghua Liu, Mikaela Uy, Donglai Xiang et al.

ICCV 2025 • arXiv:2504.11451
41
citations
#66

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Zhi Hou, Tianyi Zhang, Yuwen Xiong et al.

ICCV 2025 • arXiv:2503.19757
40
citations
#67

Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina et al.

ICCV 2025 • highlight • arXiv:2504.07961
40
citations
#68

Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

Li Hu, Wang Yuan, Zhen Shen et al.

ICCV 2025 • arXiv:2502.06145
40
citations
#69

Human-Object Interaction from Human-Level Instructions

Zhen Wu, Jiaman Li, Pei Xu et al.

ICCV 2025 • arXiv:2406.17840
39
citations
#70

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Yun Li, Yiming Zhang, Tao Lin et al.

ICCV 2025 • arXiv:2503.23765
38
citations
#71

Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting

Yuqi Li, Chuanguang Yang, Hansheng Zeng et al.

ICCV 2025 • arXiv:2507.02939
38
citations
#72

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Wanshui Gan, Fang Liu, Hongbin Xu et al.

ICCV 2025 • arXiv:2408.11447
38
citations
#73

From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu et al.

ICCV 2025 • arXiv:2503.06923
37
citations
#74

FaceXFormer: A Unified Transformer for Facial Analysis

Kartik Narayan, Vibashan VS, Rama Chellappa et al.

ICCV 2025 • arXiv:2403.12960
37
citations
#75

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu, Wenwei Zhang, Lumin Xu et al.

ICCV 2025 • arXiv:2503.21979
37
citations
#76

SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

Xianglong He, Zi-Xin Zou, Chia Hao Chen et al.

ICCV 2025 • arXiv:2503.21732
36
citations
#77

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Erik Daxberger, Nina Wenzel, David Griffiths et al.

ICCV 2025 • arXiv:2503.13111
36
citations
#78

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang et al.

ICCV 2025 • arXiv:2412.03859
36
citations
#79

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang, Siwei Wen, Zichen Wen et al.

ICCV 2025 • highlight • arXiv:2503.15264
35
citations
#80

Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

Yifei Xia, Suhan Ling, Fangcheng Fu et al.

ICCV 2025 • arXiv:2502.21079
35
citations
#81

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Mark Yu, Wenbo Hu, Jinbo Xing et al.

ICCV 2025 • arXiv:2503.05638
35
citations
#82

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Ruowen Zhao, James Jun Liang Chen Ye, Zhengyi Wang et al.

ICCV 2025 • arXiv:2503.15265
35
citations
#83

InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

Yifan Lu, Xuanchi Ren, Jiawei Yang et al.

ICCV 2025 • arXiv:2412.03934
35
citations
#84

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Ahmed Nassar, Matteo Omenetti, Maksym Lysak et al.

ICCV 2025 • arXiv:2503.11576
34
citations
#85

A₀: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

Rongtao Xu, Jian Zhang, Minghao Guo et al.

ICCV 2025 • arXiv:2504.12636
34
citations
#86

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Lukas Höllein, Aljaz Bozic, Michael Zollhöfer et al.

ICCV 2025 • arXiv:2409.12892
34
citations
#87

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Haochen Wang, Yucheng Zhao, Tiancai Wang et al.

ICCV 2025 • arXiv:2504.01901
33
citations
#88

Improved Noise Schedule for Diffusion Training

Tiankai Hang, Shuyang Gu, Jianmin Bao et al.

ICCV 2025 • arXiv:2407.03297
33
citations
#89

Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

Taowen Wang, Cheng Han, James Liang et al.

ICCV 2025 • arXiv:2411.13587
33
citations
#90

Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks

Hailong Guo, Bohan Zeng, Yiren Song et al.

ICCV 2025 • arXiv:2501.15891
32
citations
#91

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Le Zhuo, Liangbing Zhao, Sayak Paul et al.

ICCV 2025 • arXiv:2504.16080
32
citations
#92

World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

Yupeng Zheng, Pengxuan Yang, Zebin Xing et al.

ICCV 2025 • arXiv:2507.00603
31
citations
#93

Long-Context State-Space Video World Models

Ryan Po, Yotam Nitzan, Richard Zhang et al.

ICCV 2025 • arXiv:2505.20171
31
citations
#94

VCA: Video Curious Agent for Long Video Understanding

Zeyuan Yang, Delin Chen, Xueyang Yu et al.

ICCV 2025 • arXiv:2412.10471
31
citations
#95

VSSD: Vision Mamba with Non-Causal State Space Duality

Yuheng Shi, Mingjia Li, Minjing Dong et al.

ICCV 2025 • arXiv:2407.18559
30
citations
#96

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Hao Fang, Jiawei Kong, Wenbo Yu et al.

ICCV 2025 • arXiv:2406.05491
30
citations
#97

RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control

Teng Li, Guangcong Zheng, Rui Jiang et al.

ICCV 2025 • arXiv:2502.10059
30
citations
#98

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.

ICCV 2025 • arXiv:2501.04931
30
citations
#99

Epona: Autoregressive Diffusion World Model for Autonomous Driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu et al.

ICCV 2025 • arXiv:2506.24113
30
citations
#100

Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Dongwon Kim, Ju He, Qihang Yu et al.

ICCV 2025 • arXiv:2501.07730
30
citations
#101

MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao, Shunlin Lu, Huaijin Pi et al.

ICCV 2025 • arXiv:2503.15451
29
citations
#102

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

Xin Zhou, Dingkang Liang, Sifan Tu et al.

ICCV 2025 • arXiv:2501.14729
29
citations
#103

Bolt3D: Generating 3D Scenes in Seconds

Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan et al.

ICCV 2025 • arXiv:2503.14445
29
citations
#104

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Liming Jiang, Qing Yan, Yumin Jia et al.

ICCV 2025 • highlight • arXiv:2503.16418
29
citations
#105

VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Haiping Wang, Yuan Liu, Ziwei Liu et al.

ICCV 2025 • arXiv:2410.16892
28
citations
#106

SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Chun-Han Yao, Yiming Xie, Vikram Voleti et al.

ICCV 2025 • arXiv:2503.16396
28
citations
#107

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

Yiren Song, Danze Chen, Mike Zheng Shou

ICCV 2025 • arXiv:2502.01105
28
citations
#108

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu, Xilin Wang, Yixuan Li et al.

ICCV 2025 • highlight • arXiv:2507.04047
28
citations
#109

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Rui Xie, Yinhong Liu, Penghao Zhou et al.

ICCV 2025 • arXiv:2501.02976
27
citations
#110

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Yuxuan Cai, Jiangning Zhang, Haoyang He et al.

ICCV 2025 • arXiv:2410.16236
27
citations
#111

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.

ICCV 2025 • highlight • arXiv:2411.19325
27
citations
#112

KV-Edit: Training-Free Image Editing for Precise Background Preservation

Tianrui Zhu, Shiyi Zhang, Jiawei Shao et al.

ICCV 2025 • arXiv:2502.17363
27
citations
#113

EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

Zexuan Yan, Yue Ma, Chang Zou et al.

ICCV 2025 • arXiv:2503.10270
27
citations
#114

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Kyle Sargent, Kyle Hsu, Justin Johnson et al.

ICCV 2025 • arXiv:2503.11056
27
citations
#115

CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

Jungdae Lee, Taiki Miyanishi, Shuhei Kurita et al.

ICCV 2025 • arXiv:2406.14240
27
citations
#116

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Rongchang Xie, Chen Du, Ping Song et al.

ICCV 2025 • arXiv:2411.17762
27
citations
#117

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Guosheng Zhao, Xiaofeng Wang, Chaojun Ni et al.

ICCV 2025 • arXiv:2503.18438
26
citations
#118

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Yufei Zhan, Shurong Zheng, Yousong Zhu et al.

ICCV 2025 • arXiv:2403.09333
26
citations
#119

Scaling Laws for Native Multimodal Models

Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa et al.

ICCV 2025 • arXiv:2504.07951
26
citations
#120

REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment

Haonan Han, Rui Yang, Huan Liao et al.

ICCV 2025 • arXiv:2405.18525
26
citations
#121

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Tianwei Xiong, Jun Hao Liew, Zilong Huang et al.

ICCV 2025 • arXiv:2504.08736
26
citations
#122

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Yuxuan Luo, Zhengkun Rong, Lizhen Wang et al.

ICCV 2025 • arXiv:2504.01724
26
citations
#123

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

Walid Bousselham, Angie Boggust, Sofian Chaybouti et al.

ICCV 2025 • arXiv:2404.03214
25
citations
#124

RadGPT: Constructing 3D Image-Text Tumor Datasets

Pedro Bassi, Mehmet Yavuz, Ibrahim Ethem Hamamci et al.

ICCV 2025 • arXiv:2501.04678
25
citations
#125

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu, Kun Yuan, Yaling Shen et al.

ICCV 2025 • arXiv:2411.15421
25
citations
#126

VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

Zhangquan Chen, Xufang Luo, Dongsheng Li

ICCV 2025 • arXiv:2503.07523
25
citations
#127

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Yiwu Zhong, Zhuoming Liu, Yin Li et al.

ICCV 2025 • arXiv:2412.03248
24
citations
#128

CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

Zhefei Gong, Pengxiang Ding, Shangke Lyu et al.

ICCV 2025 • arXiv:2412.06782
24
citations
#129

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Xianfu Cheng, Wei Zhang, Shiwei Zhang et al.

ICCV 2025 • arXiv:2502.13059
24
citations
#130

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Yunqiu Xu, Linchao Zhu, Yi Yang

ICCV 2025 • arXiv:2410.12332
24
citations
#131

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Zhen Xing, Qi Dai, Zejia Weng et al.

ICCV 2025 • arXiv:2406.06465
24
citations
#132

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Rongyao Fang, Chengqi Duan, Kun Wang et al.

ICCV 2025 • arXiv:2410.13861
24
citations
#133

MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes

Xinjie Zhang, Zhening Liu, Yifan Zhang et al.

ICCV 2025 • highlight • arXiv:2410.13613
24
citations
#134

FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Tianyu Fu, Tengxuan Liu, Qinghao Han et al.

ICCV 2025 • arXiv:2501.01986
24
citations
#135

Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

Edgar Sucar, Zihang Lai, Eldar Insafutdinov et al.

ICCV 2025 • highlight • arXiv:2503.16318
24
citations
#136

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Taekyung Ki, Dongchan Min, Gyeongsu Chae

ICCV 2025 • arXiv:2412.01064
24
citations
#137

Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

Ke Fan, Shunlin Lu, Minyue Dai et al.

ICCV 2025 • highlight • arXiv:2507.07095
23
citations
#138

Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Massimiliano Viola, Kevin Qu, Nando Metzger et al.

ICCV 2025 • arXiv:2412.13389
23
citations
#139

Scalable Ranked Preference Optimization for Text-to-Image Generation

Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata et al.

ICCV 2025 • arXiv:2410.18013
23
citations
#140

Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Tianhao Wu, Chuanxia Zheng, Frank Guan et al.

ICCV 2025 • arXiv:2503.13439
23
citations
#141

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina et al.

ICCV 2025 • arXiv:2411.19331
23
citations
#142

Dataset Distillation via the Wasserstein Metric

Haoyang Liu, Peiran Wang, Yijiang Li et al.

ICCV 2025 • arXiv:2311.18531
23
citations
#143

VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li, Philip Torr, Andrea Vedaldi et al.

ICCV 2025 • highlight • arXiv:2506.18903
23
citations
#144

DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

Junyu Chen, Dongyun Zou, Wenkun He et al.

ICCV 2025 • arXiv:2508.00413
23
citations
#145

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Yongkun Du, Zhineng Chen, Hongtao Xie et al.

ICCV 2025 • arXiv:2411.15858
23
citations
#146

Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.

ICCV 2025 • arXiv:2506.22139
23
citations
#147

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Ruining Li, Chuanxia Zheng, Christian Rupprecht et al.

ICCV 2025 • arXiv:2408.04631
22
citations
#148

Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Shufan Li, Konstantinos Kallidromitis, Akash Gokul et al.

ICCV 2025 • arXiv:2503.12271
22
citations
#149

MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

Yuechen Zhang, YaoYang Liu, Bin Xia et al.

ICCV 2025 • arXiv:2501.03931
22
citations
#150

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

Yi Chen, Yuying Ge, Weiliang Tang et al.

ICCV 2025 • arXiv:2412.04445
22
citations
#151

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao, Zhengyuan Yang, Linjie Li et al.

ICCV 2025 • arXiv:2503.19312
22
citations
#152

MM-IFEngine: Towards Multimodal Instruction Following

Shengyuan Ding, Wu Shenxi, Xiangyu Zhao et al.

ICCV 2025 • arXiv:2504.07957
22
citations
#153

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Boyu Chen, Zhengrong Yue, Siran Chen et al.

ICCV 2025 • arXiv:2503.10200
22
citations
#154

MotionFollower: Editing Video Motion via Score-Guided Diffusion

Shuyuan Tu, Qi Dai, Zihao Zhang et al.

ICCV 2025
22
citations
#155

IRASim: A Fine-Grained World Model for Robot Manipulation

Fangqi Zhu, Hongtao Wu, Song Guo et al.

ICCV 2025 • arXiv:2406.14540
22
citations
#156

CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

Yihan Cao, Jiazhao Zhang, Zhinan Yu et al.

ICCV 2025 • arXiv:2412.10439
22
citations
#157

Radiant Foam: Real-Time Differentiable Ray Tracing

Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi et al.

ICCV 2025 • highlight • arXiv:2502.01157
22
citations
#158

Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Tobias Kirschstein, Javier Romero, Artem Sevastopolsky et al.

ICCV 2025 • arXiv:2502.20220
22
citations
#159

FonTS: Text Rendering With Typography and Style Controls

Wenda Shi, Yiren Song, Dengming Zhang et al.

ICCV 2025 • arXiv:2412.00136
22
citations
#160

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

Weiming Ren, Wentao Ma, Huan Yang et al.

ICCV 2025 • arXiv:2503.11579
22
citations
#161

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Jiazhe Guo, Yikang Ding, Xiwu Chen et al.

ICCV 2025 • arXiv:2503.15208
22
citations
#162

VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

Yating Wang, Haoyi Zhu, Mingyu Liu et al.

ICCV 2025 • arXiv:2507.01016
21
citations
#163

Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Weitai Kang, Haifeng Huang, Yuzhang Shang et al.

ICCV 2025 • arXiv:2410.00255
21
citations
#164

OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

Ding Zhong, Xu Zheng, Chenfei Liao et al.

ICCV 2025 • highlight • arXiv:2503.07098
21
citations
#165

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Ma Teng, Xiaojun Jia, Ranjie Duan et al.

ICCV 2025 • arXiv:2412.05934
21
citations
#166

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Mark Endo, Xiaohan Wang, Serena Yeung-Levy

ICCV 2025 • arXiv:2412.13180
21
citations
#167

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Yujie Zhou, Jiazi Bu, Pengyang Ling et al.

ICCV 2025 • arXiv:2502.08590
21
citations
#168

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Yue Li, Qi Ma, Runyi Yang et al.

ICCV 2025 • arXiv:2503.18052
21
citations
#169

STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen et al.

ICCV 2025 • arXiv:2412.07730
21
citations
#170

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Weixian Lei, Jiacong Wang, Haochen Wang et al.

ICCV 2025 • highlight • arXiv:2504.10462
21
citations
#171

MMAD: Multi-label Micro-Action Detection in Videos

Kun Li, Pengyu Liu, Dan Guo et al.

ICCV 2025 • arXiv:2407.05311
21
citations
#172

V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

Zewei Zhou, Hao Xiang, Zhaoliang Zheng et al.

ICCV 2025 • arXiv:2412.01812
21
citations
#173

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Zhong-Yu Li, Ruoyi Du, Juncheng Yan et al.

ICCV 2025 • arXiv:2504.07960
21
citations
#174

Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective

Yingyu Liang, Zhizhou Sha, Zhenmei Shi et al.

ICCV 2025 • arXiv:2405.16418
21
citations
#175

RayZer: A Self-supervised Large View Synthesis Model

Hanwen Jiang, Hao Tan, Peng Wang et al.

ICCV 2025 • arXiv:2505.00702
21
citations
#176

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Tian-Xing Xu, Xiangjun Gao, Wenbo Hu et al.

ICCV 2025 • arXiv:2504.01016
20
citations
#177

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Yuseung Lee, Jihyeon Je, Chanho Park et al.

ICCV 2025 • arXiv:2504.17207
20
citations
#178

Video-T1: Test-time Scaling for Video Generation

Fangfu Liu, Hanyang Wang, Yimo Cai et al.

ICCV 2025 • arXiv:2503.18942
20
citations
#179

Towards a Unified Copernicus Foundation Model for Earth Vision

Yi Wang, Zhitong Xiong, Chenying Liu et al.

ICCV 2025 • arXiv:2503.11849
20
citations
#180

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.

ICCV 2025 • arXiv:2504.15485
20
citations
#181

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Chaojun Ni, Xiaofeng Wang, Zheng Zhu et al.

ICCV 2025 • arXiv:2504.02261
20
citations
#182

SynCity: Training-Free Generation of 3D Cities

Paul Engstler, Aleksandar Shtedritski, Iro Laina et al.

ICCV 2025
20
citations
#183

Revelio: Interpreting and leveraging semantic information in diffusion models

Dahye Kim, Xavier Thomas, Deepti Ghadiyaram

ICCV 2025 • arXiv:2411.16725
20
citations
#184

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Quanhao Li, Zhen Xing, Rui Wang et al.

ICCV 2025 • arXiv:2503.16421
20
citations
#185

PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology

Fatemeh Ghezloo, Saygin Seyfioglu, Rustin Soraki et al.

ICCV 2025 • arXiv:2502.08916
20
citations
#186

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction

Yuhui Wu, Liyi Chen, Ruibin Li et al.

ICCV 2025 • arXiv:2503.20287
20
citations
#187

VMBench: A Benchmark for Perception-Aligned Video Motion Generation

Xinran Ling, Chen Zhu, Meiqi Wu et al.

ICCV 2025 • arXiv:2503.10076
20
citations
#188

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Tianming Liang, Kun-Yu Lin, Chaolei Tan et al.

ICCV 2025 • arXiv:2501.14607
19
citations
#189

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Zhisheng Zhong, Chengyao Wang, Yuqi Liu et al.

ICCV 2025 • arXiv:2412.09501
19
citations
#190

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Xiyao Wang, Zhengyuan Yang, Linjie Li et al.

ICCV 2025 • arXiv:2412.03704
19
citations
#191

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Haonan Qiu, Shiwei Zhang, Yujie Wei et al.

ICCV 2025 • arXiv:2412.09626
19
citations
#192

NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments

Xuan Yao, Junyu Gao, Changsheng Xu

ICCV 2025 • arXiv:2506.23468
19
citations
#193

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

Danila Rukhovich, Elona Dupont, Dimitrios Mallis et al.

ICCV 2025 • arXiv:2412.14042
19
citations
#194

Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Junming Liu, Siyuan Meng, Yanting Gao et al.

ICCV 2025 • arXiv:2503.12972
19
citations
#195

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Artem Zholus, Carl Doersch, Yi Yang et al.

ICCV 2025 • arXiv:2504.05579
19
citations
#196

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Hengjia Li, Haonan Qiu, Shiwei Zhang et al.

ICCV 2025 • arXiv:2411.17048
19
citations
#197

Neighboring Autoregressive Modeling for Efficient Visual Generation

Yefei He, Yuanyu He, Shaoxuan He et al.

ICCV 2025 • arXiv:2503.10696
19
citations
#198

TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

Xiaowen Ma, Zhen-Liang Ni, Xinghao Chen

ICCV 2025 • arXiv:2411.17473
19
citations
#199

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Haiwen Diao, Xiaotong Li, Yufeng Cui et al.

ICCV 2025 • highlight • arXiv:2502.06788
19
citations
#200

Scalable Image Tokenization with Index Backpropagation Quantization

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge et al.

ICCV 2025 • arXiv:2412.02692
19
citations