Most Cited ICCV Spotlight "spatiotemporal dynamics modeling" Papers

2,701 papers found • Page 1 of 14

#1

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025 • poster • arXiv:2503.01785 • 347 citations
#2

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu et al.

ICCV 2025 • poster • arXiv:2411.10440 • 338 citations
#3

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

yi yang, Xiaoxuan He, Hongkun Pan et al.

ICCV 2025 • poster • arXiv:2503.10615 • 247 citations
#4

OminiControl: Minimal and Universal Control for Diffusion Transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.

ICCV 2025 • highlight • arXiv:2411.15098 • 214 citations
#5

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.

ICCV 2025 • highlight • arXiv:2410.11831 • 213 citations
#6

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, zehai he, Wenyi Hong et al.

ICCV 2025 • highlight • arXiv:2406.08035 • 208 citations
#7

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.

ICCV 2025 • poster • arXiv:2503.12937 • 206 citations
#8

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.

ICCV 2025 • poster • arXiv:2503.07598 • 169 citations
#9

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

CHENMING ZHU, Tai Wang, Wenwei Zhang et al.

ICCV 2025 • poster • 127 citations
#10

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Quanfeng Lu, Wenqi Shao, Zitao Liu et al.

ICCV 2025 • poster • arXiv:2406.08451 • 96 citations
#11

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun, Shuo Chen, Fangfu Liu et al.

ICCV 2025 • poster • 89 citations
#12

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

gaojie lin, Jianwen Jiang, Jiaqi Yang et al.

ICCV 2025 • highlight • arXiv:2502.01061 • 86 citations
#13

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Jensen Zhou, Hang Gao, Vikram Voleti et al.

ICCV 2025 • poster • arXiv:2503.14489 • 83 citations
#14

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

ICCV 2025 • poster • arXiv:2412.03632 • 73 citations
#15

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou et al.

ICCV 2025 • poster • arXiv:2504.10483 • 73 citations
#16

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.

ICCV 2025 • poster • arXiv:2501.04003 • 71 citations
#17

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025 • poster • arXiv:2503.07027 • 70 citations
#18

GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang et al.

ICCV 2025 • highlight • arXiv:2501.08325 • 63 citations
#19

StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.

ICCV 2025 • poster • arXiv:2312.12491 • 62 citations
#20

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.

ICCV 2025 • poster • arXiv:2503.19755 • 62 citations
#21

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Xuemeng Yang, Licheng Wen, Tiantian Wei et al.

ICCV 2025 • poster • arXiv:2408.00415 • 58 citations
#22

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Chen Ziwen, Hao Tan, Kai Zhang et al.

ICCV 2025 • highlight • arXiv:2410.12781 • 56 citations
#23

Long Context Tuning for Video Generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.

ICCV 2025 • poster • arXiv:2503.10589 • 56 citations
#24

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Shuangrui Ding, Rui Qian, Xiaoyi Dong et al.

ICCV 2025 • poster • arXiv:2410.16268 • 56 citations
#25

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li et al.

ICCV 2025 • highlight • arXiv:2502.11079 • 55 citations
#26

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

Wangbo Yu, Chaoran Feng, Jianing Li et al.

ICCV 2025 • poster • arXiv:2405.20224 • 53 citations
#27

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Qi Qin, Le Zhuo, Yi Xin et al.

ICCV 2025 • poster • arXiv:2503.21758 • 52 citations
#28

End-to-End Driving with Online Trajectory Evaluation via BEV World Model

Yingyan Li, Yuqi Wang, Yang Liu et al.

ICCV 2025 • poster • arXiv:2504.01941 • 50 citations
#29

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025 • poster • arXiv:2504.16072 • 49 citations
#30

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Sucheng Ren, Qihang Yu, Ju He et al.

ICCV 2025 • poster • arXiv:2502.20388 • 49 citations
#31

Aether: Geometric-Aware Unified World Modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou et al.

ICCV 2025 • poster • arXiv:2503.18945 • 47 citations
#32

WorldScore: Unified Evaluation Benchmark for World Generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen et al.

ICCV 2025 • poster • 46 citations
#33

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang

ICCV 2025 • poster • arXiv:2412.18607 • 44 citations
#34

MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Ruiyuan Gao, Kai Chen, Bo Xiao et al.

ICCV 2025 • poster • arXiv:2411.13807 • 44 citations
#35

Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Yongxin Zhu, Bocheng Li, Yifei Xin et al.

ICCV 2025 • poster • arXiv:2411.02038 • 44 citations
#36

UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Junjie He, Yifeng Geng, Liefeng Bo

ICCV 2025 • poster • arXiv:2408.05939 • 43 citations
#37

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Chunwei Wang, Guansong Lu, Junwei Yang et al.

ICCV 2025 • poster • arXiv:2412.06673 • 43 citations
#38

Learning 4D Embodied World Models

Haoyu Zhen, Qiao Sun, Hongxin Zhang et al.

ICCV 2025 • poster • arXiv:2504.20995 • 43 citations
#39

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Zhibo Yang, Jun Tang, Zhaohai Li et al.

ICCV 2025 • poster • arXiv:2412.02210 • 42 citations
#40

St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

Haiwen Feng, Junyi Zhang, Qianqian Wang et al.

ICCV 2025 • poster • arXiv:2504.13152 • 41 citations
#41

PartField: Learning 3D Feature Fields for Part Segmentation and Beyond

Minghua Liu, Mikaela Uy, Donglai Xiang et al.

ICCV 2025 • poster • arXiv:2504.11451 • 40 citations
#42

Scaling Language-Free Visual Representation Learning

David Fan, Shengbang Tong, Jiachen Zhu et al.

ICCV 2025 • highlight • arXiv:2504.01017 • 39 citations
#43

EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

Alexander Mai, Peter Hedman, George Kopanas et al.

ICCV 2025 • poster • arXiv:2410.01804 • 38 citations
#44

PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang et al.

ICCV 2025 • poster • arXiv:2503.17973 • 38 citations
#45

Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

Li Hu, wang yuan, Zhen Shen et al.

ICCV 2025 • poster • arXiv:2502.06145 • 38 citations
#46

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Wanshui Gan, Fang Liu, Hongbin Xu et al.

ICCV 2025 • poster • arXiv:2408.11447 • 37 citations
#47

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Qizhe Zhang, Aosong Cheng, Ming Lu et al.

ICCV 2025 • poster • arXiv:2412.01818 • 37 citations
#48

SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Junsong Chen, Shuchen Xue, Yuyang Zhao et al.

ICCV 2025 • highlight • arXiv:2503.09641 • 37 citations
#49

Human-Object Interaction from Human-Level Instructions

Zhen Wu, Jiaman Li, Pei Xu et al.

ICCV 2025 • poster • arXiv:2406.17840 • 36 citations
#50

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Yun Li, Yiming Zhang, Tao Lin et al.

ICCV 2025 • poster • arXiv:2503.23765 • 36 citations
#51

Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting

Yuqi Li, Chuanguang Yang, Hansheng Zeng et al.

ICCV 2025 • poster • arXiv:2507.02939 • 36 citations
#52

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Zhi Hou, Tianyi Zhang, Yuwen Xiong et al.

ICCV 2025 • poster • arXiv:2503.19757 • 36 citations
#53

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Mark YU, Wenbo Hu, Jinbo Xing et al.

ICCV 2025 • poster • arXiv:2503.05638 • 35 citations
#54

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Ruowen Zhao, James Jun Liang Chen Ye, Zhengyi Wang et al.

ICCV 2025 • poster • arXiv:2503.15265 • 35 citations
#55

YOLOE: Real-Time Seeing Anything

Ao Wang, Lihao Liu, Hui Chen et al.

ICCV 2025 • poster • arXiv:2503.07465 • 34 citations
#56

From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu et al.

ICCV 2025 • poster • arXiv:2503.06923 • 34 citations
#57

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu, Wenwei Zhang, Lumin Xu et al.

ICCV 2025 • poster • arXiv:2503.21979 • 33 citations
#58

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang et al.

ICCV 2025 • poster • arXiv:2412.03859 • 33 citations
#59

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang, Siwei Wen, Zichen Wen et al.

ICCV 2025 • highlight • arXiv:2503.15264 • 32 citations
#60

VCA: Video Curious Agent for Long Video Understanding

Zeyuan Yang, Delin Chen, Xueyang Yu et al.

ICCV 2025 • poster • arXiv:2412.10471 • 31 citations
#61

Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

yifei xia, Suhan Ling, Fangcheng Fu et al.

ICCV 2025 • poster • arXiv:2502.21079 • 30 citations
#62

Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks

Hailong Guo, Bohan Zeng, Yiren Song et al.

ICCV 2025 • poster • arXiv:2501.15891 • 30 citations
#63

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Liming Jiang, Qing Yan, Yumin Jia et al.

ICCV 2025 • highlight • arXiv:2503.16418 • 29 citations
#64

Long-Context State-Space Video World Models

Ryan Po, Yotam Nitzan, Richard Zhang et al.

ICCV 2025 • poster • arXiv:2505.20171 • 28 citations
#65

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Le Zhuo, Liangbing Zhao, Sayak Paul et al.

ICCV 2025 • poster • arXiv:2504.16080 • 28 citations
#66

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Haochen Wang, Yucheng Zhao, Tiancai Wang et al.

ICCV 2025 • poster • arXiv:2504.01901 • 28 citations
#67

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.

ICCV 2025 • poster • arXiv:2501.04931 • 28 citations
#68

VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Haiping Wang, Yuan Liu, Ziwei Liu et al.

ICCV 2025 • poster • arXiv:2410.16892 • 27 citations
#69

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

Yiren Song, Danze Chen, Mike Zheng Shou

ICCV 2025 • poster • arXiv:2502.01105 • 26 citations
#70

CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

Jungdae Lee, Taiki Miyanishi, Shuhei Kurita et al.

ICCV 2025 • poster • arXiv:2406.14240 • 25 citations
#71

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Rongchang Xie, Chen Du, Ping Song et al.

ICCV 2025 • poster • arXiv:2411.17762 • 25 citations
#72

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

Walid Bousselham, Angie Boggust, Sofian Chaybouti et al.

ICCV 2025 • poster • arXiv:2404.03214 • 24 citations
#73

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Zhen Xing, Qi Dai, Zejia Weng et al.

ICCV 2025 • poster • arXiv:2406.06465 • 24 citations
#74

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.

ICCV 2025 • highlight • arXiv:2411.19325 • 24 citations
#75

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Rongyao Fang, Chengqi Duan, Kun Wang et al.

ICCV 2025 • poster • arXiv:2410.13861 • 24 citations
#76

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

ZIYU ZHU, Xilin Wang, Yixuan Li et al.

ICCV 2025 • highlight • arXiv:2507.04047 • 24 citations
#77

VSSD: Vision Mamba with Non-Causal State Space Duality

Yuheng Shi, Mingjia Li, Minjing Dong et al.

ICCV 2025 • poster • arXiv:2407.18559 • 24 citations
#78

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu, Kun yuan, Yaling Shen et al.

ICCV 2025 • poster • arXiv:2411.15421 • 23 citations
#79

RadGPT: Constructing 3D Image-Text Tumor Datasets

Pedro Bassi, Mehmet Yavuz, Ibrahim Ethem Hamamci et al.

ICCV 2025 • poster • arXiv:2501.04678 • 23 citations
#80

Epona: Autoregressive Diffusion World Model for Autonomous Driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu et al.

ICCV 2025 • poster • arXiv:2506.24113 • 23 citations
#81

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Yuxuan Cai, Jiangning Zhang, Haoyang He et al.

ICCV 2025 • poster • arXiv:2410.16236 • 23 citations
#82

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Kyle Sargent, Kyle Hsu, Justin Johnson et al.

ICCV 2025 • poster • arXiv:2503.11056 • 23 citations
#83

Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.

ICCV 2025 • poster • arXiv:2506.22139 • 23 citations
#84

MotionFollower: Editing Video Motion via Score-Guided Diffusion

Shuyuan Tu, Qi Dai, Zihao Zhang et al.

ICCV 2025 • poster • 22 citations
#85

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Tianwei Xiong, Jun Hao Liew, Zilong Huang et al.

ICCV 2025 • poster • arXiv:2504.08736 • 22 citations
#86

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Yuxuan Luo, Zhengkun Rong, Lizhen Wang et al.

ICCV 2025 • poster • arXiv:2504.01724 • 22 citations
#87

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Rui Xie, Yinhong Liu, Penghao Zhou et al.

ICCV 2025 • poster • arXiv:2501.02976 • 22 citations
#88

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Guosheng Zhao, Xiaofeng Wang, Chaojun Ni et al.

ICCV 2025 • poster • arXiv:2503.18438 • 22 citations
#89

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina et al.

ICCV 2025 • poster • arXiv:2411.19331 • 22 citations
#90

FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Tianyu Fu, Tengxuan Liu, Qinghao Han et al.

ICCV 2025 • poster • arXiv:2501.01986 • 22 citations
#91

Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Tianhao Wu, Chuanxia Zheng, Frank Guan et al.

ICCV 2025 • poster • arXiv:2503.13439 • 22 citations
#92

FonTS: Text Rendering With Typography and Style Controls

Wenda SHI, Yiren Song, Dengming Zhang et al.

ICCV 2025 • poster • arXiv:2412.00136 • 21 citations
#93

MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

Yuechen Zhang, YaoYang Liu, Bin Xia et al.

ICCV 2025 • poster • arXiv:2501.03931 • 21 citations
#94

MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes

XINJIE ZHANG, Zhening Liu, Yifan Zhang et al.

ICCV 2025 • highlight • arXiv:2410.13613 • 21 citations
#95

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Ma Teng, Xiaojun Jia, Ranjie Duan et al.

ICCV 2025 • poster • arXiv:2412.05934 • 21 citations
#96

IRASim: A Fine-Grained World Model for Robot Manipulation

Fangqi Zhu, Hongtao Wu, Song Guo et al.

ICCV 2025 • poster • arXiv:2406.14540 • 21 citations
#97

Scalable Ranked Preference Optimization for Text-to-Image Generation

Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata et al.

ICCV 2025 • poster • arXiv:2410.18013 • 21 citations
#98

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Yiwu Zhong, Zhuoming Liu, Yin Li et al.

ICCV 2025 • poster • arXiv:2412.03248 • 21 citations
#99

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao, Zhengyuan Yang, Linjie Li et al.

ICCV 2025 • poster • arXiv:2503.19312 • 21 citations
#100

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction

Yuhui WU, Liyi Chen, Ruibin Li et al.

ICCV 2025 • poster • arXiv:2503.20287 • 21 citations
#101

Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Shufan Li, Konstantinos Kallidromitis, Akash Gokul et al.

ICCV 2025 • poster • arXiv:2503.12271 • 21 citations
#102

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Jiazhe Guo, Yikang Ding, Xiwu Chen et al.

ICCV 2025 • poster • arXiv:2503.15208 • 21 citations
#103

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Boyu Chen, Zhengrong Yue, Siran Chen et al.

ICCV 2025 • poster • arXiv:2503.10200 • 21 citations
#104

STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen et al.

ICCV 2025 • poster • arXiv:2412.07730 • 20 citations
#105

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Zhong-Yu Li, Ruoyi Du, Juncheng Yan et al.

ICCV 2025 • poster • arXiv:2504.07960 • 20 citations
#106

V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

Zewei Zhou, Hao Xiang, Zhaoliang Zheng et al.

ICCV 2025 • poster • arXiv:2412.01812 • 20 citations
#107

VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li, Philip Torr, Andrea Vedaldi et al.

ICCV 2025 • highlight • arXiv:2506.18903 • 20 citations
#108

Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Massimiliano Viola, Kevin Qu, Nando Metzger et al.

ICCV 2025 • poster • arXiv:2412.13389 • 20 citations
#109

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Weixian Lei, Jiacong Wang, Haochen Wang et al.

ICCV 2025 • highlight • arXiv:2504.10462 • 20 citations
#110

OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

Ding Zhong, Xu Zheng, Chenfei Liao et al.

ICCV 2025 • highlight • arXiv:2503.07098 • 20 citations
#111

PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology

Fatemeh Ghezloo, Saygin Seyfioglu, Rustin Soraki et al.

ICCV 2025 • poster • arXiv:2502.08916 • 20 citations
#112

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Chaojun Ni, Xiaofeng Wang, Zheng Zhu et al.

ICCV 2025 • poster • arXiv:2504.02261 • 20 citations
#113

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Yue Li, Qi Ma, Runyi Yang et al.

ICCV 2025 • poster • arXiv:2503.18052 • 20 citations
#114

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.

ICCV 2025 • poster • arXiv:2504.15485 • 19 citations
#115

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Tian-Xing Xu, Xiangjun Gao, Wenbo Hu et al.

ICCV 2025 • poster • arXiv:2504.01016 • 19 citations
#116

Towards a Unified Copernicus Foundation Model for Earth Vision

Yi Wang, Zhitong Xiong, Chenying Liu et al.

ICCV 2025 • poster • arXiv:2503.11849 • 19 citations
#117

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Quanhao Li, Zhen Xing, Rui Wang et al.

ICCV 2025 • poster • arXiv:2503.16421 • 19 citations
#118

VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs

Qiucheng Wu, Handong Zhao, Michael Saxon et al.

ICCV 2025 • poster • 18 citations
#119

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Haiwen Diao, Xiaotong Li, Yufeng Cui et al.

ICCV 2025 • highlight • arXiv:2502.06788 • 18 citations
#120

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

Jiale Xu, Shenghua Gao, Ying Shan

ICCV 2025 • poster • arXiv:2412.09573 • 18 citations
#121

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

Danila Rukhovich, Elona Dupont, Dimitrios Mallis et al.

ICCV 2025 • poster • arXiv:2412.14042 • 17 citations
#122

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Yuqing Wang, Zhijie Lin, Yao Teng et al.

ICCV 2025 • poster • arXiv:2503.16430 • 17 citations
#123

Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

Li, Nikolaos Tsagkas, Jifei Song et al.

ICCV 2025 • poster • arXiv:2408.10123 • 17 citations
#124

LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement

Jieming Bian, Lei Wang, Letian Zhang et al.

ICCV 2025 • poster • arXiv:2411.14961 • 17 citations
#125

VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

Xindi Yang, Baolu Li, Yiming Zhang et al.

ICCV 2025 • poster • arXiv:2503.23368 • 17 citations
#126

3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

Xiaobiao Du, Yida Wang, Haiyang Sun et al.

ICCV 2025 • poster • arXiv:2406.04875 • 17 citations
#127

USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Xiangxiang Chu, Renda Li, Yong Wang

ICCV 2025 • poster • arXiv:2503.06132 • 17 citations
#128

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Tianqi Liu, Zihao Huang, Zhaoxi Chen et al.

ICCV 2025 • poster • arXiv:2503.20785 • 17 citations
#129

TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

Xiaowen Ma, Zhen-Liang Ni, Xinghao Chen

ICCV 2025 • poster • arXiv:2411.17473 • 17 citations
#130

Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

Hongcheng Gao, Tianyu Pang, Chao Du et al.

ICCV 2025 • poster • arXiv:2410.12777 • 17 citations
#131

Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training

Zhenxin Li, Shihao Wang, Shiyi Lan et al.

ICCV 2025 • poster • arXiv:2503.12030 • 17 citations
#132

VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

Yating Wang, Haoyi Zhu, Mingyu Liu et al.

ICCV 2025 • poster • arXiv:2507.01016 • 16 citations
#133

DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Jiawei He, Danshi Li, Xinqiang Yu et al.

ICCV 2025 • highlight • arXiv:2507.02747 • 16 citations
#134

Neighboring Autoregressive Modeling for Efficient Visual Generation

Yefei He, Yuanyu He, Shaoxuan He et al.

ICCV 2025 • poster • arXiv:2503.10696 • 16 citations
#135

Scalable Image Tokenization with Index Backpropagation Quantization

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge et al.

ICCV 2025 • poster • arXiv:2412.02692 • 16 citations
#136

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye, Honglin Lin, Leyan Ou et al.

ICCV 2025 • poster • arXiv:2412.17007 • 16 citations
#137

ViLLa: Video Reasoning Segmentation with Large Language Model

rongkun Zheng, Lu Qi, Xi Chen et al.

ICCV 2025 • poster • arXiv:2407.14500 • 16 citations
#138

NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments

Xuan Yao, Junyu Gao, Changsheng Xu

ICCV 2025 • poster • arXiv:2506.23468 • 16 citations
#139

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

Yuqi Wu, Wenzhao Zheng, Sicheng Zuo et al.

ICCV 2025 • poster • arXiv:2412.04380 • 16 citations
#140

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Yuanhao Cai, He Zhang, Kai Zhang et al.

ICCV 2025 • poster • arXiv:2411.14384 • 16 citations
#141

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

Haoxuan Wang, Jinlong Peng, Qingdong He et al.

ICCV 2025 • poster • arXiv:2503.09277 • 16 citations
#142

Video-T1: Test-time Scaling for Video Generation

Fangfu Liu, Hanyang Wang, Yimo Cai et al.

ICCV 2025 • poster • arXiv:2503.18942 • 15 citations
#143

AllTracker: Efficient Dense Point Tracking at High Resolution

Adam Harley, Yang You, Yang Zheng et al.

ICCV 2025 • poster • arXiv:2506.07310 • 15 citations
#144

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Shijie Zhou, Alexander Vilesov, Xuehai He et al.

ICCV 2025 • poster • arXiv:2508.02095 • 15 citations
#145

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Dewei Zhou, Mingwei Li, Zongxin Yang et al.

ICCV 2025 • poster • arXiv:2503.12885 • 15 citations
#146

BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis

David Svitov, Pietro Morerio, Lourdes Agapito et al.

ICCV 2025 • poster • arXiv:2411.08508 • 15 citations
#147

VMBench: A Benchmark for Perception-Aligned Video Motion Generation

Xinran Ling, Chen Zhu, Meiqi Wu et al.

ICCV 2025 • poster • arXiv:2503.10076 • 15 citations
#148

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow

Simon Boeder, Fabian Gigengack, Benjamin Risse

ICCV 2025 • poster • arXiv:2502.17288 • 15 citations
#149

An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar et al.

ICCV 2025 • poster • arXiv:2501.05453 • 15 citations
#150

FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance

Haicheng Wang, Zhemeng Yu, Gabriele Spadaro et al.

ICCV 2025 • poster • arXiv:2501.02430 • 15 citations
#151

OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Yongsheng Yu, Ziyun Zeng, Haitian Zheng et al.

ICCV 2025 • poster • arXiv:2503.08677 • 15 citations
#152

V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video

Jianqi Chen, Biao Zhang, Xiangjun Tang et al.

ICCV 2025 • poster • arXiv:2503.09631 • 15 citations
#153

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

JIACHENG RUAN, Wenzhen Yuan, Xian Gao et al.

ICCV 2025 • poster • arXiv:2503.07478 • 15 citations
#154

LONG3R: Long Sequence Streaming 3D Reconstruction

Zhuoguang Chen, Minghui Qin, Tianyuan Yuan et al.

ICCV 2025 • poster • arXiv:2507.18255 • 14 citations
#155

Generating Multi-Image Synthetic Data for Text-to-Image Customization

Nupur Kumari, Xi Yin, Jun-Yan Zhu et al.

ICCV 2025 • poster • arXiv:2502.01720 • 14 citations
#156

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen, Bohan Liu, Chenjia Li et al.

ICCV 2025 • poster • arXiv:2503.21056 • 14 citations
#157

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Han Wang, Yuxiang Nie, Yongjie Ye et al.

ICCV 2025 • poster • arXiv:2412.09530 • 14 citations
#158

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

Yi Chen, Yuying Ge, Weiliang Tang et al.

ICCV 2025 • poster • arXiv:2412.04445 • 14 citations
#159

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Junsung Park, Jungbeom Lee, Jongyoon Song et al.

ICCV 2025 • poster • arXiv:2501.10913 • 14 citations
#160

Unleashing Vecset Diffusion Model for Fast Shape Generation

Zeqiang Lai, Zhao Yunfei, Zibo Zhao et al.

ICCV 2025 • highlight • arXiv:2503.16302 • 14 citations
#161

CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance

Jinming Li, Yichen Zhu, Zhibin Tang et al.

ICCV 2025 • poster • 13 citations
#162

Efficient Track Anything

Yunyang Xiong, Chong Zhou, Xiaoyu Xiang et al.

ICCV 2025 • poster • arXiv:2411.18933 • 13 citations
#163

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen, Yifang Men, Yuan Yao et al.

ICCV 2025 • poster • arXiv:2501.05020 • 13 citations
#164

RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation

Feng yan, Fanfan Liu, Yiyang Huang et al.

ICCV 2025 • poster • arXiv:2412.07215 • 13 citations
#165

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

Jiarui Wang, Huiyu Duan, Yu Zhao et al.

ICCV 2025 • highlight • arXiv:2504.08358 • 13 citations
#166

Detect Anything 3D in the Wild

Hanxue Zhang, Haoran Jiang, Qingsong Yao et al.

ICCV 2025 • poster • arXiv:2504.07958 • 13 citations
#167

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.

ICCV 2025 • poster • arXiv:2507.02664 • 13 citations
#168

UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

Yuping Wang, Xiangyu Huang, Xiaokang Sun et al.

ICCV 2025 • poster • arXiv:2503.24381 • 13 citations
#169

Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs

Soonbin Lee, Fangwen Shu, Yago Sanchez de la Fuente et al.

ICCV 2025 • poster • arXiv:2501.03399 • 13 citations
#170

Referring to Any Person

Qing Jiang, Lin Wu, Zhaoyang Zeng et al.

ICCV 2025 • poster • arXiv:2503.08507 • 13 citations
#171

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Jiale Cheng, Ruiliang Lyu, Xiaotao Gu et al.

ICCV 2025 • poster • arXiv:2503.20491 • 13 citations
#172

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Yue Li, Meng Tian, Zhenyu Lin et al.

ICCV 2025 • poster • arXiv:2503.21505 • 12 citations
#173

LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

Juliette Marrie, Romain Menegaux, Michael Arbel et al.

ICCV 2025 • poster • arXiv:2410.14462 • 12 citations
#174

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Sung-Yeon Park, Can Cui, Yunsheng Ma et al.

ICCV 2025 • poster • arXiv:2503.12772 • 12 citations
#175

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Junwei Luo, Yingying Zhang, Xue Yang et al.

ICCV 2025 • poster • arXiv:2503.07588 • 12 citations
#176

FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Yabo Zhang, xinpeng zhou, Yihan Zeng et al.

ICCV 2025 • poster • arXiv:2501.08225 • 12 citations
#177

Mobile Video Diffusion

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas et al.

ICCV 2025 • poster • arXiv:2412.07583 • 12 citations
#178

MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

Hui Sun, Shiyin Lu, Huanyu Wang et al.

ICCV 2025 • poster • arXiv:2501.02885 • 12 citations
#179

MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul et al.

ICCV 2025 • poster • arXiv:2410.10780 • 12 citations
#180

BVINet: Unlocking Blind Video Inpainting with Zero Annotations

zhiliang wu, Kerui Chen, Kun Li et al.

ICCV 2025 • poster • arXiv:2502.01181 • 12 citations
#181

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Ming Li, Xin Gu, Fan Chen et al.

ICCV 2025 • poster • arXiv:2505.02370 • 12 citations
#182

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

Cong Wei, Yujie Zhong, yingsen zeng et al.

ICCV 2025 • poster • arXiv:2412.14006 • 12 citations
#183

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Tianyi Wei, Yifan Zhou, Dongdong Chen et al.

ICCV 2025 • poster • arXiv:2503.16153 • 12 citations
#184

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Xinyu Fang, Zhijian Chen, Kai Lan et al.

ICCV 2025 • poster • arXiv:2503.14478 • 12 citations
#185

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Jeongseok Hyun, Sukjun Hwang, Su Ho Han et al.

ICCV 2025 • poster • arXiv:2507.07990 • 12 citations
#186

LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

Chin-Yang Lin, Cheng Sun, Fu-En Yang et al.

ICCV 2025 • poster • arXiv:2508.14041 • 12 citations
#187

Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Weirong Chen, Ganlin Zhang, Felix Wimbauer et al.

ICCV 2025 • poster • arXiv:2504.14516 • 12 citations
#188

RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Yijing Lin, Mengqi Huang, Shuhan Zhuang et al.

ICCV 2025 • poster • arXiv:2503.10406 • 12 citations
#189

MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Ziyan Guo, Zeyu HU, Na Zhao et al.

ICCV 2025 • poster • arXiv:2502.02358 • 12 citations
#190

ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

Yulin Pan, Xiangteng He, Chaojie Mao et al.

ICCV 2025 • poster • arXiv:2503.14482 • 12 citations
#191

Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Jiawen Zhu, YEW-SOON ONG, Chunhua Shen et al.

ICCV 2025 • poster • arXiv:2410.10289 • 12 citations
#192

An OpenMind for 3D Medical Vision Self-supervised Learning

Tassilo Wald, Constantin Ulrich, Jonathan Suprijadi et al.

ICCV 2025 • poster • arXiv:2412.17041 • 12 citations
#193

DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Yatian Pang, Bin Zhu, Bin Lin et al.

ICCV 2025 • poster • arXiv:2412.00397 • 12 citations
#194

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Yuheng Shi, Minjing Dong, Chang Xu

ICCV 2025 • poster • arXiv:2411.09219 • 12 citations
#195

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Tsu-Jui Fu, Yusu Qian, Chen Chen et al.

ICCV 2025 • poster • arXiv:2503.12652 • 12 citations
#196

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

Yue Fan, Xiaojian Ma, Rongpeng Su et al.

ICCV 2025 • highlight • arXiv:2501.00358 • 11 citations
#197

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

Zhijian Huang, Chengjian Feng, Baihui Xiao et al.

ICCV 2025 • poster • arXiv:2412.07689 • 11 citations
#198

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

Yiran Qin, Li Kang, Xiufeng Song et al.

ICCV 2025 • poster • arXiv:2503.16408 • 11 citations
#199

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.

ICCV 2025 • highlight • arXiv:2503.07535 • 11 citations
#200

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Rui Chen, Zehuan Wu, Yichen Liu et al.

ICCV 2025 • poster • arXiv:2412.04842 • 11 citations