ICLR 2025 Papers
ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
Serin Yang, Taesung Kwon, Jong Chul Ye
VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation
Kuo-Han Hung, Pang-Chi Lo, Jia-Fong Yeh et al.
Video Action Differencing
James Burgess, Xiaohan Wang, Yuhui Zhang et al.
VideoGLUE: Video General Understanding Evaluation of Foundation Models
Boqing Gong, Yin Cui, Long Zhao et al.
VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing
Xiangpeng Yang, Linchao Zhu, Hehe Fan et al.
Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators
Wentao Zhang, Junliang Guo, Tianyu He et al.
VideoPhy: Evaluating Physical Commonsense for Video Generation
Hritik Bansal, Zongyu Lin, Tianyi Xie et al.
VideoShield: Regulating Diffusion-based Video Generation Models via Watermarking
Runyi Hu, Jie Zhang, Yiming Li et al.
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Orr Zohar, Xiaohan Wang, Yonatan Bitton et al.
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang, Yinheng Li, Dan Zhao et al.
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
Tianchen Zhao, Tongcheng Fang, Haofeng Huang et al.
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen et al.
ViSAGe: Video-to-Spatial Audio Generation
Jaeyeon Kim, Heeseung Yun, Gunhee Kim
Vision and Language Synergy for Rehearsal Free Continual Learning
Muhammad Anwar Masum, Mahardhika Pratama, Savitha Ramasamy et al.
Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations
Yudi Xie, Weichen Huang, Esther Alter et al.
Vision Language Models are In-Context Value Learners
Yecheng Jason Ma, Joey Hejna, Chuyuan Fu et al.
Vision-LSTM: xLSTM as Generic Vision Backbone
Benedikt Alkin, Maximilian Beck, Korbinian Pöppel et al.
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Yuchen Duan, Weiyun Wang, Zhe Chen et al.
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.
Visual Agents as Fast and Slow Thinkers
Guangyan Sun, Mingyu Jin, Zhenting Wang et al.
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar et al.
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Tsung-Han Wu, Giscard Biamby, Jerome Quenum et al.
Visually Consistent Hierarchical Image Classification
Seulki Park, Youren Zhang, Stella Yu et al.
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Donghoon Kim, Minji Bae, Kyuhong Shim et al.
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Minheng Ni, YuTao Fan, Lei Zhang et al.
VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning
Yichao Liang, Nishanth Kumar, Hao Tang et al.
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation
Wei Zhao, Pengxiang Ding, Min Zhang et al.
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu et al.
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning
Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Ziyan Jiang, Rui Meng, Xinyi Yang et al.
VLMaterial: Procedural Material Generation with Large Vision-Language Models
Beichen Li, Rundi Wu, Armando Solar-Lezama et al.
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
Nilay Yilmaz, Maitreya Patel, Lawrence Luo et al.
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
Xize Cheng, Ruofan Hu, Xiaoda Yang et al.
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Yumeng Li, William H Beluch, Margret Keuper et al.
VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning
Qingtao Liu, Yu Cui, Zhengnan Sun et al.
VVC-Gym: A Fixed-Wing UAV Reinforcement Learning Environment for Multi-Goal Long-Horizon Problems
Xudong Gong, Dawei Feng, Kele Xu et al.
Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Katie Matton, Robert Ness, John Guttag et al.
Ward: Provable RAG Dataset Inference via LLM Watermarks
Nikola Jovanović, Robin Staab, Maximilian Baader et al.
WardropNet: Traffic Flow Predictions via Equilibrium-Augmented Learning
Kai Jungel, Dario Paccagnan, Axel Parmentier et al.
Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models
Hao-Chien Hsueh, Wen-Hsiao Peng, Ching-Chun Huang
Wasserstein Distances, Neuronal Entanglement, and Sparsity
Shashata Sawmya, Linghao Kong, Ilia Markov et al.
Wasserstein-Regularized Conformal Prediction under General Distribution Shift
Rui Xu, Chao Chen, Yue Sun et al.
Watch Less, Do More: Implicit Skill Discovery for Video-Conditioned Policy
Wang, Zongqing Lu
Watermark Anything With Localized Messages
Tom Sander, Pierre Fernandez, Alain Oliviero Durmus et al.
Wavelet-based Positional Representation for Long Context
Yui Oka, Taku Hasegawa, Kyosuke Nishida et al.
Wavelet Diffusion Neural Operator
Peiyan Hu, Rui Wang, Xiang Zheng et al.
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Shengpeng Ji, Ziyue Jiang, Wen Wang et al.
Wayward Concepts In Multimodal Models
Brandon Trabucco, Max Gurinas, Kyle Doherty et al.
Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors
Peiran Xu, Yadong Mu