NeurIPS "vision-language models" Papers
131 papers found • Page 3 of 3
SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches
Ehsan Latif, Zirak Khan, Xiaoming Zhai
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
Wufei Ma, Yu-Cheng Chou, Qihao Liu et al.
Spatial Understanding from Videos: Structured Prompts Meet Simulation Data
Haoyu Zhang, Meng Liu, Zaijing Li et al.
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Yang Liu, Ming Ma, Xiaomin Yu et al.
Statistics Caching Test-Time Adaptation for Vision-Language Models
Zenghao Guan, Yucan Zhou, Wu Liu et al.
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving
Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin et al.
TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models
Hsin Yi Hsieh, Shang-Wei Liu, Chang-Chih Meng et al.
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
Anurag Arnab, Ahmet Iscen, Mathilde Caron et al.
Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation
Mehrdad Noori, David Osowiechi, Gustavo Vargas Hakim et al.
Text to Sketch Generation with Multi-Styles
Tengjie Li, Shikui Tu, Lei Xu
The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models
Lijun Sheng, Jian Liang, Ran He et al.
The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models
Alessandro Serra, Francesco Ortu, Emanuele Panizon et al.
Training-Free Test-Time Adaptation via Shape and Style Guidance for Vision-Language Models
Shenglong Zhou, Manjiang Yin, Leiyu Sun et al.
TRAP: Targeted Redirecting of Agentic Preferences
Hangoo Kang, Jehyeok Yeon, Gagandeep Singh
Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation
Jusheng Zhang, Yijia Fan, Zimo Wen et al.
TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models
Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier et al.
Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro et al.
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng, Kai Han
VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning
Run Luo, Renke Shan, Longze Chen et al.
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj et al.
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Zongxia Li, Xiyang Wu, Guangyao Shi et al.
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Li Kang, Xiufeng Song, Heng Zhou et al.
Vision-centric Token Compression in Large Language Model
Ling Xing, Alex Jinpeng Wang, Rui Yan et al.
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.
Vision Transformers Don't Need Trained Registers
Nicholas Jiang, Amil Dravid, Alexei Efros et al.
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Jialiang Kang, Han Shu, Wenshuo Li et al.
VLMs can Aggregate Scattered Training Patches
Zhanhui Zhou, Lingjie Chen, Chao Yang et al.
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang et al.
Vocabulary-Guided Gait Recognition
Panjian Huang, Saihui Hou, Chunshui Cao et al.
VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Wenhao Li, Qiangchang Wang, Xianjing Meng et al.
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering
Yangfu Li, Hongjian Zhan, Tianyi Chen et al.