NEURIPS "vision-language models" Papers

131 papers found • Page 3 of 3

SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

Ehsan Latif, Zirak Khan, Xiaoming Zhai

NeurIPS 2025 • poster • arXiv:2507.22904

SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Wufei Ma, Yu-Cheng Chou, Qihao Liu et al.

NeurIPS 2025 • poster • arXiv:2504.20024 • 21 citations

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Haoyu Zhang, Meng Liu, Zaijing Li et al.

NeurIPS 2025 • spotlight • arXiv:2506.03642 • 7 citations

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Yang Liu, Ming Ma, Xiaomin Yu et al.

NeurIPS 2025 • poster • arXiv:2505.12448 • 19 citations

Statistics Caching Test-Time Adaptation for Vision-Language Models

Zenghao Guan, Yucan Zhou, Wu Liu et al.

NeurIPS 2025 • poster

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin et al.

NeurIPS 2025 • oral • arXiv:2506.06218 • 4 citations

TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models

Hsin Yi Hsieh, Shang-Wei Liu, Chang-Chih Meng et al.

NeurIPS 2025 • poster

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Anurag Arnab, Ahmet Iscen, Mathilde Caron et al.

NeurIPS 2025 • oral • arXiv:2507.02001 • 8 citations

Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Mehrdad Noori, David Osowiechi, Gustavo Vargas Hakim et al.

NeurIPS 2025 • poster • arXiv:2505.21844 • 4 citations

Text to Sketch Generation with Multi-Styles

Tengjie Li, Shikui Tu, Lei Xu

NeurIPS 2025 • poster • arXiv:2511.04123

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

Lijun Sheng, Jian Liang, Ran He et al.

NeurIPS 2025 • poster • arXiv:2506.24000 • 1 citation

The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models

Alessandro Serra, Francesco Ortu, Emanuele Panizon et al.

NeurIPS 2025 • poster • arXiv:2412.06646 • 1 citation

Training-Free Test-Time Adaptation via Shape and Style Guidance for Vision-Language Models

Shenglong Zhou, Manjiang Yin, Leiyu Sun et al.

NeurIPS 2025 • poster

TRAP: Targeted Redirecting of Agentic Preferences

Hangoo Kang, Jehyeok Yeon, Gagandeep Singh

NeurIPS 2025 • poster • arXiv:2505.23518 • 2 citations

Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation

Jusheng Zhang, Yijia Fan, Zimo Wen et al.

NeurIPS 2025 • poster

TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models

Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier et al.

NeurIPS 2025 • oral • arXiv:2512.01048

Unified Reinforcement and Imitation Learning for Vision-Language Models

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro et al.

NeurIPS 2025 • poster • arXiv:2510.19307 • 2 citations

VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

Silin Cheng, Kai Han

NeurIPS 2025 • poster • arXiv:2511.22664 • 1 citation

VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

Run Luo, Renke Shan, Longze Chen et al.

NeurIPS 2025 • poster

VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj et al.

NeurIPS 2025 • poster • arXiv:2505.15952 • 4 citations

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Zongxia Li, Xiyang Wu, Guangyao Shi et al.

NeurIPS 2025 • poster • arXiv:2505.01481 • 13 citations

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang, Xiufeng Song, Heng Zhou et al.

NeurIPS 2025 • poster • arXiv:2506.09049 • 8 citations

Vision-centric Token Compression in Large Language Model

Ling Xing, Alex Jinpeng Wang, Rui Yan et al.

NeurIPS 2025 • spotlight • arXiv:2502.00791 • 7 citations

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.

NeurIPS 2025 • poster • arXiv:2507.07104 • 2 citations

Vision Transformers Don't Need Trained Registers

Nicholas Jiang, Amil Dravid, Alexei Efros et al.

NeurIPS 2025 • spotlight • arXiv:2506.08010 • 12 citations

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NeurIPS 2025 • poster • arXiv:2509.15235 • 2 citations

VLMs can Aggregate Scattered Training Patches

Zhanhui Zhou, Lingjie Chen, Chao Yang et al.

NeurIPS 2025 • poster • arXiv:2506.03614

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NeurIPS 2025 • spotlight • arXiv:2504.08837 • 175 citations

Vocabulary-Guided Gait Recognition

Panjian Huang, Saihui Hou, Chunshui Cao et al.

NeurIPS 2025 • poster

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng et al.

NeurIPS 2025 • poster • arXiv:2509.25033 • 2 citations

Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

Yangfu Li, Hongjian Zhan, Tianyi Chen et al.

NeurIPS 2025 • poster • arXiv:2505.10118 • 1 citation