NeurIPS Oral "vision-language models" Papers

16 papers found

3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

Xiaotang Gai, Jiaxiang Liu, Yichen Li et al.

NeurIPS 2025 · Oral · arXiv:2506.11147 · 4 citations

CrypticBio: A Large Multimodal Dataset for Visually Confusing Species

Georgiana Manolache, Gerard Schouten, Joaquin Vanschoren

NeurIPS 2025 · Oral

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Ankan Deria, Adinath Dukre, Feilong Tang et al.

NeurIPS 2025 · Oral · arXiv:2506.15649

Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency

Xiangyu Guo, Zhanqian Wu, Kaixin Xiong et al.

NeurIPS 2025 · Oral · arXiv:2506.07497 · 9 citations

HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

Zelin Peng, Zhengqin Xu, Qingyang Liu et al.

NeurIPS 2025 · Oral · arXiv:2510.20322 · 1 citation

Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI Coordination

Rakshit Trivedi, Kartik Sharma, David Parkes

NeurIPS 2025 · Oral

Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Zhitao Zeng, Guojian Yuan, Junyuan Mao et al.

NeurIPS 2025 · Oral · arXiv:2509.17429

NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin Bercea, Jun Li, Philipp Raffler et al.

NeurIPS 2025 · Oral · 7 citations

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Zheyu Zhang, Ziqi Pang, Shixing Chen et al.

NeurIPS 2025 · Oral

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NeurIPS 2025 · Oral · arXiv:2504.13180 · 45 citations

PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models

Ruiqi Wang, Dezhong Zhao, Ziqin Yuan et al.

NeurIPS 2025 · Oral · arXiv:2509.15607

ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Philip Schroeder, Ondrej Biza, Thomas Weng et al.

NeurIPS 2025 · Oral · arXiv:2508.01943

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang et al.

NeurIPS 2025 · Oral · arXiv:2509.01907 · 2 citations

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin et al.

NeurIPS 2025 · Oral · arXiv:2506.06218 · 4 citations

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Anurag Arnab, Ahmet Iscen, Mathilde Caron et al.

NeurIPS 2025 · Oral · arXiv:2507.02001 · 8 citations

TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models

Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier et al.

NeurIPS 2025 · Oral · arXiv:2512.01048