NeurIPS "multimodal large language models" Papers

83 papers found • Page 2 of 2

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei, Liang Zhao, Jianjian Sun et al.

NeurIPS 2025 poster • arXiv:2507.05255 • 14 citations

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Rui Xu, Dakuan Lu, Zicheng Zhao et al.

NeurIPS 2025 spotlight • arXiv:2511.18450 • 2 citations

RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Rongyang Zhang, Yuqing Huang, Chengqiang Lu et al.

NeurIPS 2025 poster • arXiv:2512.05119

Revealing Multimodal Causality with Large Language Models

Jin Li, Shoujin Wang, Qi Zhang et al.

NeurIPS 2025 poster • arXiv:2509.17784

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

ShuHang Xun, Sicheng Tao, Jungang Li et al.

NeurIPS 2025 poster • arXiv:2505.02064 • 5 citations

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

Jiaming Ji, Xinyu Chen, Rui Pan et al.

NeurIPS 2025 poster • arXiv:2503.17682 • 8 citations

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou, Yiheng Wang, Xuming He et al.

NeurIPS 2025 poster • arXiv:2506.10521 • 15 citations

SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs

Jinhong Deng, Wen Li, Joey Tianyi Zhou et al.

NeurIPS 2025 poster • arXiv:2510.24214

See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Pengteng Li, Pinhao Song, Wuyang Li et al.

NeurIPS 2025 oral • arXiv:2509.16087 • 1 citation

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Ruiping Liu, Junwei Zheng, Yufan Chen et al.

NeurIPS 2025 poster • arXiv:2510.11509

SMMILE: An expert-driven benchmark for multimodal medical in-context learning

Melanie Rieff, Maya Varma, Ossian Rabow et al.

NeurIPS 2025 poster • arXiv:2506.21355 • 3 citations

SpatialLM: Training Large Language Models for Structured Indoor Modeling

Yongsen Mao, Junhao Zhong, Chuan Fang et al.

NeurIPS 2025 poster • arXiv:2506.07491 • 21 citations

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang et al.

NeurIPS 2025 oral • arXiv:2509.24871 • 3 citations

Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

Fangrui Zhu, Hanhui Wang, Yiming Xie et al.

NeurIPS 2025 poster • arXiv:2506.04220

Structure-Aware Cooperative Ensemble Evolutionary Optimization on Combinatorial Problems with Multimodal Large Language Models

Jie Zhao, Kang Cheong

NeurIPS 2025 poster • arXiv:2510.21906

The Mirage of Performance Gains: Why Does Contrastive Decoding Fail to Mitigate Object Hallucinations in MLLMs?

Hao Yin, Guangzong Si, Zilei Wang

NeurIPS 2025 poster • arXiv:2504.10020

un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Yinqi Li, Jiahe Zhao, Hong Chang et al.

NeurIPS 2025 poster • arXiv:2505.24517 • 1 citation

Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

Zeqian Li, Shangzhe Di, Zhonghua Zhai et al.

NeurIPS 2025 oral • arXiv:2506.18883 • 9 citations

Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Yunqi Hong, Sohyun An, Andrew Bai et al.

NeurIPS 2025 poster • arXiv:2506.03195 • 1 citation

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Zaiquan Yang, Yuhao Liu, Gerhard Hancke et al.

NeurIPS 2025 oral • arXiv:2509.15178 • 2 citations

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

Ruilin Luo, Zhuofan Zheng, Lei Wang et al.

NeurIPS 2025 poster • arXiv:2501.04686 • 29 citations

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Yuanxin Liu, Rui Zhu, Shuhuai Ren et al.

NeurIPS 2025 poster • arXiv:2503.09949 • 2 citations

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Ziang Yan, Yinan He, Xinhao Li et al.

NeurIPS 2025 oral • arXiv:2509.21100 • 13 citations

Video Perception Models for 3D Scene Synthesis

Rui Huang, Guangyao Zhai, Zuria Bauer et al.

NeurIPS 2025 poster • arXiv:2506.20601 • 5 citations

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li et al.

NeurIPS 2025 oral • arXiv:2503.21776 • 236 citations

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

Qi Wang, Yanrui Yu, Ye Yuan et al.

NeurIPS 2025 oral • arXiv:2505.12434 • 30 citations

Vid-SME: Membership Inference Attacks against Large Video Understanding Models

Qi Li, Runpeng Yu, Xinchao Wang

NeurIPS 2025 oral • arXiv:2506.03179 • 5 citations

VisualLens: Personalization through Task-Agnostic Visual History

Wang Bill Zhu, Deqing Fu, Kai Sun et al.

NeurIPS 2025 poster • arXiv:2411.16034

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu, Haojia Lin, Xiong Wang et al.

NeurIPS 2025 spotlight • arXiv:2501.01957 • 130 citations

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li, Xian Zhang, Yongxin Guo et al.

NeurIPS 2025 oral • arXiv:2505.18110 • 3 citations

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Seonghwan Kim, Junhee Cho et al.

NeurIPS 2025 spotlight • arXiv:2505.15277 • 8 citations

You Only Communicate Once: One-shot Federated Low-Rank Adaptation of MLLM

Binqian Xu, Haiyang Mei, Zechen Bai et al.

NeurIPS 2025 poster

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Xudong Li, Mengdan Zhang, Peixian Chen et al.

NeurIPS 2025 poster • arXiv:2505.22396 • 1 citation