NeurIPS 2025 "multimodal large language models" Papers

41 papers found

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Qiong Wu, Wenhao Lin, Yiyi Zhou et al.

NeurIPS 2025 poster · arXiv:2411.19628 · 5 citations

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

Zhucun Xue, Jiangning Zhang, Xurong Xie et al.

NeurIPS 2025 poster · arXiv:2506.13589 · 7 citations

Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Xiaojun Jia, Sensen Gao, Simeng Qin et al.

NeurIPS 2025 poster · arXiv:2505.21494 · 12 citations

ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Lingfeng Wang, Hualing Lin, Senda Chen et al.

NeurIPS 2025 poster · arXiv:2505.16495 · 2 citations

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Han Lin, Jaemin Cho, Amir Zadeh et al.

NeurIPS 2025 poster · arXiv:2508.05954 · 6 citations

Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention

Xin Zou, Di Lu, Yizhou Wang et al.

NeurIPS 2025 poster · arXiv:2510.02912 · 7 citations

DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning

Qi Cao, Ruiyi Wang, Ruiyi Zhang et al.

NeurIPS 2025 poster · arXiv:2505.20241 · 5 citations

EgoBlind: Towards Egocentric Visual Assistance for the Blind

Junbin Xiao, Nanxin Huang, Hao Qiu et al.

NeurIPS 2025 poster · arXiv:2503.08221 · 8 citations

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang, Weiye Xu, Aijun Yang et al.

NeurIPS 2025 poster · arXiv:2511.10648

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Jiani Huang, Amish Sethi, Matthew Kuo et al.

NeurIPS 2025 oral · arXiv:2510.15963

Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models

Hai Yan, Haijian Ma, Xiaowen Cai et al.

NeurIPS 2025 poster

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Jin Wang, Yao Lai, Aoxue Li et al.

NeurIPS 2025 spotlight · arXiv:2505.20147 · 20 citations

GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Fengxiang Wang, Mingshuo Chen, Yueying Li et al.

NeurIPS 2025 spotlight · arXiv:2505.21375 · 11 citations

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Kaihang Pan, Yang Wu, Wendong Bu et al.

NeurIPS 2025 poster · arXiv:2506.01480 · 6 citations

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng, Shijia Huang, Yanyang Li et al.

NeurIPS 2025 poster · arXiv:2505.24625 · 24 citations

Learning to Instruct for Visual Instruction Tuning

Zhihan Zhou, Feng Hong, Jiaan Luo et al.

NeurIPS 2025 poster · arXiv:2503.22215 · 3 citations

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework

Xuan Wang, Siyuan Liang, Dongping Liao et al.

NeurIPS 2025 poster · arXiv:2503.16872 · 4 citations

MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM

Bowen Dong, Minheng Ni, Zitong Huang et al.

NeurIPS 2025 poster · arXiv:2505.24238 · 2 citations

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.

NeurIPS 2025 oral · arXiv:2505.12826 · 1 citation

MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.

NeurIPS 2025 poster · arXiv:2506.01946 · 17 citations

Multimodal Tabular Reasoning with Privileged Structured Information

Jun-Peng Jiang, Yu Xia, Hai-Long Sun et al.

NeurIPS 2025 poster · arXiv:2506.04088 · 6 citations

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NeurIPS 2025 poster · arXiv:2511.07250 · 2 citations

RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Rongyang Zhang, Yuqing Huang, Chengqiang Lu et al.

NeurIPS 2025 poster · arXiv:2512.05119

Revealing Multimodal Causality with Large Language Models

Jin Li, Shoujin Wang, Qi Zhang et al.

NeurIPS 2025 poster · arXiv:2509.17784

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Shuhang Xun, Sicheng Tao, Jungang Li et al.

NeurIPS 2025 poster · arXiv:2505.02064 · 5 citations

SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs

Jinhong Deng, Wen Li, Joey Tianyi Zhou et al.

NeurIPS 2025 poster · arXiv:2510.24214

SpatialLM: Training Large Language Models for Structured Indoor Modeling

Yongsen Mao, Junhao Zhong, Chuan Fang et al.

NeurIPS 2025 poster · arXiv:2506.07491 · 21 citations

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

Hao Yin, Guangzong Si, Zilei Wang

NeurIPS 2025 poster · arXiv:2504.10020

un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Yinqi Li, Jiahe Zhao, Hong Chang et al.

NeurIPS 2025 poster · arXiv:2505.24517 · 1 citation

Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

Zeqian Li, Shangzhe Di, Zhonghua Zhai et al.

NeurIPS 2025 oral · arXiv:2506.18883 · 9 citations

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Zaiquan Yang, Yuhao Liu, Gerhard Hancke et al.

NeurIPS 2025 oral · arXiv:2509.15178 · 2 citations

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

Ruilin Luo, Zhuofan Zheng, Lei Wang et al.

NeurIPS 2025 poster · arXiv:2501.04686 · 29 citations

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Ziang Yan, Yinan He, Xinhao Li et al.

NeurIPS 2025 oral · arXiv:2509.21100 · 13 citations

Video Perception Models for 3D Scene Synthesis

Rui Huang, Guangyao Zhai, Zuria Bauer et al.

NeurIPS 2025 poster · arXiv:2506.20601 · 5 citations

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li et al.

NeurIPS 2025 oral · arXiv:2503.21776 · 236 citations

Vid-SME: Membership Inference Attacks against Large Video Understanding Models

Qi Li, Runpeng Yu, Xinchao Wang

NeurIPS 2025 oral · arXiv:2506.03179 · 5 citations

VisualLens: Personalization through Task-Agnostic Visual History

Wang Bill Zhu, Deqing Fu, Kai Sun et al.

NeurIPS 2025 poster · arXiv:2411.16034

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu, Haojia Lin, Xiong Wang et al.

NeurIPS 2025 spotlight · arXiv:2501.01957 · 130 citations

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li, Xian Zhang, Yongxin Guo et al.

NeurIPS 2025 oral · arXiv:2505.18110 · 3 citations

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Seonghwan Kim, Junhee Cho et al.

NeurIPS 2025 spotlight · arXiv:2505.15277 · 8 citations

You Only Communicate Once: One-shot Federated Low-Rank Adaptation of MLLM

Binqian Xu, Haiyang Mei, Zechen Bai et al.

NeurIPS 2025 poster