2025 "multimodal large language models" Papers

218 papers found • Page 3 of 5

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.

ICCV 2025 • poster • arXiv:2501.04931 • 28 citations

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Kaihang Pan, Yang Wu, Wendong Bu et al.

NEURIPS 2025 • poster • arXiv:2506.01480 • 6 citations

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Junsung Park, Jungbeom Lee, Jongyoon Song et al.

ICCV 2025 • poster • arXiv:2501.10913 • 14 citations

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng, Shijia Huang, Yanyang Li et al.

NEURIPS 2025 • poster • arXiv:2505.24625 • 24 citations

Learning to Instruct for Visual Instruction Tuning

Zhihan Zhou, Feng Hong, Jiaan Luo et al.

NEURIPS 2025 • poster • arXiv:2503.22215 • 3 citations

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework

Xuan Wang, Siyuan Liang, Dongping Liao et al.

NEURIPS 2025 • poster • arXiv:2503.16872 • 4 citations

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Yuxuan Cai, Jiangning Zhang, Haoyang He et al.

ICCV 2025 • poster • arXiv:2410.16236 • 23 citations

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Haoran Lou, Chunxiao Fan, Ziyan Liu et al.

ICCV 2025 • poster • arXiv:2507.00505

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Boyu Chen, Zhengrong Yue, Siran Chen et al.

ICCV 2025 • poster • arXiv:2503.10200 • 21 citations

MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi et al.

ICLR 2025 • poster • arXiv:2409.15477 • 19 citations

MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds

Bingquan Dai, Luo Li, Qihong Tang et al.

NEURIPS 2025 • poster • arXiv:2508.14879 • 5 citations

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez et al.

CVPR 2025 • poster • arXiv:2503.13399 • 14 citations

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

Ziming Wei, Bingqian Lin, Zijian Jiao et al.

NEURIPS 2025 • poster • arXiv:2505.20148 • 1 citation

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

Mingxin Huang, Yuliang Liu, Dingkang Liang et al.

ICLR 2025 • poster • arXiv:2408.02034 • 22 citations

MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM

Bowen Dong, Minheng Ni, Zitong Huang et al.

NEURIPS 2025 • poster • arXiv:2505.24238 • 2 citations

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.

NEURIPS 2025 • oral • arXiv:2505.12826 • 1 citation

MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Jiaxin Huang, Runnan Chen, Ziwen Li et al.

NEURIPS 2025 • poster • arXiv:2503.18135 • 8 citations

MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.

NEURIPS 2025 • poster • arXiv:2506.01946 • 17 citations

MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

Xi Jiang, Jian Li, Hanqiu Deng et al.

ICLR 2025 • poster • arXiv:2410.09453 • 16 citations

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Tianhong Gao, Yannian Fu, Weiqun Wu et al.

ICCV 2025 • poster • arXiv:2507.21924 • 1 citation

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen et al.

NEURIPS 2025 • spotlight • arXiv:2306.13394 • 1237 citations

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.

ICLR 2025 • poster • arXiv:2411.02571 • 78 citations

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yunlong Tang, Pinxin Liu, Mingqian Feng et al.

NEURIPS 2025 • poster • arXiv:2505.20426 • 4 citations

MobileUse: A Hierarchical Reflection-Driven GUI Agent for Autonomous Mobile Operation

Ning Li, Xiangmou Qu, Jiamu Zhou et al.

NEURIPS 2025 • oral • 15 citations

MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei, Yu Miao, Dongzhan Zhou et al.

NEURIPS 2025 • oral • arXiv:2506.05191 • 1 citation

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Luo, Xue Yang, Wenhan Dou et al.

CVPR 2025 • poster • arXiv:2410.08202 • 68 citations

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain et al.

NEURIPS 2025 • poster • arXiv:2511.20490 • 1 citation

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Huanjin Yao, Jiaxing Huang, Wenhao Wu et al.

NEURIPS 2025 • spotlight • arXiv:2412.18319 • 102 citations

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Gang Liu, Michael Sun, Wojciech Matusik et al.

ICLR 2025 • poster • arXiv:2410.04223 • 19 citations

Multimodal LLM Guided Exploration and Active Mapping using Fisher Information

Wen Jiang, Boshu Lei, Katrina Ashton et al.

ICCV 2025 • poster • arXiv:2410.17422 • 9 citations

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.

ICCV 2025 • poster • arXiv:2507.21391 • 6 citations

Multimodal Tabular Reasoning with Privileged Structured Information

Jun-Peng Jiang, Yu Xia, Hai-Long Sun et al.

NEURIPS 2025 • poster • arXiv:2506.04088 • 6 citations

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NEURIPS 2025 • poster • arXiv:2511.07250 • 2 citations

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Zijia Zhao, Haoyu Lu, Yuqi Huo et al.

ICLR 2025 • oral • arXiv:2406.09367 • 15 citations

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun et al.

NEURIPS 2025 • poster • arXiv:2510.21122

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

Letian Zhang, Quan Cui, Bingchen Zhao et al.

ICCV 2025 • poster • arXiv:2503.08741 • 6 citations

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

CVPR 2025 • poster • arXiv:2506.18557 • 5 citations

ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

Yahan Tu, Rui Hu, Jitao Sang

CVPR 2025 • poster • arXiv:2409.09318 • 3 citations

OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li, Ge Zhang, Yinghao Ma et al.

NEURIPS 2025 • poster • arXiv:2409.15272 • 51 citations

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Qingyun Li, Zhe Chen, Weiyun Wang et al.

ICLR 2025 • poster • arXiv:2406.08418 • 48 citations

OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Cheng Luo, Jianghui Wang, Bing Li et al.

NEURIPS 2025 • poster • arXiv:2505.21724

Online Video Understanding: OVBench and VideoChat-Online

Zhenpeng Huang, Xinhao Li, Jiaqi Li et al.

CVPR 2025 • poster • arXiv:2501.00584 • 9 citations

OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

Zhongyu Xia, Jishuo Li, Zhiwei Lin et al.

NEURIPS 2025 • poster • arXiv:2411.17761 • 9 citations

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

CVPR 2025 • poster • arXiv:2411.18499 • 19 citations

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei, Liang Zhao, Jianjian Sun et al.

NEURIPS 2025 • poster • arXiv:2507.05255 • 14 citations

OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

Jinhong Wang, Shuo Tong, Jintai Chen et al.

ICCV 2025 • poster • arXiv:2504.04801

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Rui Xu, Dakuan Lu, Zicheng Zhao et al.

NEURIPS 2025 • spotlight • arXiv:2511.18450 • 2 citations

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Jingli Lin, Chenming Zhu, Runsen Xu et al.

NEURIPS 2025 • oral • arXiv:2507.07984 • 6 citations

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

Yangyu Huang, Tianyi Gao, Haoran Xu et al.

CVPR 2025 • poster • arXiv:2501.06184 • 6 citations

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Cong Chen, Mingyu Liu, Chenchen Jing et al.

ICLR 2025 • poster • arXiv:2503.06486 • 25 citations