"multimodal large language models" Papers

212 papers found • Page 3 of 5

MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM

Bowen Dong, Minheng Ni, Zitong Huang et al.

NeurIPS 2025 • poster • arXiv:2505.24238
2 citations

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.

NeurIPS 2025 • oral • arXiv:2505.12826
1 citation

MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Jiaxin Huang, Runnan Chen, Ziwen Li et al.

NeurIPS 2025 • poster • arXiv:2503.18135
8 citations

MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.

NeurIPS 2025 • poster • arXiv:2506.01946
17 citations

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Tianhong Gao, Yannian Fu, Weiqun Wu et al.

ICCV 2025 • poster • arXiv:2507.21924
1 citation

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen et al.

NeurIPS 2025 • spotlight • arXiv:2306.13394
1237 citations

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.

ICLR 2025 • poster • arXiv:2411.02571
78 citations

MobileUse: A Hierarchical Reflection-Driven GUI Agent for Autonomous Mobile Operation

Ning Li, Xiangmou Qu, Jiamu Zhou et al.

NeurIPS 2025 • oral
15 citations

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain et al.

NeurIPS 2025 • poster • arXiv:2511.20490
1 citation

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Huanjin Yao, Jiaxing Huang, Wenhao Wu et al.

NeurIPS 2025 • spotlight • arXiv:2412.18319
102 citations

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Gang Liu, Michael Sun, Wojciech Matusik et al.

ICLR 2025 • poster • arXiv:2410.04223
19 citations

Multimodal LLM Guided Exploration and Active Mapping using Fisher Information

Wen Jiang, Boshu Lei, Katrina Ashton et al.

ICCV 2025 • poster • arXiv:2410.17422
9 citations

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.

ICCV 2025 • poster • arXiv:2507.21391
6 citations

Multimodal Tabular Reasoning with Privileged Structured Information

Jun-Peng Jiang, Yu Xia, Hai-Long Sun et al.

NeurIPS 2025 • poster • arXiv:2506.04088
6 citations

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NeurIPS 2025 • poster • arXiv:2511.07250
2 citations

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun et al.

NeurIPS 2025 • poster • arXiv:2510.21122

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

CVPR 2025 • poster • arXiv:2506.18557
5 citations

ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

Yahan Tu, Rui Hu, Jitao Sang

CVPR 2025 • poster • arXiv:2409.09318
3 citations

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Qingyun Li, Zhe Chen, Weiyun Wang et al.

ICLR 2025 • poster • arXiv:2406.08418
48 citations

Online Video Understanding: OVBench and VideoChat-Online

Zhenpeng Huang, Xinhao Li, Jiaqi Li et al.

CVPR 2025 • poster • arXiv:2501.00584
9 citations

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei, Liang Zhao, Jianjian Sun et al.

NeurIPS 2025 • poster • arXiv:2507.05255
14 citations

OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

Jinhong Wang, Shuo Tong, Jintai Chen et al.

ICCV 2025 • poster • arXiv:2504.04801

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Rui Xu, Dakuan Lu, Zicheng Zhao et al.

NeurIPS 2025 • spotlight • arXiv:2511.18450
2 citations

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

Yangyu Huang, Tianyi Gao, Haoran Xu et al.

CVPR 2025 • poster • arXiv:2501.06184
6 citations

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Cong Chen, Mingyu Liu, Chenchen Jing et al.

ICLR 2025 • poster • arXiv:2503.06486
25 citations

POSTA: A Go-to Framework for Customized Artistic Poster Generation

Haoyu Chen, Xiaojie Xu, Wenbo Li et al.

CVPR 2025 • poster • arXiv:2503.14908
23 citations

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Linh Tran, Wei Sun, Stacy Patterson et al.

ICLR 2025 • poster • arXiv:2501.13904
5 citations

Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.

ICCV 2025 • poster • arXiv:2506.22139
23 citations

RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Rongyang Zhang, Yuqing Huang, Chengqiang Lu et al.

NeurIPS 2025 • poster • arXiv:2512.05119

Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Yuanmin Tang, Jue Zhang, Xiaoting Qin et al.

CVPR 2025 • highlight • arXiv:2412.11077
15 citations

Revealing Multimodal Causality with Large Language Models

Jin Li, Shoujin Wang, Qi Zhang et al.

NeurIPS 2025 • poster • arXiv:2509.17784

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li et al.

CVPR 2025 • highlight • arXiv:2405.17220
54 citations

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

ShuHang Xun, Sicheng Tao, Jungang Li et al.

NeurIPS 2025 • poster • arXiv:2505.02064
5 citations

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

Jiaming Ji, Xinyu Chen, Rui Pan et al.

NeurIPS 2025 • poster • arXiv:2503.17682
8 citations

Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

Qingni Wang, Tiantian Geng, Zhiyuan Wang et al.

ICLR 2025 • poster • arXiv:2410.08174

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou, Yiheng Wang, Xuming He et al.

NeurIPS 2025 • poster • arXiv:2506.10521
15 citations

ScImage: How good are multimodal large language models at scientific text-to-image generation?

Leixin Zhang, Steffen Eger, Yinjie Cheng et al.

ICLR 2025 • poster • arXiv:2412.02368
4 citations

SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs

Jinhong Deng, Wen Li, Joey Tianyi Zhou et al.

NeurIPS 2025 • poster • arXiv:2510.24214

Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Feilong Tang, Chengzhi Liu, Zhongxing Xu et al.

CVPR 2025 • poster • arXiv:2505.16652
22 citations

See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Pengteng Li, Pinhao Song, Wuyang Li et al.

NeurIPS 2025 • oral • arXiv:2509.16087
1 citation

Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens

Qihang Fan, Huaibo Huang, Mingrui Chen et al.

ICCV 2025 • poster • arXiv:2405.13337
3 citations

SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding

Yangliu Hu, Zikai Song, Na Feng et al.

CVPR 2025 • poster • arXiv:2504.07745
11 citations

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Ruiping Liu, Junwei Zheng, Yufan Chen et al.

NeurIPS 2025 • poster • arXiv:2510.11509

SketchAgent: Language-Driven Sequential Sketch Generation

Yael Vinker, Tamar Rott Shaham, Kristine Zheng et al.

CVPR 2025 • poster • arXiv:2411.17673
17 citations

Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

Weili Zeng, Ziyuan Huang, Kaixiang Ji et al.

ICCV 2025 • poster • arXiv:2503.21817
4 citations

SMMILE: An expert-driven benchmark for multimodal medical in-context learning

Melanie Rieff, Maya Varma, Ossian Rabow et al.

NeurIPS 2025 • poster • arXiv:2506.21355
3 citations

SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning

Ziqi Wang, Chang Che, Qi Wang et al.

ICCV 2025 • poster • arXiv:2411.13949
3 citations

SpatialLM: Training Large Language Models for Structured Indoor Modeling

Yongsen Mao, Junhao Zhong, Chuan Fang et al.

NeurIPS 2025 • poster • arXiv:2506.07491
21 citations

SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models

Haotian Xia, Zhengbang Yang, Junbo Zou et al.

ICLR 2025 • poster • arXiv:2410.08474
13 citations

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang et al.

NeurIPS 2025 • oral • arXiv:2509.24871
3 citations