"multimodal large language models" Papers

308 papers found • Page 4 of 7

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Tianhong Gao, Yannian Fu, Weiqun Wu et al.

ICCV 2025 • arXiv:2507.21924
1 citation

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen et al.

NEURIPS 2025 (Spotlight) • arXiv:2306.13394
1255 citations

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.

ICLR 2025 • arXiv:2411.02571
86 citations

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yunlong Tang, Pinxin Liu, Mingqian Feng et al.

NEURIPS 2025 • arXiv:2505.20426
4 citations

MobileUse: A Hierarchical Reflection-Driven GUI Agent for Autonomous Mobile Operation

Ning Li, Xiangmou Qu, Jiamu Zhou et al.

NEURIPS 2025 (Oral)
15 citations

MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei, Yu Miao, Dongzhan Zhou et al.

NEURIPS 2025 (Oral) • arXiv:2506.05191
1 citation

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Gen Luo, Xue Yang, Wenhan Dou et al.

CVPR 2025 • arXiv:2410.08202
68 citations

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain et al.

NEURIPS 2025 • arXiv:2511.20490
1 citation

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Huanjin Yao, Jiaxing Huang, Wenhao Wu et al.

NEURIPS 2025 (Spotlight) • arXiv:2412.18319
106 citations

Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

Junyan Lin, Haoran Chen, Yue Fan et al.

CVPR 2025 • arXiv:2503.06063
16 citations

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Gang Liu, Michael Sun, Wojciech Matusik et al.

ICLR 2025 • arXiv:2410.04223
22 citations

Multimodal LLM Guided Exploration and Active Mapping using Fisher Information

Wen Jiang, Boshu Lei, Katrina Ashton et al.

ICCV 2025 • arXiv:2410.17422
9 citations

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.

ICCV 2025 • arXiv:2507.21391
7 citations

Multimodal Tabular Reasoning with Privileged Structured Information

Jun-Peng Jiang, Yu Xia, Hai-Long Sun et al.

NEURIPS 2025 • arXiv:2506.04088
9 citations

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NEURIPS 2025 • arXiv:2511.07250
2 citations

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Zijia Zhao, Haoyu Lu, Yuqi Huo et al.

ICLR 2025 (Oral) • arXiv:2406.09367
15 citations

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun et al.

NEURIPS 2025 • arXiv:2510.21122
1 citation

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

Letian Zhang, Quan Cui, Bingchen Zhao et al.

ICCV 2025 • arXiv:2503.08741
8 citations

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

CVPR 2025 • arXiv:2506.18557
5 citations

ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

Yahan Tu, Rui Hu, Jitao Sang

CVPR 2025 • arXiv:2409.09318
3 citations

Olympus: A Universal Task Router for Computer Vision Tasks

Yuanze Lin, Yunsheng Li, Dongdong Chen et al.

CVPR 2025 (Highlight) • arXiv:2412.09612
3 citations

OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li, Ge Zhang, Yinghao Ma et al.

NEURIPS 2025 • arXiv:2409.15272
53 citations

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Qingyun Li, Zhe Chen, Weiyun Wang et al.

ICLR 2025 • arXiv:2406.08418
49 citations

OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Cheng Luo, Jianghui Wang, Bing Li et al.

NEURIPS 2025 • arXiv:2505.21724

Online Video Understanding: OVBench and VideoChat-Online

Zhenpeng Huang, Xinhao Li, Jiaqi Li et al.

CVPR 2025 • arXiv:2501.00584
12 citations

OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

Zhongyu Xia, Jishuo Li, Zhiwei Lin et al.

NEURIPS 2025 • arXiv:2411.17761
9 citations

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

CVPR 2025 • arXiv:2411.18499
20 citations

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei, Liang Zhao, Jianjian Sun et al.

NEURIPS 2025 • arXiv:2507.05255
14 citations

OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

Jinhong Wang, Shuo Tong, Jintai Chen et al.

ICCV 2025 • arXiv:2504.04801

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Rui Xu, Dakuan Lu, Zicheng Zhao et al.

NEURIPS 2025 (Spotlight) • arXiv:2511.18450
2 citations

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Jingli Lin, Chenming Zhu, Runsen Xu et al.

NEURIPS 2025 (Oral) • arXiv:2507.07984
7 citations

ParGo: Bridging Vision-Language with Partial and Global Views

An-Lan Wang, Bin Shan, Wei Shi et al.

AAAI 2025 (Paper) • arXiv:2408.12928
25 citations

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

Yangyu Huang, Tianyi Gao, Haoran Xu et al.

CVPR 2025 • arXiv:2501.06184
8 citations

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Cong Chen, Mingyu Liu, Chenchen Jing et al.

ICLR 2025 • arXiv:2503.06486
29 citations

Pilot: Building the Federated Multimodal Instruction Tuning Framework

Baochen Xiong, Xiaoshan Yang, Yaguang Song et al.

AAAI 2025 (Paper) • arXiv:2501.13985
4 citations

POSTA: A Go-to Framework for Customized Artistic Poster Generation

Haoyu Chen, Xiaojie Xu, Wenbo Li et al.

CVPR 2025 • arXiv:2503.14908
24 citations

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Linh Tran, Wei Sun, Stacy Patterson et al.

ICLR 2025 • arXiv:2501.13904
5 citations

Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.

ICCV 2025 • arXiv:2506.22139
23 citations

RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Rongyang Zhang, Yuqing Huang, Chengqiang Lu et al.

NEURIPS 2025 • arXiv:2512.05119

Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Yuanmin Tang, Jue Zhang, Xiaoting Qin et al.

CVPR 2025 (Highlight) • arXiv:2412.11077
18 citations

Revealing Multimodal Causality with Large Language Models

Jin Li, Shoujin Wang, Qi Zhang et al.

NEURIPS 2025 • arXiv:2509.17784

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li et al.

CVPR 2025 (Highlight) • arXiv:2405.17220
58 citations

RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case

Baihui Xiao, Chengjian Feng, Zhijian Huang et al.

ICCV 2025 • arXiv:2508.04642
3 citations

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

Fanhu Zeng, Haiyang Guo, Fei Zhu et al.

NEURIPS 2025 (Spotlight) • arXiv:2502.17159
9 citations

ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models

Heng Yin, Yuqiang Ren, Ke Yan et al.

CVPR 2025
8 citations

Routing Experts: Learning to Route Dynamic Experts in Existing Multi-modal Large Language Models

Qiong Wu, Zhaoxi Ke, Yiyi Zhou et al.

ICLR 2025
7 citations

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Shuhang Xun, Sicheng Tao, Jungang Li et al.

NEURIPS 2025 • arXiv:2505.02064
5 citations

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

Jiaming Ji, Xinyu Chen, Rui Pan et al.

NEURIPS 2025 • arXiv:2503.17682
9 citations

Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

Qingni Wang, Tiantian Geng, Zhiyuan Wang et al.

ICLR 2025 • arXiv:2410.08174
14 citations

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou, Yiheng Wang, Xuming He et al.

NEURIPS 2025 • arXiv:2506.10521
18 citations