"multimodal large language models" Papers
212 papers found • Page 3 of 5
MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM
Bowen Dong, Minheng Ni, Zitong Huang et al.
Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
JIANFENG CAI, Jiale Hong, Zongmeng Zhang et al.
MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation
Jiaxin Huang, Runnan Chen, Ziwen Li et al.
MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
Tianhong Gao, Yannian Fu, Weiqun Wu et al.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen et al.
MM-EMBED: UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS
Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.
MobileUse: A Hierarchical Reflection-Driven GUI Agent for Autonomous Mobile Operation
Ning Li, Xiangmou Qu, Jiamu Zhou et al.
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology
Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain et al.
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Huanjin Yao, Jiaxing Huang, Wenhao Wu et al.
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning
Gang Liu, Michael Sun, Wojciech Matusik et al.
Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
Wen Jiang, BOSHU LEI, Katrina Ashton et al.
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.
Multimodal Tabular Reasoning with Privileged Structured Information
Jun-Peng Jiang, Yu Xia, Hai-Long Sun et al.
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
Longtian Qiu, Shan Ning, Jiaxuan Sun et al.
Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Sung Jin Um, Dongjin Kim, Sangmin Lee et al.
ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
Yahan Tu, Rui Hu, Jitao Sang
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li, Zhe Chen, Weiyun Wang et al.
Online Video Understanding: OVBench and VideoChat-Online
Zhenpeng Huang, Xinhao Li, Jiaqi Li et al.
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Yana Wei, Liang Zhao, Jianjian Sun et al.
OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM
Jinhong Wang, Shuo Tong, Jintai CHEN et al.
ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
Rui Xu, Dakuan Lu, Zicheng Zhao et al.
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs
Yangyu Huang, Tianyi Gao, Haoran Xu et al.
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Cong Chen, Mingyu Liu, Chenchen Jing et al.
POSTA: A Go-to Framework for Customized Artistic Poster Generation
Haoyu Chen, Xiaojie Xu, Wenbo Li et al.
Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models
Linh Tran, Wei Sun, Stacy Patterson et al.
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.
RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering
Rongyang Zhang, Yuqing Huang, Chengqiang Lu et al.
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
Yuanmin Tang, Jue Zhang, Xiaoting Qin et al.
Revealing Multimodal Causality with Large Language Models
Jin Li, Shoujin Wang, Qi Zhang et al.
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li et al.
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
ShuHang Xun, Sicheng Tao, Jungang Li et al.
Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
Jiaming Ji, Xinyu Chen, Rui Pan et al.
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
Qingni Wang, Tiantian Geng, Zhiyuan Wang et al.
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou, Yiheng Wang, Xuming He et al.
ScImage: How good are multimodal large language models at scientific text-to-image generation?
Leixin Zhang, Steffen Eger, Yinjie Cheng et al.
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs
Jinhong Deng, Wen Li, Joey Tianyi Zhou et al.
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
feilong tang, Chengzhi Liu, Zhongxing Xu et al.
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
Pengteng Li, Pinhao Song, Wuyang Li et al.
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
Qihang Fan, Huaibo Huang, Mingrui Chen et al.
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu, Zikai Song, Na Feng et al.
Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model
Ruiping Liu, Junwei Zheng, Yufan Chen et al.
SketchAgent: Language-Driven Sequential Sketch Generation
Yael Vinker, Tamar Rott Shaham, Kristine Zheng et al.
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng, Ziyuan Huang, Kaixiang Ji et al.
SMMILE: An expert-driven benchmark for multimodal medical in-context learning
Melanie Rieff, Maya Varma, Ossian Rabow et al.
SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
Ziqi Wang, Chang Che, Qi Wang et al.
SpatialLM: Training Large Language Models for Structured Indoor Modeling
Yongsen Mao, Junhao Zhong, Chuan Fang et al.
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
Haotian Xia, Zhengbang Yang, Junbo Zou et al.
StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Xiangyu Zeng, Kefan Qiu, Qingyu Zhang et al.