2025 Poster Papers: "multimodal large language models"
92 papers found • Page 1 of 2
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu et al.
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
Qiong Wu, Wenhao Lin, Yiyi Zhou et al.
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie et al.
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Xue Zhucun, Jiangning Zhang, Xie Xurong et al.
Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Xiaojun Jia, Sensen Gao, Simeng Qin et al.
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.
ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation
Lingfeng Wang, Hualing Lin, Senda Chen et al.
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Yunze Man, De-An Huang, Guilin Liu et al.
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Han Lin, Jaemin Cho, Amir Zadeh et al.
BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
Junyan Ye, Dongzhi Jiang, Jun He et al.
CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
Junho Kim, Hyungjin Chung, Byung-Hoon Kim
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen, Yicheng Liu, Yifei Huang et al.
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Xinyu Fang, Zhijian Chen, Kai Lan et al.
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Wenhui Liao, Jiapeng Wang, Hongliang Li et al.
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu, Zhibo Yang, Yuliang Liu et al.
Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou, Di Lu, Yizhou Wang et al.
DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning
Qi Cao, Ruiyi Wang, Ruiyi Zhang et al.
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi, Fuxiao Liu, Shihao Wang et al.
Effective Training Data Synthesis for Improving MLLM Chart Understanding
Yuwei Yang, Zeyu Zhang, Yunzhong Hou et al.
EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao, Nanxin Huang, Hao Qiu et al.
Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
Jiahao Wang, Weiye Xu, Aijun Yang et al.
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
Shivam Duggal, Yushi Hu, Oscar Michel et al.
EventGPT: Event Stream Understanding with Multimodal Large Language Models
Shaoyu Liu, Jianing Li, Guanghui Zhao et al.
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
Jiayi Guo, Junhao Zhao, Chaoqun Du et al.
FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging
Zichen Tang, Haihong E, Jiacheng Liu et al.
Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models
Hai Yan, Haijian Ma, Xiaowen Cai et al.
FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
Yanbing Zhang, Zhe Wang, Qin Zhou et al.
GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
Yi Fang, Bowen Jin, Jiacheng Shen et al.
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Xiaomeng Chu, Jiajun Deng, Guoliang You et al.
Grounding Multimodal Large Language Model in GUI World
Weixian Lei, Difei Gao, Mike Zheng Shou
Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu, Tianyue Ou, Yifan Song et al.
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
Ma Teng, Xiaojun Jia, Ranjie Duan et al.
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
Hao Zhou, Zhanning Gao, Zhili Chen et al.
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang
IDEA-Bench: How Far are Generative Models from Professional Designing?
Chen Liang, Lianghua Huang, Jingwu Fang et al.
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo, Chuanhao Yan, Xingqian Xu et al.
Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks
Lehan Wang, Haonan Wang, Honglong Yang et al.
Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs
Barrett Tang, Zile Huang, Chengzhi Liu et al.
Is Your Multimodal Language Model Oversensitive to Safe Queries?
Xirui Li, Hengguang Zhou, Ruochen Wang et al.
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.
Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
Kaihang Pan, Yang Wu, Wendong Bu et al.
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, Shijia Huang, Yanyang Li et al.
Learning to Instruct for Visual Instruction Tuning
Zhihan Zhou, Feng Hong, Jiaan Luo et al.
Lie Detector: Unified Backdoor Detection via Cross-Examination Framework
Xuan Wang, Siyuan Liang, Dongping Liao et al.
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Yuxuan Cai, Jiangning Zhang, Haoyang He et al.
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
Haoran Lou, Chunxiao Fan, Ziyan Liu et al.
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen, Zhengrong Yue, Siran Chen et al.
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi et al.
MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM
Bowen Dong, Minheng Ni, Zitong Huang et al.