"multimodal large language models" Papers
300 papers found • Page 3 of 6
Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu, Tianyue Ou, Yifan Song et al.
HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator
Fan Yang, Ru Zhen, Jianing Wang et al.
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
Ma Teng, Xiaojun Jia, Ranjie Duan et al.
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
Hao Zhou, Zhanning Gao, Zhili Chen et al.
HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
Kun Liu, Qi Liu, Xinchen Liu et al.
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang
How Can Objects Help Video-Language Understanding?
Zitian Tang, Shijie Wang, Junho Cho et al.
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification
Yang Qin, Chao Chen, Zhihang Fu et al.
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren et al.
IDEA-Bench: How Far are Generative Models from Professional Designing?
Chen Liang, Lianghua Huang, Jingwu Fang et al.
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
Jiaqi Liao, Zhengyuan Yang, Linjie Li et al.
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo, Chuanhao Yan, Xingqian Xu et al.
Improve Temporal Reasoning in Multimodal Large Language Models via Video Contrastive Decoding
Daiqing Qi, Dongliang Guo, Hanzhang Yuan et al.
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Minsoo Kim, Kyuhong Shim, Jungwook Choi et al.
InsightEdit: Towards Better Instruction Following for Image Editing
Yingjing Xu, Jie Kong, Jiazhi Wang et al.
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin, Hanjia Lyu, Xian Xu et al.
Instruction-based Image Manipulation by Watching How Things Move
Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng et al.
Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks
Lehan Wang, Haonan Wang, Honglong Yang et al.
Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models
Guosheng Zhang, Keyao Wang, Haixiao Yue et al.
Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs
Barrett Tang, Zile Huang, Chengzhi Liu et al.
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
JiHyeok Jung, EunTae Kim, SeoYeon Kim et al.
Is Your Multimodal Language Model Oversensitive to Safe Queries?
Xirui Li, Hengguang Zhou, Ruochen Wang et al.
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Hao Cheng, Erjia Xiao, Jing Shao et al.
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.
Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
Kaihang Pan, Yang Wu, Wendong Bu et al.
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Junsung Park, Jungbeom Lee, Jongyoon Song et al.
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, Shijia Huang, Yanyang Li et al.
Learning to Instruct for Visual Instruction Tuning
Zhihan Zhou, Feng Hong, Jiaan Luo et al.
Lie Detector: Unified Backdoor Detection via Cross-Examination Framework
Xuan Wang, Siyuan Liang, Dongping Liao et al.
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Yuxuan Cai, Jiangning Zhang, Haoyang He et al.
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
Haoran Lou, Chunxiao Fan, Ziyan Liu et al.
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen, Zhengrong Yue, Siran Chen et al.
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi et al.
MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
Bingquan Dai, Luo Li, Qihong Tang et al.
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier et al.
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez et al.
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents
Ziming Wei, Bingqian Lin, Zijian Jiao et al.
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid
Mingxin Huang, Yuliang Liu, Dingkang Liang et al.
MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM
Bowen Dong, Minheng Ni, Zitong Huang et al.
Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.
Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment
Pritam Sarkar, Sayna Ebrahimi, Ali Etemad et al.
MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation
Jiaxin Huang, Runnan Chen, Ziwen Li et al.
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara et al.
MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.
MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection
Xi Jiang, Jian Li, Hanqiu Deng et al.
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
Tianhong Gao, Yannian Fu, Weiqun Wu et al.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen et al.
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Yunlong Tang, Pinxin Liu, Mingqian Feng et al.