"multimodal large language models" Papers
300 papers found • Page 6 of 6
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
Nilay Yilmaz, Maitreya Patel, Lawrence Luo et al.
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias et al.
Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning
Xiaoyu Yang, Jie Lu, En Yu
Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
Zinuo Li, Xian Zhang, Yongxin Guo et al.
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Hyungjoo Chae, Seonghwan Kim, Junhee Cho et al.
What Kind of Visual Tokens Do We Need? Training-Free Visual Token Pruning for Multi-Modal Large Language Models from the Perspective of Graph
Yutao Jiang, Qiong Wu, Wenhao Lin et al.
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
Yuci Liang, Xinheng Lyu, Meidan Ding et al.
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
Jian Ma, Qirong Peng, Xu Guo et al.
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Fengxiang Wang, Hongzhen Wang, Zonghao Guo et al.
You Only Communicate Once: One-shot Federated Low-Rank Adaptation of MLLM
Binqian Xu, Haiyang Mei, Zechen Bai et al.
Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
Xudong Li, Mengdan Zhang, Peixian Chen et al.
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang et al.
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
Wenbo Hu, Yifan Xu, Yi Li et al.
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Qilang Ye, Zitong Yu, Rui Shao et al.
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
Yixuan Wu, Yizhou Wang, Shixiang Tang et al.
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
Yichi Zhang, Yinpeng Dong, Siyuan Zhang et al.
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, Eldon Schoop et al.
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie Yang, Xuesong Niu, Nan Jiang et al.
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Zhikai Zhang, Yitang Li, Haofeng Huang et al.
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
Zhangyang Qi, Ye Fang, Zeyi Sun et al.
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma, Yi Jiang, Jiannan Wu et al.
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.
Grounding Language Models for Visual Entity Recognition
Zilin Xiao, Ming Gong, Paola Cascante-Bonilla et al.
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia, Dongchen Han, Yizeng Han et al.
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou et al.
Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning
Wei Li, Hehe Fan, Yongkang Wong et al.
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
Ryota Tanaka, Taichi Iki, Kyosuke Nishida et al.
Interactive Continual Learning: Fast and Slow Thinking
Biqing Qi, Xinquan Chen, Junqi Gao et al.
LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang
Yuqing Zhang, Hangqi Li, Shengyu Zhang et al.
LLMGA: Multimodal Large Language Model based Generation Assistant
Bin Xia, Shiyin Wang, Yingfan Tao et al.
Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning
Zhuo Huang, Chang Liu, Yinpeng Dong et al.
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
Xiaoqi Li, Mingxu Zhang, Yiran Geng et al.
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Dongping Chen, Ruoxi Chen, Shilin Zhang et al.
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
Xin Liu, Yichen Zhu, Jindong Gu et al.
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu, Hao Fei, Leigang Qu et al.
Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian Liu et al.
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li, Junfeng Wu, Weizhi Zhao et al.
PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology
Yuxuan Sun, Chenglu Zhu, Sunyi Zheng et al.
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee, Yiran Luo, Tejas Gokhale et al.
RoboMP²: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models
Qi Lv, Hao Li, Xiang Deng et al.
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
Kailin Li, Jingbo Wang, Lixin Yang et al.
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
Yuzhou Huang, Liangbin Xie, Xintao Wang et al.
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang, Yiming Ren, Haowen Luo et al.
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Sipeng Zheng, Bohan Zhou, Yicheng Feng et al.
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Yang Jin, Zhicheng Sun, Kun Xu et al.
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Hao Fei, Shengqiong Wu, Wei Ji et al.
VIGC: Visual Instruction Generation and Correction
Théo Delemazure, Jérôme Lang, Grzegorz Pierczyński
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Xing Han Lù, Zdeněk Kasner, Siva Reddy
When Do We Not Need Larger Vision Models?
Baifeng Shi, Ziyang Wu, Maolin Mao et al.
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
Swetha Sirnam, Jinyu Yang, Tal Neiman et al.