Poster "multimodal large language models" Papers
228 papers found • Page 5 of 5
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Qilang Ye, Zitong Yu, Rui Shao et al.
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
Yixuan Wu, Yizhou Wang, Shixiang Tang et al.
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, Eldon Schoop et al.
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie Yang, Xuesong Niu, Nan Jiang et al.
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Zhikai Zhang, Yitang Li, Haofeng Huang et al.
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma, Yi Jiang, Jiannan Wu et al.
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.
Grounding Language Models for Visual Entity Recognition
Zilin Xiao, Ming Gong, Paola Cascante-Bonilla et al.
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia, Dongchen Han, Yizeng Han et al.
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou et al.
Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning
Wei Li, Hehe Fan, Yongkang Wong et al.
Interactive Continual Learning: Fast and Slow Thinking
Biqing Qi, Xinquan Chen, Junqi Gao et al.
LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang
Yuqing Zhang, Hangqi Li, Shengyu Zhang et al.
LLMGA: Multimodal Large Language Model based Generation Assistant
Bin Xia, Shiyin Wang, Yingfan Tao et al.
Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning
Zhuo Huang, Chang Liu, Yinpeng Dong et al.
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
Xiaoqi Li, Mingxu Zhang, Yiran Geng et al.
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Dongping Chen, Ruoxi Chen, Shilin Zhang et al.
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
Xin Liu, Yichen Zhu, Jindong Gu et al.
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu, Hao Fei, Leigang Qu et al.
Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian Liu et al.
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li, Junfeng Wu, Weizhi Zhao et al.
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee, Yiran Luo, Tejas Gokhale et al.
RoboMP²: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models
Qi Lv, Hao Li, Xiang Deng et al.
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
Kailin Li, Jingbo Wang, Lixin Yang et al.
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang, Yiming Ren, Haowen Luo et al.
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Sipeng Zheng, Bohan Zhou, Yicheng Feng et al.
When Do We Not Need Larger Vision Models?
Baifeng Shi, Ziyang Wu, Maolin Mao et al.
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
Swetha Sirnam, Jinyu Yang, Tal Neiman et al.