"multimodal large language models" Papers
308 papers found • Page 5 of 7
ScImage: How good are multimodal large language models at scientific text-to-image generation?
Leixin Zhang, Steffen Eger, Yinjie Cheng et al.
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs
Jinhong Deng, Wen Li, Joey Tianyi Zhou et al.
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Feilong Tang, Chengzhi Liu, Zhongxing Xu et al.
Seeking and Updating with Live Visual Knowledge
Mingyang Fu, Yuyang Peng, Dongping Chen et al.
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
Pengteng Li, Pinhao Song, Wuyang Li et al.
SegLLM: Multi-round Reasoning Segmentation with Large Language Models
Xudong Wang, Shaolun Zhang, Shufan Li et al.
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
Qihang Fan, Huaibo Huang, Mingrui Chen et al.
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu, Zikai Song, Na Feng et al.
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Qianhao Yuan, Qingyu Zhang, Yanjiang Liu et al.
Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model
Ruiping Liu, Junwei Zheng, Yufan Chen et al.
SketchAgent: Language-Driven Sequential Sketch Generation
Yael Vinker, Tamar Rott Shaham, Kristine Zheng et al.
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng, Ziyuan Huang, Kaixiang Ji et al.
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
Ying Chen, Guoan Wang, Yuanfeng Ji et al.
SMMILE: An expert-driven benchmark for multimodal medical in-context learning
Melanie Rieff, Maya Varma, Ossian Rabow et al.
SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
Ziqi Wang, Chang Che, Qi Wang et al.
SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs
Zhicheng Li, Shuoming Zhang, Jiacheng Zhao et al.
SpatialLM: Training Large Language Models for Structured Indoor Modeling
Yongsen Mao, Junhao Zhong, Chuan Fang et al.
Spatially-aware Weights Tokenization for NeRF-Language Models
Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti et al.
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
Haotian Xia, Zhengbang Yang, Junbo Zou et al.
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Yun Li, Yiming Zhang, Tao Lin et al.
StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Xiangyu Zeng, Kefan Qiu, Qingyu Zhang et al.
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
Fangrui Zhu, Hanhui Wang, Yiming Xie et al.
Structure-Aware Cooperative Ensemble Evolutionary Optimization on Combinatorial Problems with Multimodal Large Language Models
Jie Zhao, Kang Cheong
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Hao Li, Changyao Tian, Jie Shao et al.
Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
Bowen Wang, Zhouqiang Jiang, Yasuaki Susumu et al.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Ziang Yan, Zhilin Li, Yinan He et al.
TAU-106K: A New Dataset for Comprehensive Understanding of Traffic Accident
Yixuan Zhou, Long Bai, Sijia Cai et al.
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Yunheng Li, Jing Cheng, Shaoyong Jia et al.
Text4Seg: Reimagining Image Segmentation as Text Generation
Mengcheng Lan, Chaofeng Chen, Yue Zhou et al.
TextToucher: Fine-Grained Text-to-Touch Generation
Jiahang Tu, Hao Fu, Fengyu Yang et al.
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?
Hao Yin, Guangzong Si, Zilei Wang
The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique Like Photographers
Daiqing Qi, Handong Zhao, Jing Shi et al.
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma et al.
un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Yinqi Li, Jiahe Zhao, Hong Chang et al.
Unhackable Temporal Reward for Scalable Video MLLMs
En Yu, Kangheng Lin, Liang Zhao et al.
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
Zeqian Li, Shangzhe Di, Zhonghua Zhai et al.
Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
Yunqi Hong, Sohyun An, Andrew Bai et al.
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Zaiquan Yang, Yuhao Liu, Gerhard Hancke et al.
Unlocking Multimodal Mathematical Reasoning via Process Reward Model
Ruilin Luo, Zhuofan Zheng, Lei Wang et al.
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Yexin Liu, Zhengyang Liang, Yueze Wang et al.
Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
Zhixuan Li, Hyunse Yoon, Sanghoon Lee et al.
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Jing Bi, Lianggong Bruce Wen, Zhang Liu et al.
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
Qihui Zhang, Munan Ning, Zheyuan Liu et al.
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Yuanxin Liu, Rui Zhu, Shuhuai Ren et al.
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Yunlong Tang, Junjia Guo, Hang Hua et al.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Duo Zheng, Shijia Huang, Liwei Wang
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Ziang Yan, Yinan He, Xinhao Li et al.
Video Perception Models for 3D Scene Synthesis
Rui Huang, Guangyao Zhai, Zuria Bauer et al.
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li et al.
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning
Qi Wang, Yanrui Yu, Ye Yuan et al.