"multimodal large language models" Papers
292 papers found • Page 1 of 6
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu et al.
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
Qiong Wu, Wenhao Lin, Yiyi Zhou et al.
ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking
Lequan Lin, Dai Shi, Andi Han et al.
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
Yiyang Du, Xiaochen Wang, Chi Chen et al.
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie et al.
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Zhucun Xue, Jiangning Zhang, Xurong Xie et al.
Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Xiaojun Jia, Sensen Gao, Simeng Qin et al.
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Xinyi Wang, Xun Yang, Yanlong Xu et al.
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Saaket Agashe, Jiuzhou Han, Shuyu Gan et al.
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.
ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation
Lingfeng Wang, Hualing Lin, Senda Chen et al.
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Junhao Cheng, Yuying Ge, Yixiao Ge et al.
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su, Yi Wang, Qiongyang Hu et al.
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Yunze Man, De-An Huang, Guilin Liu et al.
Assessing and Learning Alignment of Unimodal Vision and Language Models
Le Zhang, Qian Yang, Aishwarya Agrawal
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
Jean Park, Kuk Jin Jang, Basam Alasaly et al.
AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
Yi-Ting Shen, Sungmin Eum, Doheon Lee et al.
BadRobot: Jailbreaking Embodied LLM Agents in the Physical World
Hangtao Zhang, Chenyu Zhu, Xianlong Wang et al.
BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
Jianting Tang, Yubo Wang, Haoyu Cao et al.
BearLLM: A Prior Knowledge-Enhanced Bearing Health Management Framework with Unified Vibration Signal Representation
Haotian Peng, Jiawei Liu, Jinsong Du et al.
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
Minghe Gao, Xuqi Liu, Zhongqi Yue et al.
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li et al.
Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution
Wentao Tan, Qiong Cao, Yibing Zhan et al.
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Han Lin, Jaemin Cho, Amir Zadeh et al.
BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
Junyan Ye, Dongzhi Jiang, Jun He et al.
Boosting Knowledge Utilization in Multimodal Large Language Models via Adaptive Logits Fusion and Attention Reallocation
Wenbin An, Jiahao Nie, Feng Tian et al.
Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
Jiaer Xia, Bingkui Tong, Yuhang Zang et al.
Bridging Compressed Image Latents and Multimodal Large Language Models
Chia-Hao Kao, Cheng Chien, Yu-Jen Tseng et al.
Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai et al.
CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
Siyu Wang, Cailian Chen, Xinyi Le et al.
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Zhihang Liu, Chen-Wei Xie, Bin Wen et al.
CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
Junho Kim, Hyungjin Chung, Byung-Hoon Kim
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
Yunlong Tang, Gen Zhan, Li Yang et al.
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen, Yicheng Liu, Yifei Huang et al.
ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding
Muye Huang, Lingling Zhang, Jie Ma et al.
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Jinlan Fu, Shenzhen Huangfu, Hao Fei et al.
Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search
Haoran Sun, Yankai Jiang, Wenjie Lou et al.
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai, Jie Zhou, Xingjiao Wu et al.
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
Shengqiong Wu, Hao Fei, Liangming Pan et al.
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Wei Chen, Lin Li, Yongqi Yang et al.
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.
ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models
Yeji Park, Deokyeong Lee, Junsuk Choe et al.
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
Jingjing Jiang, Chongjie Si, Jun Luo et al.
Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
Jingjing Jiang, Chao Ma, Xurui Song et al.
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step
Zheyuan Liu, Munan Ning, Qihui Zhang et al.
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
Jiansheng Li, Xingxuan Zhang, Hao Zou et al.
Crafting Dynamic Virtual Activities with Advanced Multimodal Models
Changyang Li, Qingan Yan, Minyoung Kim et al.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Xinyu Fang, Zhijian Chen, Kai Lan et al.