Poster "multimodal large language models" Papers
228 papers found • Page 1 of 5
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu et al.
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
Qiong Wu, Wenhao Lin, Yiyi Zhou et al.
ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking
Lequan Lin, Dai Shi, Andi Han et al.
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
Yiyang Du, Xiaochen Wang, Chi Chen et al.
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie et al.
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Zhucun Xue, Jiangning Zhang, Xurong Xie et al.
Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Xiaojun Jia, Sensen Gao, Simeng Qin et al.
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Xinyi Wang, Xun Yang, Yanlong Xu et al.
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Saaket Agashe, Jiuzhou Han, Shuyu Gan et al.
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.
ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation
Lingfeng Wang, Hualing Lin, Senda Chen et al.
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Junhao Cheng, Yuying Ge, Yixiao Ge et al.
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su, Yi Wang, Qiongyang Hu et al.
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Yunze Man, De-An Huang, Guilin Liu et al.
AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
Yi-Ting Shen, Sungmin Eum, Doheon Lee et al.
BadRobot: Jailbreaking Embodied LLM Agents in the Physical World
Hangtao Zhang, Chenyu Zhu, Xianlong Wang et al.
BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
Jianting Tang, Yubo Wang, Haoyu Cao et al.
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
Minghe Gao, Xuqi Liu, Zhongqi Yue et al.
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li et al.
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Han Lin, Jaemin Cho, Amir Zadeh et al.
BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
Junyan Ye, Dongzhi Jiang, Jun He et al.
Bridging Compressed Image Latents and Multimodal Large Language Models
Chia-Hao Kao, Cheng Chien, Yu-Jen Tseng et al.
Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai et al.
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Zhihang Liu, Chen-Wei Xie, Bin Wen et al.
CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
Junho Kim, Hyungjin Chung, Byung-Hoon Kim
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen, Yicheng Liu, Yifei Huang et al.
ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding
Muye Huang, Lingling Zhang, Jie Ma et al.
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Jinlan Fu, Shenzhen Huangfu, Hao Fei et al.
Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search
Haoran Sun, Yankai Jiang, Wenjie Lou et al.
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step
Zheyuan Liu, Munan Ning, Qihui Zhang et al.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Xinyu Fang, Zhijian Chen, Kai Lan et al.
DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
Jiahe Zhao, Rongkun Zheng, Yi Wang et al.
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Wenhui Liao, Jiapeng Wang, Hongliang Li et al.
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu, Zhibo Yang, Yuliang Liu et al.
Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback
Janet Wang, Yunbei Zhang, Zhengming Ding et al.
Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou, Di Lu, Yizhou Wang et al.
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin, Xinyu Wei, Ruichuan An et al.
DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning
Qi Cao, Ruiyi Wang, Ruiyi Zhang et al.
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Wenxuan Huang, Zijie Zhai, Yunhang Shen et al.
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi, Fuxiao Liu, Shihao Wang et al.
Effective Training Data Synthesis for Improving MLLM Chart Understanding
Yuwei Yang, Zeyu Zhang, Yunzhong Hou et al.
EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao, Nanxin Huang, Hao Qiu et al.
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Sheng Zhou, Junbin Xiao, Qingyun Li et al.
Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Jitesh Jain, Zhengyuan Yang, Humphrey Shi et al.
Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
Jiahao Wang, Weiye Xu, Aijun Yang et al.
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
Shivam Duggal, Yushi Hu, Oscar Michel et al.
EventGPT: Event Stream Understanding with Multimodal Large Language Models
Shaoyu Liu, Jianing Li, Guanghui Zhao et al.