2025 Oral Papers Matching "multimodal large language models"
25 papers found
Boosting Knowledge Utilization in Multimodal Large Language Models via Adaptive Logits Fusion and Attention Reallocation
Wenbin An, Jiahao Nie, Feng Tian et al.
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding
Weihao Xuan, Junjue Wang, Heli Qi et al.
EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
Yuping He, Yifei Huang, Guo Chen et al.
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Baoqi Pei, Yifei Huang, Jilan Xu et al.
ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
Jiani Huang, Amish Sethi, Matthew Kuo et al.
Improve Temporal Reasoning in Multimodal Large Language Models via Video Contrastive Decoding
Daiqing Qi, Dongliang Guo, Hanzhang Yuan et al.
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Minsoo Kim, Kyuhong Shim, Jungwook Choi et al.
Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.
MobileUse: A Hierarchical Reflection-Driven GUI Agent for Autonomous Mobile Operation
Ning Li, Xiangmou Qu, Jiamu Zhou et al.
MokA: Multimodal Low-Rank Adaptation for MLLMs
Yake Wei, Yu Miao, Dongzhan Zhou et al.
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Zijia Zhao, Haoyu Lu, Yuqi Huo et al.
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
Jingli Lin, Chenming Zhu, Runsen Xu et al.
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
Pengteng Li, Pinhao Song, Wuyang Li et al.
StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Xiangyu Zeng, Kefan Qiu, Qingyu Zhang et al.
TAU-106K: A New Dataset for Comprehensive Understanding of Traffic Accident
Yixuan Zhou, Long Bai, Sijia Cai et al.
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Yunheng Li, Jing Cheng, Shaoyong Jia et al.
Unhackable Temporal Reward for Scalable Video MLLMs
En Yu, Kangheng Lin, Liang Zhao et al.
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
Zeqian Li, Shangzhe Di, Zhonghua Zhai et al.
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Zaiquan Yang, Yuhao Liu, Gerhard Hancke et al.
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Ziang Yan, Yinan He, Xinhao Li et al.
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li et al.
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning
Qi Wang, Yanrui Yu, Ye Yuan et al.
Vid-SME: Membership Inference Attacks against Large Video Understanding Models
Qi Li, Runpeng Yu, Xinchao Wang
Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning
Xiaoyu Yang, Jie Lu, En Yu
Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
Zinuo Li, Xian Zhang, Yongxin Guo et al.