2025 "multimodal large language models" Papers

241 papers found • Page 5 of 5

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Hao Li, Changyao Tian, Jie Shao et al.

CVPR 2025 poster • arXiv:2412.09604
35 citations

Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

Bowen Wang, Zhouqiang Jiang, Yasuaki Susumu et al.

ICCV 2025 poster • arXiv:2506.17589
1 citation

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Ziang Yan, Zhilin Li, Yinan He et al.

CVPR 2025 poster • arXiv:2412.19326
19 citations

TAU-106K: A New Dataset for Comprehensive Understanding of Traffic Accident

Yixuan Zhou, Long Bai, Sijia Cai et al.

ICLR 2025 oral
3 citations

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Yunheng Li, Jing Cheng, Shaoyong Jia et al.

NeurIPS 2025 oral • arXiv:2509.18056
6 citations

Text4Seg: Reimagining Image Segmentation as Text Generation

Mengcheng Lan, Chaofeng Chen, Yue Zhou et al.

ICLR 2025 poster • arXiv:2410.09855
34 citations

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

Hao Yin, Guangzong Si, Zilei Wang

NeurIPS 2025 poster • arXiv:2504.10020

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma et al.

ICCV 2025 poster • arXiv:2410.10491

un²CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Yinqi Li, Jiahe Zhao, Hong Chang et al.

NeurIPS 2025 poster • arXiv:2505.24517
1 citation

Unhackable Temporal Reward for Scalable Video MLLMs

En Yu, Kangheng Lin, Liang Zhao et al.

ICLR 2025 oral • arXiv:2502.12081
1 citation

Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

Zeqian Li, Shangzhe Di, Zhonghua Zhai et al.

NeurIPS 2025 oral • arXiv:2506.18883
9 citations

Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Yunqi Hong, Sohyun An, Andrew Bai et al.

NeurIPS 2025 poster • arXiv:2506.03195
1 citation

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Zaiquan Yang, Yuhao Liu, Gerhard Hancke et al.

NeurIPS 2025 oral • arXiv:2509.15178
2 citations

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

Ruilin Luo, Zhuofan Zheng, Lei Wang et al.

NeurIPS 2025 poster • arXiv:2501.04686
29 citations

Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly

Yexin Liu, Zhengyang Liang, Yueze Wang et al.

CVPR 2025 poster • arXiv:2406.10638
19 citations

Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

Zhixuan Li, Hyunse Yoon, Sanghoon Lee et al.

ICCV 2025 poster • arXiv:2503.10225
3 citations

Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

Jing Bi, Lianggong Bruce Wen, Zhang Liu et al.

CVPR 2025 poster • arXiv:2412.18108
18 citations

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Qihui Zhang, Munan Ning, Zheyuan Liu et al.

CVPR 2025 poster • arXiv:2503.14941
2 citations

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Yuanxin Liu, Rui Zhu, Shuhuai Ren et al.

NeurIPS 2025 poster • arXiv:2503.09949
2 citations

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Yunlong Tang, Junjia Guo, Hang Hua et al.

CVPR 2025 poster • arXiv:2411.10979
16 citations

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Duo Zheng, Shijia Huang, Liwei Wang

CVPR 2025 poster • arXiv:2412.00493
65 citations

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Ziang Yan, Yinan He, Xinhao Li et al.

NeurIPS 2025 oral • arXiv:2509.21100
13 citations

Video Perception Models for 3D Scene Synthesis

Rui Huang, Guangyao Zhai, Zuria Bauer et al.

NeurIPS 2025 poster • arXiv:2506.20601
5 citations

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li et al.

NeurIPS 2025 oral • arXiv:2503.21776
236 citations

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

Qi Wang, Yanrui Yu, Ye Yuan et al.

NeurIPS 2025 oral • arXiv:2505.12434
30 citations

Video Summarization with Large Language Models

Min Jung Lee, Dayoung Gong, Minsu Cho

CVPR 2025 poster • arXiv:2504.11199
8 citations

Vid-SME: Membership Inference Attacks against Large Video Understanding Models

Qi Li, Runpeng Yu, Xinchao Wang

NeurIPS 2025 oral • arXiv:2506.03179
5 citations

Vision Function Layer in Multimodal LLMs

Cheng Shi, Yizhou Yu, Sibei Yang

NeurIPS 2025 poster • arXiv:2509.24791
4 citations

Visual Instruction Bottleneck Tuning

Changdae Oh, Jiatong Li, Shawn Im et al.

NeurIPS 2025 poster • arXiv:2505.13946
2 citations

VisualLens: Personalization through Task-Agnostic Visual History

Wang Bill Zhu, Deqing Fu, Kai Sun et al.

NeurIPS 2025 poster • arXiv:2411.16034

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu, Haojia Lin, Xiong Wang et al.

NeurIPS 2025 spotlight • arXiv:2501.01957
130 citations

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

Nilay Yilmaz, Maitreya Patel, Lawrence Luo et al.

ICLR 2025 poster • arXiv:2503.00043
1 citation

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias et al.

ICCV 2025 poster • arXiv:2510.14672
2 citations

Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

Xiaoyu Yang, Jie Lu, En Yu

NeurIPS 2025 oral
6 citations

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li, Xian Zhang, Yongxin Guo et al.

NeurIPS 2025 oral • arXiv:2505.18110
3 citations

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Seonghwan Kim, Junhee Cho et al.

NeurIPS 2025 spotlight • arXiv:2505.15277
8 citations

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Yuci Liang, Xinheng Lyu, Meidan Ding et al.

ICCV 2025 poster • arXiv:2412.02141
10 citations

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Jian Ma, Qirong Peng, Xu Guo et al.

ICCV 2025 poster • arXiv:2503.06134
5 citations

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang, Hongzhen Wang, Zonghao Guo et al.

CVPR 2025 highlight • arXiv:2503.23771
24 citations

You Only Communicate Once: One-shot Federated Low-Rank Adaptation of MLLM

Binqian Xu, Haiyang Mei, Zechen Bai et al.

NeurIPS 2025 poster

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Xudong Li, Mengdan Zhang, Peixian Chen et al.

NeurIPS 2025 poster • arXiv:2505.22396
1 citation