"multimodal large language models" Papers

300 papers found • Page 2 of 6

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Xinyu Fang, Zhijian Chen, Kai Lan et al.

ICCV 2025 • poster • arXiv:2503.14478
12 citations

DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

Jiahe Zhao, Rongkun Zheng, Yi Wang et al.

ICCV 2025 • poster • arXiv:2507.10302

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Wenhui Liao, Jiapeng Wang, Hongliang Li et al.

CVPR 2025 • poster • arXiv:2408.15045
10 citations

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

Wenwen Yu, Zhibo Yang, Yuliang Liu et al.

ICCV 2025 • poster • arXiv:2508.08589
4 citations

Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback

Janet Wang, Yunbei Zhang, Zhengming Ding et al.

NeurIPS 2025 • poster • arXiv:2506.12323
2 citations

Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention

Xin Zou, Di Lu, Yizhou Wang et al.

NeurIPS 2025 • poster • arXiv:2510.02912
7 citations

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin, Xinyu Wei, Ruichuan An et al.

ICLR 2025 • poster • arXiv:2403.20271
86 citations

DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning

Qi Cao, Ruiyi Wang, Ruiyi Zhang et al.

NeurIPS 2025 • poster • arXiv:2505.20241
9 citations

DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving

Zhenhua Xu, Yan Bai, Yujia Zhang et al.

CVPR 2025 • highlight
19 citations

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Wenxuan Huang, Zijie Zhai, Yunhang Shen et al.

ICLR 2025 • poster • arXiv:2412.00876
38 citations

DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

Weihao Xuan, Junjue Wang, Heli Qi et al.

NeurIPS 2025 • oral • arXiv:2505.21076
8 citations

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang et al.

ICLR 2025 • poster • arXiv:2408.15998
116 citations

Effective Training Data Synthesis for Improving MLLM Chart Understanding

Yuwei Yang, Zeyu Zhang, Yunzhong Hou et al.

ICCV 2025 • poster • arXiv:2508.06492
11 citations

EgoBlind: Towards Egocentric Visual Assistance for the Blind

Junbin Xiao, Nanxin Huang, Hao Qiu et al.

NeurIPS 2025 • poster • arXiv:2503.08221
8 citations

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

Yuping He, Yifei Huang, Guo Chen et al.

NeurIPS 2025 • oral • arXiv:2507.18342
10 citations

EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

Sheng Zhou, Junbin Xiao, Qingyun Li et al.

CVPR 2025 • poster • arXiv:2502.07411
29 citations

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Baoqi Pei, Yifei Huang, Jilan Xu et al.

NeurIPS 2025 • oral • arXiv:2510.23569
4 citations

Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation

Jitesh Jain, Zhengyuan Yang, Humphrey Shi et al.

NeurIPS 2025 • poster • arXiv:2412.09585
4 citations

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Yunlong Tang, Daiki Shimada, Jing Bi et al.

AAAI 2025 • paper • arXiv:2403.16276
25 citations

Enhancing Multimodal Large Language Models Complex Reasoning via Similarity Computation

Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan et al.

AAAI 2025 • paper • arXiv:2412.09817

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang, Weiye Xu, Aijun Yang et al.

NeurIPS 2025 • poster • arXiv:2511.10648

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

Shraman Pramanick, Effrosyni Mavroudi, Yale Song et al.

ICCV 2025 • highlight • arXiv:2510.17023

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Jiani Huang, Amish Sethi, Matthew Kuo et al.

NeurIPS 2025 • oral • arXiv:2510.15963

Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

Shivam Duggal, Yushi Hu, Oscar Michel et al.

CVPR 2025 • poster • arXiv:2504.18509
6 citations

EventGPT: Event Stream Understanding with Multimodal Large Language Models

Shaoyu Liu, Jianing Li, Guanghui Zhao et al.

CVPR 2025 • poster • arXiv:2412.00832
9 citations

Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

Jiayi Guo, Junhao Zhao, Chaoqun Du et al.

CVPR 2025 • poster • arXiv:2406.04295

FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

Xiaoqin Wang, Xusen Ma, Xianxu Hou et al.

CVPR 2025 • poster • arXiv:2503.21457
8 citations

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

Renshan Zhang, Rui Shao, Gongwei Chen et al.

ICCV 2025 • poster • arXiv:2501.16297
11 citations

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Gen Luo, Yiyi Zhou, Yuxin Zhang et al.

ICLR 2025 • poster • arXiv:2403.03003
100 citations

FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Zichen Tang, Haihong E, Jiacheng Liu et al.

ICCV 2025 • poster • arXiv:2508.04625
2 citations

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Weihao Ye, Qiong Wu, Wenhao Lin et al.

AAAI 2025 • paper • arXiv:2409.10197
64 citations

Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models

Hai Yan, Haijian Ma, Xiaowen Cai et al.

NeurIPS 2025 • poster

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Bo Tong, Bokai Lai, Yiyi Zhou et al.

CVPR 2025 • poster • arXiv:2412.04317
4 citations

FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

Shengming Yuan, Xinyu Lyu, Shuailong Wang et al.

NeurIPS 2025 • poster • arXiv:2510.11190

FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers

Yanbing Zhang, Zhe Wang, Qin Zhou et al.

ICCV 2025 • poster • arXiv:2507.15249
1 citation

From Imitation to Innovation: The Emergence of AI's Unique Artistic Styles and the Challenge of Copyright Protection

Zexi Jia, Chuanwei Huang, Hongyan Fei et al.

ICCV 2025 • poster • arXiv:2507.04769
3 citations

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang, Zhuofan Zhang, Ziyu Zhu et al.

NeurIPS 2025 • poster • arXiv:2506.04897
1 citation

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Jin Wang, Yao Lai, Aoxue Li et al.

NeurIPS 2025 • spotlight • arXiv:2505.20147
20 citations

Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Kaihang Pan, Wang Lin, Zhongqi Yue et al.

CVPR 2025 • poster • arXiv:2504.14666
18 citations

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Shijie Ma, Yuying Ge, Teng Wang et al.

ICCV 2025 • poster • arXiv:2503.19480
9 citations

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Xudong Lu, Yinghao Chen, Renshou Wu et al.

ICCV 2025 • poster • arXiv:2503.06019

GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Fengxiang Wang, Mingshuo Chen, Yueying Li et al.

NeurIPS 2025 • spotlight • arXiv:2505.21375
11 citations

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Jiahui Gao, Renjie Pi, Jipeng Zhang et al.

ICLR 2025 • poster • arXiv:2312.11370
170 citations

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

Rongyao Fang, Chengqi Duan, Kun Wang et al.

NeurIPS 2025 • poster
60 citations

GraphGPT-o: Synergistic Multimodal Comprehension and Generation on Graphs

Yi Fang, Bowen Jin, Jiacheng Shen et al.

CVPR 2025 • poster • arXiv:2502.11925
3 citations

GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Xiaomeng Chu, Jiajun Deng, Guoliang You et al.

ICCV 2025 • poster • arXiv:2503.16013
2 citations

Grounding Multimodal Large Language Model in GUI World

Weixian Lei, Difei Gao, Mike Zheng Shou

ICLR 2025 • poster

Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

Kaiqing Lin, Zhiyuan Yan, Ke-Yue Zhang et al.

NeurIPS 2025 • poster • arXiv:2505.19582
2 citations

Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

Pengfei Zhao, Rongbo Luan, Wei Zhang et al.

NeurIPS 2025 • poster • arXiv:2506.06970
1 citation

Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Tianyi Bai, Yuxuan Fan, Jiantao Qiu et al.

NeurIPS 2025 • poster • arXiv:2506.07227
2 citations