Poster Papers: "multimodal large language models"

228 papers found • Page 1 of 5

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu et al.

CVPR 2025 · poster · arXiv:2412.00556
16 citations

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Qiong Wu, Wenhao Lin, Yiyi Zhou et al.

NeurIPS 2025 · poster · arXiv:2411.19628
5 citations

ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

Lequan Lin, Dai Shi, Andi Han et al.

NeurIPS 2025 · poster · arXiv:2511.09833

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

Yiyang Du, Xiaochen Wang, Chi Chen et al.

CVPR 2025 · poster · arXiv:2503.23733
5 citations

Adaptive Keyframe Sampling for Long Video Understanding

Xi Tang, Jihao Qiu, Lingxi Xie et al.

CVPR 2025 · poster · arXiv:2502.21271
68 citations

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

Zhucun Xue, Jiangning Zhang, Xurong Xie et al.

NeurIPS 2025 · poster · arXiv:2506.13589
7 citations

Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Xiaojun Jia, Sensen Gao, Simeng Qin et al.

NeurIPS 2025 · poster · arXiv:2505.21494
17 citations

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Xinyi Wang, Xun Yang, Yanlong Xu et al.

NeurIPS 2025 · poster · arXiv:2511.10017
1 citation

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Saaket Agashe, Jiuzhou Han, Shuyu Gan et al.

ICLR 2025 · poster · arXiv:2410.08164
102 citations

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.

ICCV 2025 · poster · arXiv:2507.02664
13 citations

ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Lingfeng Wang, Hualing Lin, Senda Chen et al.

NeurIPS 2025 · poster · arXiv:2505.16495
2 citations

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Junhao Cheng, Yuying Ge, Yixiao Ge et al.

ICCV 2025 · poster · arXiv:2504.01014
5 citations

ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

Yuejiao Su, Yi Wang, Qiongyang Hu et al.

CVPR 2025 · poster · arXiv:2504.01472
4 citations

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man, De-An Huang, Guilin Liu et al.

CVPR 2025 · poster · arXiv:2505.23766
19 citations

AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

Yi-Ting Shen, Sungmin Eum, Doheon Lee et al.

ICCV 2025 · poster · arXiv:2503.22884

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

Hangtao Zhang, Chenyu Zhu, Xianlong Wang et al.

ICLR 2025 · poster
6 citations

BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

Jianting Tang, Yubo Wang, Haoyu Cao et al.

ICCV 2025 · poster · arXiv:2508.06895
1 citation

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Minghe Gao, Xuqi Liu, Zhongqi Yue et al.

ICCV 2025 · poster · arXiv:2504.06606
10 citations

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

Qizhe Zhang, Mengzhen Liu, Lichen Li et al.

NeurIPS 2025 · poster · arXiv:2506.10967
19 citations

Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.

ICCV 2025 · poster · arXiv:2411.14401
11 citations

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Han Lin, Jaemin Cho, Amir Zadeh et al.

NeurIPS 2025 · poster · arXiv:2508.05954
6 citations

BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

Junyan Ye, Dongzhi Jiang, Jun He et al.

NeurIPS 2025 · poster · arXiv:2510.09361
2 citations

Bridging Compressed Image Latents and Multimodal Large Language Models

Chia-Hao Kao, Cheng Chien, Yu-Jen Tseng et al.

ICLR 2025 · poster · arXiv:2407.19651
6 citations

Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu, Enxin Song, Wenhao Chai et al.

ICCV 2025 · poster · arXiv:2507.02591
6 citations

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu, Chen-Wei Xie, Bin Wen et al.

NeurIPS 2025 · poster · arXiv:2502.14914
3 citations

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

Junho Kim, Hyungjin Chung, Byung-Hoon Kim

ICCV 2025 · poster · arXiv:2411.06869
1 citation

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Guo Chen, Yicheng Liu, Yifei Huang et al.

ICLR 2025 · poster · arXiv:2412.12075
41 citations

ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

Muye Huang, Lingling Zhang, Jie Ma et al.

NeurIPS 2025 · poster · arXiv:2505.19076
5 citations

CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs

Jinlan Fu, Shenzhen Huangfu, Hao Fei et al.

ICLR 2025 · poster · arXiv:2501.16629
19 citations

Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search

Haoran Sun, Yankai Jiang, Wenjie Lou et al.

NeurIPS 2025 · poster · arXiv:2506.16962
6 citations

CompCap: Improving Multimodal Large Language Models with Composite Captions

Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.

ICCV 2025 · poster · arXiv:2412.05243
6 citations

Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.

ICCV 2025 · poster · arXiv:2508.11616

CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

Zheyuan Liu, Munan Ning, Qihui Zhang et al.

NeurIPS 2025 · poster · arXiv:2507.04451
4 citations

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Xinyu Fang, Zhijian Chen, Kai Lan et al.

ICCV 2025 · poster · arXiv:2503.14478
12 citations

DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

Jiahe Zhao, Rongkun Zheng, Yi Wang et al.

ICCV 2025 · poster · arXiv:2507.10302

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Wenhui Liao, Jiapeng Wang, Hongliang Li et al.

CVPR 2025 · poster · arXiv:2408.15045
10 citations

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

Wenwen Yu, Zhibo Yang, Yuliang Liu et al.

ICCV 2025 · poster · arXiv:2508.08589
4 citations

Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback

Janet Wang, Yunbei Zhang, Zhengming Ding et al.

NeurIPS 2025 · poster · arXiv:2506.12323
2 citations

Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention

Xin Zou, Di Lu, Yizhou Wang et al.

NeurIPS 2025 · poster · arXiv:2510.02912
7 citations

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin, Xinyu Wei, Ruichuan An et al.

ICLR 2025 · poster · arXiv:2403.20271
86 citations

DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning

Qi Cao, Ruiyi Wang, Ruiyi Zhang et al.

NeurIPS 2025 · poster · arXiv:2505.20241
9 citations

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Wenxuan Huang, Zijie Zhai, Yunhang Shen et al.

ICLR 2025 · poster · arXiv:2412.00876
38 citations

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang et al.

ICLR 2025 · poster · arXiv:2408.15998
116 citations

Effective Training Data Synthesis for Improving MLLM Chart Understanding

Yuwei Yang, Zeyu Zhang, Yunzhong Hou et al.

ICCV 2025 · poster · arXiv:2508.06492
11 citations

EgoBlind: Towards Egocentric Visual Assistance for the Blind

Junbin Xiao, Nanxin Huang, Hao Qiu et al.

NeurIPS 2025 · poster · arXiv:2503.08221
8 citations

EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

Sheng Zhou, Junbin Xiao, Qingyun Li et al.

CVPR 2025 · poster · arXiv:2502.07411
29 citations

Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation

Jitesh Jain, Zhengyuan Yang, Humphrey Shi et al.

NeurIPS 2025 · poster · arXiv:2412.09585
4 citations

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang, Weiye Xu, Aijun Yang et al.

NeurIPS 2025 · poster · arXiv:2511.10648

Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

Shivam Duggal, Yushi Hu, Oscar Michel et al.

CVPR 2025 · poster · arXiv:2504.18509
6 citations

EventGPT: Event Stream Understanding with Multimodal Large Language Models

Shaoyu Liu, Jianing Li, Guanghui Zhao et al.

CVPR 2025 · poster · arXiv:2412.00832
9 citations