2025 "vision language models" Papers

29 papers found

Aligning Effective Tokens with Video Anomaly in Large Language Models

YINGXIAN Chen, Jiahui Liu, Ruidi Fan et al.

ICCV 2025posterarXiv:2508.06350
1
citations

Are Large Vision Language Models Good Game Players?

Xinyu Wang, Bohan Zhuang, Qi Wu

ICLR 2025posterarXiv:2503.02358
13
citations

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Xubing Ye, Yukang Gan, Yixiao Ge et al.

CVPR 2025posterarXiv:2412.00447
31
citations

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu et al.

CVPR 2025posterarXiv:2501.03225
21
citations

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

kaiyuan Li, Xiaoyue Chen, Chen Gao et al.

NeurIPS 2025posterarXiv:2505.22038
4
citations

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Zeyi Huang, Yuyang Ji, Xiaofang Wang et al.

CVPR 2025posterarXiv:2501.04336
7
citations

Can We Talk Models Into Seeing the World Differently?

Paul Gavrikov, Jovita Lukasik, Steffen Jung et al.

ICLR 2025posterarXiv:2403.09193
15
citations

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu et al.

ICLR 2025posterarXiv:2407.01449
91
citations

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Yi Ding, Bolian Li, Ruqi Zhang

ICLR 2025posterarXiv:2410.06625
42
citations

FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Tianyu Fu, Tengxuan Liu, Qinghao Han et al.

ICCV 2025posterarXiv:2501.01986
22
citations

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

Haolong Yan, Yeqing Shen, Xin Huang et al.

NeurIPS 2025posterarXiv:2512.02423

HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Eunkyu Park, Minyeong Kim, Gunhee Kim

CVPR 2025posterarXiv:2506.10286
3
citations

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang, Wenliang Zheng, Aashrith Madasu et al.

ICCV 2025posterarXiv:2504.18406

Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search

Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo et al.

NeurIPS 2025posterarXiv:2501.19252
20
citations

Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

Yudong Liu, Jingwei Sun, Yueqian Lin et al.

ICCV 2025posterarXiv:2503.10742
6
citations

Knowledge Transfer from Interaction Learning

Yilin Gao, Kangyi Chen, Zhongxing Peng et al.

ICCV 2025posterarXiv:2509.18733

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025poster

MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Bohan Zhou, Yi Zhan, Zhongbin Zhang et al.

NeurIPS 2025oralarXiv:2505.16602
3
citations

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Yicheng Xiao, Lin Song, Yukang Chen et al.

NeurIPS 2025posterarXiv:2505.13031
18
citations

MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu, Hao Fei, Yuhui Zhang et al.

NeurIPS 2025posterarXiv:2509.25851

OOD-Barrier: Build a Middle-Barrier for Open-Set Single-Image Test Time Adaptation via Vision Language Models

Boyang Peng, Sanqing Qu, Tianpei Zou et al.

NeurIPS 2025poster

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Wei Suo, Ji Ma, Mengyang Sun et al.

ICCV 2025posterarXiv:2412.06458
1
citations

Semantic Discrepancy-aware Detector for Image Forgery Identification

Wang Ziye, Minghang Yu, Chunyan Xu et al.

ICCV 2025posterarXiv:2508.12341

SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes

Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan et al.

NeurIPS 2025posterarXiv:2506.20990
1
citations

SpiritSight Agent: Advanced GUI Agent with One Look

Zhiyuan Huang, Ziming Cheng, Junting Pan et al.

CVPR 2025posterarXiv:2503.03196
11
citations

Systematic Reward Gap Optimization for Mitigating VLM Hallucinations

Lehan He, Zeren Chen, Zhelun Shi et al.

NeurIPS 2025posterarXiv:2411.17265
3
citations

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025posterarXiv:2409.14485
144
citations

Vision Language Models are In-Context Value Learners

Yecheng Jason Ma, Joey Hejna, Chuyuan Fu et al.

ICLR 2025oralarXiv:2411.04549
43
citations

VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

Xindi Yang, Baolu Li, Yiming Zhang et al.

ICCV 2025posterarXiv:2503.23368
17
citations