"vision language models" Papers

36 papers found

Aligning Effective Tokens with Video Anomaly in Large Language Models

Yingxian Chen, Jiahui Liu, Ruidi Fan et al.

ICCV 2025 · poster · arXiv:2508.06350
1 citation

Are Large Vision Language Models Good Game Players?

Xinyu Wang, Bohan Zhuang, Qi Wu

ICLR 2025 · poster · arXiv:2503.02358
13 citations

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu et al.

CVPR 2025 · poster · arXiv:2501.03225
21 citations

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Kaiyuan Li, Xiaoyue Chen, Chen Gao et al.

NeurIPS 2025 · poster · arXiv:2505.22038
4 citations

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Zeyi Huang, Yuyang Ji, Xiaofang Wang et al.

CVPR 2025 · poster · arXiv:2501.04336
7 citations

Can We Talk Models Into Seeing the World Differently?

Paul Gavrikov, Jovita Lukasik, Steffen Jung et al.

ICLR 2025 · poster · arXiv:2403.09193
15 citations

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu et al.

ICLR 2025 · poster · arXiv:2407.01449
91 citations

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Yi Ding, Bolian Li, Ruqi Zhang

ICLR 2025 · poster · arXiv:2410.06625
42 citations

FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Tianyu Fu, Tengxuan Liu, Qinghao Han et al.

ICCV 2025 · poster · arXiv:2501.01986
22 citations

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

Haolong Yan, Yeqing Shen, Xin Huang et al.

NeurIPS 2025 · poster · arXiv:2512.02423

HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Eunkyu Park, Minyeong Kim, Gunhee Kim

CVPR 2025 · poster · arXiv:2506.10286
3 citations

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang, Wenliang Zheng, Aashrith Madasu et al.

ICCV 2025 · poster · arXiv:2504.18406

Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search

Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo et al.

NeurIPS 2025 · poster · arXiv:2501.19252
20 citations

Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

Yudong Liu, Jingwei Sun, Yueqian Lin et al.

ICCV 2025 · poster · arXiv:2503.10742
6 citations

Knowledge Transfer from Interaction Learning

Yilin Gao, Kangyi Chen, Zhongxing Peng et al.

ICCV 2025 · poster · arXiv:2509.18733

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025 · poster

MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Bohan Zhou, Yi Zhan, Zhongbin Zhang et al.

NeurIPS 2025 · oral · arXiv:2505.16602
3 citations

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Yicheng Xiao, Lin Song, Yukang Chen et al.

NeurIPS 2025 · poster · arXiv:2505.13031
18 citations

MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu, Hao Fei, Yuhui Zhang et al.

NeurIPS 2025 · poster · arXiv:2509.25851

OOD-Barrier: Build a Middle-Barrier for Open-Set Single-Image Test Time Adaptation via Vision Language Models

Boyang Peng, Sanqing Qu, Tianpei Zou et al.

NeurIPS 2025 · poster

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Wei Suo, Ji Ma, Mengyang Sun et al.

ICCV 2025 · poster · arXiv:2412.06458
1 citation

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025 · poster · arXiv:2409.14485
144 citations

Vision Language Models are In-Context Value Learners

Yecheng Jason Ma, Joey Hejna, Chuyuan Fu et al.

ICLR 2025 · oral · arXiv:2411.04549
43 citations

VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

Xindi Yang, Baolu Li, Yiming Zhang et al.

ICCV 2025 · poster · arXiv:2503.23368
17 citations

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Wenbo Hu, Yifan Xu, Yi Li et al.

AAAI 2024 · paper · arXiv:2308.09936
190 citations

Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs

Jeongkee Lim, Yusung Kim

ECCV 2024 · poster · arXiv:2408.02261
5 citations

Detecting and Preventing Hallucinations in Large Vision Language Models

Anisha Gunjal, Jihan Yin, Erhan Bas

AAAI 2024 · paper · arXiv:2308.06394
256 citations

Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View

Jin Wang, Shichao Dong, Yapeng Zhu et al.

ICML 2024 · poster

LCA-on-the-Line: Benchmarking Out of Distribution Generalization with Class Taxonomies

Jia Shi, Gautam Rajendrakumar Gare, Jinjin Tian et al.

ICML 2024 · poster

Leveraging VLM-Based Pipelines to Annotate 3D Objects

Rishabh Kabra, Loic Matthey, Alexander Lerchner et al.

ICML 2024 · poster

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.

ECCV 2024 · poster · arXiv:2312.03766
17 citations

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Soroush Nasiriany, Fei Xia, Wenhao Yu et al.

ICML 2024 · poster

ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection

Joonhyun Jeong, Geondo Park, Jayeon Yoo et al.

AAAI 2024 · paper · arXiv:2312.07266
16 citations

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Yufei Wang, Zhanyi Sun, Jesse Zhang et al.

ICML 2024 · poster

Soft Prompt Generation for Domain Generalization

Shuanghao Bai, Yuedi Zhang, Wanqi Zhou et al.

ECCV 2024 · poster · arXiv:2404.19286
30 citations

TrojVLM: Backdoor Attack Against Vision Language Models

Weimin Lyu, Lu Pang, Tengfei Ma et al.

ECCV 2024 · poster · arXiv:2409.19232
23 citations