2025 "vision-language models" Papers

321 papers found • Page 7 of 7

Vision Transformers Don't Need Trained Registers

Nicholas Jiang, Amil Dravid, Alexei Efros et al.

NeurIPS 2025 (spotlight) · arXiv:2506.08010
12 citations

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NeurIPS 2025 (poster) · arXiv:2509.15235
2 citations

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Shi Yu, Chaoyue Tang, Bokai Xu et al.

ICLR 2025 (poster) · arXiv:2410.10594
121 citations

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar et al.

ICLR 2025 (poster) · arXiv:2405.15683
15 citations

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Minheng Ni, YuTao Fan, Lei Zhang et al.

ICLR 2025 (poster) · arXiv:2410.03321
20 citations

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025 (poster) · arXiv:2503.01785
347 citations

VladVA: Discriminative Fine-tuning of LVLMs

Yassine Ouali, Adrian Bulat, Alexandros Xenos et al.

CVPR 2025 (poster) · arXiv:2412.04378
11 citations

VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Ruifei Zhang, Wei Zhang, Xiao Tan et al.

ICCV 2025 (poster) · arXiv:2511.06256
4 citations

VLMaterial: Procedural Material Generation with Large Vision-Language Models

Beichen Li, Rundi Wu, Armando Solar-Lezama et al.

ICLR 2025 (poster) · arXiv:2501.18623
5 citations

VLMs can Aggregate Scattered Training Patches

Zhanhui Zhou, Lingjie Chen, Chao Yang et al.

NeurIPS 2025 (poster) · arXiv:2506.03614

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NeurIPS 2025 (spotlight) · arXiv:2504.08837
175 citations

Vocabulary-Guided Gait Recognition

Panjian Huang, Saihui Hou, Chunshui Cao et al.

NeurIPS 2025 (poster)

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng et al.

NeurIPS 2025 (poster) · arXiv:2509.25033
2 citations

Weakly-Supervised Learning of Dense Functional Correspondences

Stefan Stojanov, Linan Zhao, Yunzhi Zhang et al.

ICCV 2025 (poster) · arXiv:2509.03893

Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem, Piotr Teterwak, Kate Saenko et al.

ICCV 2025 (poster) · arXiv:2503.13652
2 citations

What Makes a Maze Look Like a Maze?

Joy Hsu, Jiayuan Mao, Joshua B Tenenbaum et al.

ICLR 2025 (poster) · arXiv:2409.08202
13 citations

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Junwei Luo, Yingying Zhang, Xue Yang et al.

ICCV 2025 (poster) · arXiv:2503.07588
14 citations

Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

Yangfu Li, Hongjian Zhan, Tianyi Chen et al.

NeurIPS 2025 (poster) · arXiv:2505.10118
1 citation

Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Ge Zheng, Jiaye Qian, Jiajin Tang et al.

ICCV 2025 (poster) · arXiv:2510.20229
6 citations

Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Ailin Deng, Tri Cao, Zhirui Chen et al.

CVPR 2025 (poster) · arXiv:2503.02199
33 citations

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

CVPR 2025 (highlight) · arXiv:2503.06287
31 citations