2025 "vision-language models" Papers
321 papers found • Page 7 of 7
Vision Transformers Don't Need Trained Registers
Nicholas Jiang, Amil Dravid, Alexei Efros et al.
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Jialiang Kang, Han Shu, Wenshuo Li et al.
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar et al.
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Minheng Ni, Yutao Fan, Lei Zhang et al.
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang et al.
VladVA: Discriminative Fine-tuning of LVLMs
Yassine Ouali, Adrian Bulat, Alexandros Xenos et al.
VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
Ruifei Zhang, Wei Zhang, Xiao Tan et al.
VLMaterial: Procedural Material Generation with Large Vision-Language Models
Beichen Li, Rundi Wu, Armando Solar-Lezama et al.
VLMs can Aggregate Scattered Training Patches
Zhanhui Zhou, Lingjie Chen, Chao Yang et al.
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang et al.
Vocabulary-Guided Gait Recognition
Panjian Huang, Saihui Hou, Chunshui Cao et al.
VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Wenhao Li, Qiangchang Wang, Xianjing Meng et al.
Weakly-Supervised Learning of Dense Functional Correspondences
Stefan Stojanov, Linan Zhao, Yunzhi Zhang et al.
Web Artifact Attacks Disrupt Vision Language Models
Maan Qraitem, Piotr Teterwak, Kate Saenko et al.
What Makes a Maze Look Like a Maze?
Joy Hsu, Jiayuan Mao, Joshua B Tenenbaum et al.
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo, Yingying Zhang, Xue Yang et al.
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering
Yangfu Li, Hongjian Zhan, Tianyi Chen et al.
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Ge Zheng, Jiaye Qian, Jiajin Tang et al.
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Ailin Deng, Tri Cao, Zhirui Chen et al.
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.