Poster "vision-language models" Papers

475 papers found • Page 7 of 10

Understanding Co-speech Gestures in-the-wild

Sindhu Hegde, K R Prajwal, Taein Kwon et al.

ICCV 2025 • arXiv:2503.22668 • 2 citations

Understanding Museum Exhibits using Vision-Language Reasoning

Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca et al.

ICCV 2025 • arXiv:2412.01370 • 1 citation

Unified Reinforcement and Imitation Learning for Vision-Language Models

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro et al.

NeurIPS 2025 • arXiv:2510.19307 • 2 citations

Unlearning the Noisy Correspondence Makes CLIP More Robust

Haochen Han, Alex Jinpeng Wang, Peijun Ye et al.

ICCV 2025 • arXiv:2507.03434 • 1 citation

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Xiao Zhang, Fei Wei, Yong Wang et al.

ICCV 2025 • arXiv:2507.00721

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge, Ziyi Chen, Jintao Lin et al.

ICCV 2025 • arXiv:2412.09616 • 17 citations

VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

Silin Cheng, Kai Han

NeurIPS 2025 • arXiv:2511.22664 • 1 citation

VCA: Video Curious Agent for Long Video Understanding

Zeyuan Yang, Delin Chen, Xueyang Yu et al.

ICCV 2025 • arXiv:2412.10471 • 31 citations

VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

Run Luo, Renke Shan, Longze Chen et al.

NeurIPS 2025

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Ryota Tanaka, Taichi Iki, Taku Hasegawa et al.

CVPR 2025 • arXiv:2504.09795 • 27 citations

VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

Muchao Ye, Weiyang Liu, Pan He

CVPR 2025 • arXiv:2412.01095 • 10 citations

Verbalized Representation Learning for Interpretable Few-Shot Generalization

Cheng-Fu Yang, Da Yin, Wenbo Hu et al.

ICCV 2025 • arXiv:2411.18651 • 1 citation

VideoAuteur: Towards Long Narrative Video Generation

Junfei Xiao, Feng Cheng, Lu Qi et al.

ICCV 2025 • arXiv:2501.06173 • 9 citations

VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj et al.

NeurIPS 2025 • arXiv:2505.15952 • 4 citations

VideoGEM: Training-free Action Grounding in Videos

Felix Vogel, Walid Bousselham, Anna Kukleva et al.

CVPR 2025 • arXiv:2503.20348

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Zongxia Li, Xiyang Wu, Guangyao Shi et al.

NeurIPS 2025 • arXiv:2505.01481 • 15 citations

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang, Xiufeng Song, Heng Zhou et al.

NeurIPS 2025 • arXiv:2506.09049 • 9 citations

ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez et al.

ICCV 2025 • arXiv:2507.07620 • 2 citations

VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

Xueqing Wu, Yuheng Ding, Bingxuan Li et al.

CVPR 2025 • arXiv:2412.02172 • 13 citations

VisionArena: 230k Real World User-VLM Conversations with Preference Labels

Christopher Chou, Lisa Dunlap, Wei-Lin Chiang et al.

CVPR 2025 • arXiv:2412.08687 • 15 citations

Vision-Language Model IP Protection via Prompt-based Learning

Lianyu Wang, Meng Wang, Huazhu Fu et al.

CVPR 2025 • arXiv:2503.02393

Vision-Language Models Can't See the Obvious

Yasser Abdelaziz Dahou Djilali, Ngoc Huynh, Phúc Lê Khắc et al.

ICCV 2025 • arXiv:2507.04741 • 7 citations

Vision-Language Models Do Not Understand Negation

Kumail Alhamoud, Shaden Alshammari, Yonglong Tian et al.

CVPR 2025 • arXiv:2501.09425 • 38 citations

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.

NeurIPS 2025 • arXiv:2507.07104 • 2 citations

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NeurIPS 2025 • arXiv:2509.15235 • 2 citations

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Shi Yu, Chaoyue Tang, Bokai Xu et al.

ICLR 2025 • arXiv:2410.10594 • 127 citations

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar et al.

ICLR 2025 • arXiv:2405.15683 • 17 citations

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Minheng Ni, Yutao Fan, Lei Zhang et al.

ICLR 2025 • arXiv:2410.03321 • 20 citations

Visual Persona: Foundation Model for Full-Body Human Customization

Jisu Nam, Soowon Son, Zhan Xu et al.

CVPR 2025 • arXiv:2503.15406 • 6 citations

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025 • arXiv:2503.01785 • 357 citations

VladVA: Discriminative Fine-tuning of LVLMs

Yassine Ouali, Adrian Bulat, Alexandros Xenos et al.

CVPR 2025 • arXiv:2412.04378 • 11 citations

VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Ruifei Zhang, Wei Zhang, Xiao Tan et al.

ICCV 2025 • arXiv:2511.06256 • 5 citations

VLMaterial: Procedural Material Generation with Large Vision-Language Models

Beichen Li, Rundi Wu, Armando Solar-Lezama et al.

ICLR 2025 • arXiv:2501.18623 • 5 citations

VLMs can Aggregate Scattered Training Patches

Zhanhui Zhou, Lingjie Chen, Chao Yang et al.

NeurIPS 2025 • arXiv:2506.03614

Vocabulary-Guided Gait Recognition

Panjian Huang, Saihui Hou, Chunshui Cao et al.

NeurIPS 2025

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng et al.

NeurIPS 2025 • arXiv:2509.25033 • 4 citations

Weakly-Supervised Learning of Dense Functional Correspondences

Stefan Stojanov, Linan Zhao, Yunzhi Zhang et al.

ICCV 2025 • arXiv:2509.03893

Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem, Piotr Teterwak, Kate Saenko et al.

ICCV 2025 • arXiv:2503.13652 • 2 citations

What Makes a Maze Look Like a Maze?

Joy Hsu, Jiayuan Mao, Joshua B Tenenbaum et al.

ICLR 2025 • arXiv:2409.08202 • 13 citations

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Junwei Luo, Yingying Zhang, Xue Yang et al.

ICCV 2025 • arXiv:2503.07588 • 14 citations

Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

Yangfu Li, Hongjian Zhan, Tianyi Chen et al.

NeurIPS 2025 • arXiv:2505.10118 • 1 citation

Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Ge Zheng, Jiaye Qian, Jiajin Tang et al.

ICCV 2025 • arXiv:2510.20229 • 8 citations

Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Ailin Deng, Tri Cao, Zhirui Chen et al.

CVPR 2025 • arXiv:2503.02199 • 34 citations

A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed et al.

CVPR 2024 • arXiv:2312.12730 • 65 citations

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Yanting Yang, Minghao Chen, Qibo Qiu et al.

ECCV 2024 • arXiv:2407.14872 • 5 citations

Adaptive Multi-task Learning for Few-shot Object Detection

Yan Ren, Yanling Li, Wai-Kin Adams Kong

ECCV 2024 • 6 citations

Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models

Mengyu Zheng, Yehui Tang, Zhiwei Hao et al.

ECCV 2024 • 6 citations

Adversarial Prompt Tuning for Vision-Language Models

Jiaming Zhang, Xingjun Ma, Xin Wang et al.

ECCV 2024 • arXiv:2311.11261 • 34 citations

Amend to Alignment: Decoupled Prompt Tuning for Mitigating Spurious Correlation in Vision-Language Models

Jie Zhang, Xiaosong Ma, Song Guo et al.

ICML 2024

A Multimodal Automated Interpretability Agent

Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang et al.

ICML 2024 • arXiv:2404.14394 • 45 citations