2025 "vision-language models" Papers
321 papers found • Page 6 of 7
SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning
Xin Hu, Ke Qin, Guiduo Duan et al.
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models
Kevin Miller, Aditya Gangrade, Samarth Mishra et al.
Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation
Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed et al.
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
Wufei Ma, Yu-Cheng Chou, Qihao Liu et al.
Statistics Caching Test-Time Adaptation for Vision-Language Models
Zenghao Guan, Yucan Zhou, Wu Liu et al.
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Yong Liu, Song-Li Wu, Sule Bai et al.
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Divya Velayudhan, Abdelfatah Ahmed, Mohamad Alansari et al.
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving
Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin et al.
Synthetic Data is an Elegant GIFT for Continual Vision-Language Models
Bin Wu, Wuxuan Shi, Jinqiao Wang et al.
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
Yifei Qian, Zhongliang Guo, Bowen Deng et al.
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
Makoto Shing, Kou Misaki, Han Bao et al.
TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models
Hsin Yi Hsieh, Shang-Wei Liu, Chang-Chih Meng et al.
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina et al.
TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
Jiankang Chen, Tianke Zhang, Changyi Liu et al.
Teaching Human Behavior Improves Content Understanding Abilities Of VLMs
Somesh Singh, Harini S I, Yaman Singla et al.
Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh, Nimrod Shabtay, Eli Schwartz et al.
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
Anurag Arnab, Ahmet Iscen, Mathilde Caron et al.
Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation
Mehrdad Noori, David Osowiechi, Gustavo Vargas Hakim et al.
Text to Sketch Generation with Multi-Styles
Tengjie Li, Shikui Tu, Lei Xu
The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models
Lijun Sheng, Jian Liang, Ran He et al.
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
Hong Li, Nanxi Li, Yuanjie Chen et al.
The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models
Alessandro Serra, Francesco Ortu, Emanuele Panizon et al.
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang, Yang Sui, Jinqi Xiao et al.
Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
Young Kyun Jang, Ser-Nam Lim
Towards Higher Effective Rank in Parameter-Efficient Fine-tuning using Khatri-Rao Product
Paul Albert, Frederic Zhang, Hemanth Saratchandran et al.
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Hao Guo, Xugong Qin, Jun Jie Ou Yang et al.
Training-Free Test-Time Adaptation via Shape and Style Guidance for Vision-Language Models
Shenglong Zhou, Manjiang Yin, Leiyu Sun et al.
TRAP: Targeted Redirecting of Agentic Preferences
Hangoo Kang, Jehyeok Yeon, Gagandeep Singh
Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation
Jusheng Zhang, Yijia Fan, Zimo Wen et al.
TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models
Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier et al.
TULIP: Token-length Upgraded CLIP
Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki Asano et al.
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
Yunheng Li, Yuxuan Li, Quan-Sheng Zeng et al.
Understanding Museum Exhibits using Vision-Language Reasoning
Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca et al.
Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro et al.
UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
Xiao Zhang, Fei Wei, Yong Wang et al.
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng, Kai Han
VCA: Video Curious Agent for Long Video Understanding
Zeyuan Yang, Delin Chen, Xueyang Yu et al.
VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning
Run Luo, Renke Shan, Longze Chen et al.
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Ryota Tanaka, Taichi Iki, Taku Hasegawa et al.
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
Muchao Ye, Weiyang Liu, Pan He
VideoAuteur: Towards Long Narrative Video Generation
Junfei Xiao, Feng Cheng, Lu Qi et al.
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj et al.
VideoGEM: Training-free Action Grounding in Videos
Felix Vogel, Walid Bousselham, Anna Kukleva et al.
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Zongxia Li, Xiyang Wu, Guangyao Shi et al.
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Li Kang, Xiufeng Song, Heng Zhou et al.
Vision-centric Token Compression in Large Language Model
Ling Xing, Alex Jinpeng Wang, Rui Yan et al.
Vision-Language Model IP Protection via Prompt-based Learning
Lianyu Wang, Meng Wang, Huazhu Fu et al.
Vision-Language Models Can't See the Obvious
Yasser Abdelaziz Dahou Djilali, Ngoc Huynh, Phúc Lê Khắc et al.
Vision-Language Models Do Not Understand Negation
Kumail Alhamoud, Shaden Alshammari, Yonglong Tian et al.
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.