2025 "vision-language models" Papers
228 papers found • Page 1 of 5
$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
Vlad Sobal, Mark Ibrahim, Randall Balestriero et al.
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Hongyuan Dong, Dingkang Yang, Xiao Liang et al.
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg, Naman Deep Singh, Matthias Hein
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin et al.
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan, Hanqing Liu, Yao Huang et al.
AgroBench: Vision-Language Model Benchmark in Agriculture
Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka et al.
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar et al.
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen et al.
Aligning Visual Contrastive learning models via Preference Optimization
Amirabbas Afzali, Borna Khodabandeh, Ali Rasekh et al.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang et al.
Approximate Domain Unlearning for Vision-Language Models
Kodai Kawamura, Yuta Goto, Rintaro Yanagi et al.
Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang et al.
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao, Yizeng Han, Jiasheng Tang et al.
Attention! Your Vision Language Model Could Be Maliciously Manipulated
Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.
Attribute-based Visual Reprogramming for Vision-Language Models
Chengyi Cai, Zesheng Ye, Lei Feng et al.
Automated Model Discovery via Multi-modal & Multi-step Pipeline
Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin et al.
A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
Zexi Jia, Chuanwei Huang, Yeshuang Zhu et al.
Bayesian Test-Time Adaptation for Vision-Language Models
Lihua Zhou, Mao Ye, Shuaifeng Li et al.
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu et al.
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Hairui Ren, Fan Tang, He Zhao et al.
Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Yupeng Hu, Changxing Ding, Chang Sun et al.
Black Sheep in the Herd: Playing with Spuriously Correlated Attributes for Vision-Language Recognition
Xinyu Tian, Shu Zou, Zhaoyuan Yang et al.
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Shuming Liu, Chen Zhao, Tianqi Xu et al.
Bridging the gap to real-world language-grounded visual concept learning
Whie Jung, Semin Kim, Junee Kim et al.
C-CLIP: Multimodal Continual Learning for Vision-Language Model
Wenzhuo Liu, Fei Zhu, Longhui Wei et al.
CF-VLM: CounterFactual Vision-Language Fine-tuning
Jusheng Zhang, Kaitong Cai, Yijia Fan et al.
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning
Ji Qi, Ming Ding, Weihan Wang et al.
Compositional Entailment Learning for Hyperbolic Vision-Language Models
Avik Pal, Max van Spengler, Guido D'Amely di Melendugno et al.
Conditional Representation Learning for Customized Tasks
Honglin Liu, Chao Sun, Peng Hu et al.
Context-Aware Academic Emotion Dataset and Benchmark
Luming Zhao, Jingwen Xuan, Jiamin Lou et al.
Context-Aware Multimodal Pretraining
Karsten Roth, Zeynep Akata, Dima Damen et al.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Jingdi Lei, Junxian Li et al.
Cropper: Vision-Language Model for Image Cropping through In-Context Learning
Seung Hyun Lee, Jijun Jiang, Yiran Xu et al.
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci et al.
CrypticBio: A Large Multimodal Dataset for Visually Confusing Species
Georgiana Manolache, Gerard Schouten, Joaquin Vanschoren
DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Kaishen Wang, Hengrui Gu, Meijun Gao et al.
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Junjie Wang, Bin Chen, Yulin Li et al.
Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman et al.
Divergence-enhanced Knowledge-guided Context Optimization for Visual-Language Prompt Tuning
Yilun Li, Miaomiao Cheng, Xu Han et al.
DocVLM: Make Your VLM an Efficient Reader
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz et al.
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee et al.
DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models
Fayi Le, Wenwu He, Chentao Cao et al.
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria, Adinath Dukre, Feilong Tang et al.
DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong et al.
Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang, Shuibo Zhang, Kaipeng Zhang et al.
EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.