"vision-language models" Papers
167 papers found
CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts
Yichao Cai, Yuhang Liu, Zhen Zhang et al.
CLIM: Contrastive Language-Image Mosaic for Region Representation
Size Wu, Wenwei Zhang, Lumin Xu et al.
Code as Reward: Empowering Reinforcement Learning with VLMs
David Venuto, Mohammad Sami Nur Islam, Martin Klissarov et al.
COMMA: Co-articulated Multi-Modal Learning
Lianyu Hu, Liqing Gao, Zekang Liu et al.
Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
Hao Tan, Jun Li, Yizhuang Zhou et al.
Conceptual Codebook Learning for Vision-Language Models
Yi Zhang, Ke Yu, Siqi Wu et al.
Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models
Zhengbo Wang, Jian Liang, Ran He et al.
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities
Lorenzo Baraldi, Federico Cocchi, Marcella Cornia et al.
DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection
Zhi Zhou, Ming Yang, Jiang-Xin Shi et al.
Delving into Multimodal Prompting for Fine-Grained Visual Classification
Xin Jiang, Hao Tang, Junyao Gao et al.
Domain-Controlled Prompt Learning
Qinglong Cao, Zhengqin Xu, Yuntian Chen et al.
Efficient Black-box Adversarial Attacks via Bayesian Optimization Guided by a Function Prior
Shuyu Cheng, Yibo Miao, Yinpeng Dong et al.
Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection
Chentao Cao, Zhun Zhong, Zhanke Zhou et al.
Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models
Mingrui Wu, Jiayi Ji, Oucheng Huang et al.
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu et al.
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
Tong Shao, Zhuotao Tian, Hang Zhao et al.
Exploring Intrinsic Dimension for Vision-Language Model Pruning
Hanzhang Wang, Jiawen Zhang, Qingyuan Ma
Extracting Training Data From Document-Based VQA Models
Francesco Pinto, Nathalie Rauschmayr, Florian Tramer et al.
FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua, Jing Shi, Kushal Kafle et al.
Fool Your (Vision and) Language Model with Embarrassingly Simple Permutations
Yongshuo Zong, Tingyang Yu, Ruchika Chavhan et al.
GalLop: Learning global and local prompts for vision-language models
Marc Lafon, Elias Ramzi, Clément Rambour et al.
Generalizing to Unseen Domains via Text-guided Augmentation
Daiqing Qi, Handong Zhao, Aidong Zhang et al.
GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model
Ling Li, Yu Ye, Bingchuan Jiang et al.
Gradient-based Visual Explanation for Transformer-based CLIP
Chenyang Zhao, Kun Wang, Xingyu Zeng et al.
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen, Zhuokai Zhao, Hongyin Luo et al.
Harmonizing Generalization and Personalization in Federated Prompt Learning
Tianyu Cui, Hongxia Li, Jingya Wang et al.
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey, Euan Ong, Stuart Russell et al.
Improving fine-grained understanding in image-text pre-training
Ioana Bica, Anastasija Ilic, Matthias Bauer et al.
Improving Zero-Shot Generalization for CLIP with Variational Adapter
Ziqian Lu, Fengli Shen, Mushui Liu et al.
Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition
Yicheng Liu, Jie Wen, Chengliang Liu et al.
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
Shishira R Maiya, Anubhav Anubhav, Matthew Gwilliam et al.
Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
Yubin Wang, Xinyang Jiang, De Cheng et al.
Let Go of Your Labels with Unsupervised Transfer
Artyom Gadetsky, Yulun Jiang, Maria Brbic
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang, Yi Luan, Hexiang Hu et al.
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie, Yehui Tang, Ning Ding et al.
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying, Fanqing Meng, Jin Wang et al.
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie, Polina Kirichenko, Mark Ibrahim et al.
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification
Yajing Zhai, Yawen Zeng, Zhiyong Huang et al.
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Zhening Huang, Xiaoyang Wu, Xi Chen et al.
Open-Vocabulary Calibration for Fine-tuned CLIP
Shuoyuan Wang, Jindong Wang, Guoqing Wang et al.
Open Vocabulary Multi-Label Video Classification
Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
Haoyuan Wu, Xinyun Zhang, Peng Xu et al.
Position: The Platonic Representation Hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang et al.
Quantized Prompt for Efficient Generalization of Vision-Language Models
Tianxiang Hao, Xiaohan Ding, Juexiao Feng et al.
Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization
Jian Liang, Sheng, Zhengbo Wang et al.
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
Ming Nie, Renyuan Peng, Chunwei Wang et al.
Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
Xuantong Liu, Tianyang Hu, Wenjia Wang et al.
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
Dahun Kim, Anelia Angelova, Weicheng Kuo
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee, Yiran Luo, Tejas Gokhale et al.
Revisiting the Role of Language Priors in Vision-Language Models
Zhiqiu Lin, Xinyue Chen, Deepak Pathak et al.