2024 "vision-language models" Papers
89 papers found • Page 2 of 2
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie, Yehui Tang, Ning Ding et al.
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying, Fanqing Meng, Jin Wang et al.
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie, Polina Kirichenko, Mark Ibrahim et al.
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
Anurag Das, Xinting Hu, Li Jiang et al.
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification
Yajing Zhai, Yawen Zeng, Zhiyong Huang et al.
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Zhening Huang, Xiaoyang Wu, Xi Chen et al.
Open-Vocabulary Calibration for Fine-tuned CLIP
Shuoyuan Wang, Jindong Wang, Guoqing Wang et al.
Open Vocabulary Multi-Label Video Classification
Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
Haoyuan Wu, Xinyun Zhang, Peng Xu et al.
Position: The Platonic Representation Hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang et al.
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xinyu Sun, Lizhao Liu, Hongyan Zhi et al.
Quantized Prompt for Efficient Generalization of Vision-Language Models
Tianxiang Hao, Xiaohan Ding, Juexiao Feng et al.
Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization
Jian Liang, Lijun Sheng, Zhengbo Wang et al.
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
Ming Nie, Renyuan Peng, Chunwei Wang et al.
Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
Xuantong Liu, Tianyang Hu, Wenjia Wang et al.
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
Dahun Kim, Anelia Angelova, Weicheng Kuo
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee, Yiran Luo, Tejas Gokhale et al.
Revisiting the Role of Language Priors in Vision-Language Models
Zhiqiu Lin, Xinyue Chen, Deepak Pathak et al.
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann, Naman Singh, Francesco Croce et al.
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.
SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code
Ziniu Hu, Ahmet Iscen, Aashi Jain et al.
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen et al.
Semantic-Aware Data Augmentation for Text-to-Image Synthesis
Zhaorui Tan, Xi Yang, Kaizhu Huang
Simple Image-Level Classification Improves Open-Vocabulary Object Detection
Ruohuan Fang, Guansong Pang, Xiao Bai
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
Guohao Sun, Can Qin, Jiamian Wang et al.
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma, Furong Xu, Jian Liu et al.
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Qinyu Zhao, Ming Xu, Kartik Gupta et al.
The Hard Positive Truth about Vision-Language Compositionality
Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang et al.
Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
Jiaqi Xu, Mengyang Wu, Xiaowei Hu et al.
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
Minghang Zheng, Xinhao Cai, Qingchao Chen et al.
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Chen Ju, Haicheng Wang, Haozhe Cheng et al.
Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
Yifei Ming, Sharon Li
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
Pengkun Jiao, Na Zhao, Jingjing Chen et al.
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models
Hao Cheng, Erjia Xiao, Jindong Gu et al.
VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection
Peng Wu, Xuerong Zhou, Guansong Pang et al.
VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation
Zhen Qu, Xian Tao, Mukesh Prasad et al.
Visual Grounding for Object-Level Generalization in Reinforcement Learning
Haobin Jiang, Zongqing Lu
Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
Jinhao Li, Haopeng Li, Sarah Erfani et al.
Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning
Kun Ding, Haojian Zhang, Qiang Yu et al.