"vision-language models" Papers
570 papers found • Page 1 of 12
𝕏-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
Vlad Sobal, Mark Ibrahim, Randall Balestriero et al.
3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks
Xiaotang Gai, Jiaxiang Liu, Yichen Li et al.
3D-SPATIAL MULTIMODAL MEMORY
Xueyan Zou, Yuchen Song, Ri-Zhao Qiu et al.
Active Data Curation Effectively Distills Large-Scale Multimodal Models
Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem et al.
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Hongyuan Dong, Dingkang Yang, Xiao Liang et al.
Adapting Text-to-Image Generation with Feature Difference Instruction for Generic Image Restoration
Chao Wang, Hehe Fan, Huichen Yang et al.
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg, Naman Deep Singh, Matthias Hein
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin et al.
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan, Hanqing Liu, Yao Huang et al.
AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark
Aruna Gauba, Irene Pi, Yunze Man et al.
AgroBench: Vision-Language Model Benchmark in Agriculture
Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka et al.
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar et al.
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen et al.
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Junming Liu, Siyuan Meng, Yanting Gao et al.
Aligning Visual Contrastive learning models via Preference Optimization
Amirabbas Afzali, Borna khodabandeh, Ali Rasekh et al.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.
All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong et al.
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang et al.
A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection
Gaku Morio, Harri Rowlands, Dominik Stammbach et al.
An Information-theoretical Framework for Understanding Out-of-distribution Detection with Pretrained Vision-Language Models
Bo Peng, Jie Lu, Guangquan Zhang et al.
An Intelligent Agentic System for Complex Image Restoration Problems
Kaiwen Zhu, Jinjin Gu, Zhiyuan You et al.
Approximate Domain Unlearning for Vision-Language Models
Kodai Kawamura, Yuta Goto, Rintaro Yanagi et al.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.
Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang et al.
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao, Yizeng Han, Jiasheng Tang et al.
Attention! Your Vision Language Model Could Be Maliciously Manipulated
Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.
Attribute-based Visual Reprogramming for Vision-Language Models
Chengyi Cai, Zesheng Ye, Lei Feng et al.
Automated Model Discovery via Multi-modal & Multi-step Pipeline
Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin et al.
A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
Zexi Jia, Chuanwei Huang, Yeshuang Zhu et al.
A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter
Zirun Guo, Xize Cheng, Yangyang Wu et al.
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
Zhantao Yang, Ruili Feng, Keyu Yan et al.
Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
Zhen Qu, Xian Tao, Xinyi Gong et al.
Bayesian Test-Time Adaptation for Vision-Language Models
Lihua Zhou, Mao Ye, Shuaifeng Li et al.
BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation
Zibo Zhou, Yue Hu, Lingkai Zhang et al.
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu et al.
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Hairui Ren, Fan Tang, He Zhao et al.
Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Yupeng Hu, Changxing Ding, Chang Sun et al.
Black Sheep in the Herd: Playing with Spuriously Correlated Attributes for Vision-Language Recognition
Xinyu Tian, Shu Zou, Zhaoyuan Yang et al.
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao, Xiang Zheng, Lin Luo et al.
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Shuming Liu, Chen Zhao, Tianqi Xu et al.
Boosting the visual interpretability of CLIP via adversarial fine-tuning
Shizhan Gong, Haoyu LEI, Qi Dou et al.
Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
Chunxiao Li, Xiaoxiao Wang, Meiling Li et al.
Bridging the gap to real-world language-grounded visual concept learning
whie jung, Semin Kim, Junee Kim et al.
Causality-guided Prompt Learning for Vision-language Models via Visual Granulation
Mengyu Gao, Qiulei Dong
C-CLIP: Multimodal Continual Learning for Vision-Language Model
Wenzhuo Liu, Fei Zhu, Longhui Wei et al.
CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting
Lei Tian, Xiaomin Li, Liqian Ma et al.
CF-VLM: CounterFactual Vision-Language Fine-tuning
jusheng zhang, Kaitong Cai, Yijia Fan et al.
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Peng Xie, Yequan Bie, Jianda Mao et al.
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models
Ke Niu, Haiyang Yu, Mengyang Zhao et al.