"vision-language models" Papers
570 papers found • Page 11 of 12
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying, Fanqing Meng, Jin Wang et al.
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie, Polina Kirichenko, Mark Ibrahim et al.
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Imad Eddine Toubal, Aditya Avinash, Neil Alldrin et al.
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
Anurag Das, Xinting Hu, Li Jiang et al.
Multi-modal Relation Distillation for Unified 3D Representation Learning
Huiqun Wang, Yiping Bao, Panwang Pan et al.
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification
Yajing Zhai, Yawen Zeng, Zhiyong Huang et al.
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
Lin Li, Haoyan Guan, Jianing Qiu et al.
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Moreno D'Incà, Elia Peruzzo et al.
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Zhening Huang, Xiaoyang Wu, Xi Chen et al.
Open-Set Recognition in the Age of Vision-Language Models
Dimity Miller, Niko Suenderhauf, Alex Kenna et al.
Open-Vocabulary Calibration for Fine-tuned CLIP
Shuoyuan Wang, Jindong Wang, Guoqing Wang et al.
Open Vocabulary Multi-Label Video Classification
Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.
PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation
Ardian Umam, Cheng-Kun Yang, Min-Hung Chen et al.
Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs
Shi Liu, Kecheng Zheng, Wei Chen
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek et al.
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
Haoyuan Wu, Xinyun Zhang, Peng Xu et al.
Position: The Platonic Representation Hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang et al.
PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization
Zining Chen, Weiqiu Wang, Zhicheng Zhao et al.
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xinyu Sun, Lizhao Liu, Hongyan Zhi et al.
Quantized Prompt for Efficient Generalization of Vision-Language Models
Tianxiang Hao, Xiaohan Ding, Juexiao Feng et al.
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz, Yair Kittenplon, Aviad Aberdam et al.
Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization
Jian Liang, Lijun Sheng, Zhengbo Wang et al.
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
Ming Nie, Renyuan Peng, Chunwei Wang et al.
Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
Xuantong Liu, Tianyang Hu, Wenjia Wang et al.
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
Jinrui Zhang, Teng Wang, Haigang Zhang et al.
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
Dahun Kim, Anelia Angelova, Weicheng Kuo
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee, Yiran Luo, Tejas Gokhale et al.
Revisiting the Role of Language Priors in Vision-Language Models
Zhiqiu Lin, Xinyue Chen, Deepak Pathak et al.
Robust Calibration of Large Vision-Language Adapters
Balamurali Murugesan, Julio Silva-Rodríguez, Ismail Ben Ayed et al.
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann, Naman Singh, Francesco Croce et al.
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.
SAI3D: Segment Any Instance in 3D Scenes
Yingda Yin, Yuzheng Liu, Yang Xiao et al.
Scaling Laws for Data Filtering — Data Curation cannot be Compute Agnostic
Sachin Goyal, Pratyush Maini, Zachary Lipton et al.
SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code
Ziniu Hu, Ahmet Iscen, Aashi Jain et al.
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
Bin Xie, Jiale Cao, Jin Xie et al.
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen et al.
Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
Kaiwen Cai, ZheKai Duan, Gaowen Liu et al.
Semantic-Aware Data Augmentation for Text-to-Image Synthesis
Zhaorui Tan, Xi Yang, Kaizhu Huang
Semantics-aware Motion Retargeting with Vision-Language Models
Haodong Zhang, ZhiKe Chen, Haocheng Xu et al.
SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance
Lukas Hoyer, David Tan, Muhammad Ferjad Naeem et al.
Simple Image-Level Classification Improves Open-Vocabulary Object Detection
Ruohuan Fang, Guansong Pang, Xiao Bai
SonicVisionLM: Playing Sound with Vision Language Models
Zhifeng Xie, Shengye Yu, Qile He et al.
Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
Xinyao Li, Yuke Li, Zhekai Du et al.
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
Guohao Sun, Can Qin, Jiamian Wang et al.
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma, Furong Xu, Jian Liu et al.
TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing
Xudong Wang, Ke-Yue Zhang, Taiping Yao et al.
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Qinyu Zhao, Ming Xu, Kartik Gupta et al.
The Hard Positive Truth about Vision-Language Compositionality
Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang et al.
Towards Neuro-Symbolic Video Understanding
Minkyu Choi, Harsh Goel, Mohammad Omama et al.