"vision-language models" Papers
570 papers found • Page 3 of 12
Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
Yucheng Shi, Quanzheng Li, Jin Sun et al.
Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions
Jihoon Kwon, Kyle Min, Jy-yong Sohn
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
Yudi Shi, Shangzhe Di, Qirui Chen et al.
Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding
Yixiong Fang, Ziran Yang, Zhaorun Chen et al.
Enhancing Vision-Language Model with Unmasked Token Alignment
Hongsheng Li, Jihao Liu, Boxiao Liu et al.
Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
Yiwei Yang, Chung Peng Lee, Shangbin Feng et al.
Evading Data Provenance in Deep Neural Networks
Hongyu Zhu, Sichu Liang, Wenwen Wang et al.
Evaluating Model Perception of Color Illusions in Photorealistic Scenes
Lingjun Mao, Zineng Tang, Alane Suhr
Evaluating Vision-Language Models as Evaluators in Path Planning
Mohamed Aghzal, Xiang Yue, Erion Plaku et al.
EvolvedGRPO: Unlocking Reasoning in LVLMs via Progressive Instruction Evolution
Zhebei Shen, Qifan Yu, Juncheng Li et al.
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
Xiao Yu, Baolin Peng, Vineeth Vajipey et al.
Explaining Domain Shifts in Language: Concept Erasing for Interpretable Image Classification
Zequn Zeng, Yudi Su, Jianqiao Sun et al.
Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
Seogkyu Jeon, Kibeom Hong, Hyeran Byun
Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere
Li Ju, Max Andersson, Stina Fredriksson et al.
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao, Bryan Hooi, Jun Liu et al.
Extract Free Dense Misalignment from CLIP
JeongYeon Nam, Jinbae Im, Wonjae Kim et al.
FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
Xinhua Lu, Runhe Lai, Yanqi Wu et al.
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey et al.
FedMGP: Personalized Federated Learning with Multi-Group Text-Visual Prompts
Weihao Bo, Yanpeng Sun, Yu Wang et al.
FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
Mainak Singha, Subhankar Roy, Sarthak Mehrotra et al.
Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models
Xudong Li, Zihao Huang, Yan Zhang et al.
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong, Delong Ran, Jinyuan Liu et al.
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua, Qing Liu, Lingzhi Zhang et al.
Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection
Jiawen Zhu, Yew-Soon Ong, Chunhua Shen et al.
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Yue Li, Meng Tian, Zhenyu Lin et al.
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Yifan Shen, Yuanzhe Liu, Jingyuan Zhu et al.
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs
Mothilal Asokan, Kebin Wu, Fatima Albreiki
FlySearch: Exploring how vision-language models explore
Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz et al.
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep S Lubana et al.
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Mingyang Song, Xiaoye Qu, Jiawei Zhou et al.
From Panels to Prose: Generating Literary Narratives from Comics
Ragav Sachdeva, Andrew Zisserman
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding
Tianyu Chen, Xingcheng Fu, Yisen Gao et al.
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Xinli Xu, Wenhang Ge, Dicong Qiu et al.
Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model
Zhaochong An, Guolei Sun, Yun Liu et al.
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge et al.
Generate, Transduct, Adapt: Iterative Transduction with VLMs
Oindrila Saha, Logan Lawrence, Grant Van Horn et al.
Generating CAD Code with Vision-Language Models for 3D Designs
Kamel Alrashedy, Pradyumna Tambwekar, Zulfiqar Haider Zaidi et al.
Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency
Xiangyu Guo, Zhanqian Wu, Kaixin Xiong et al.
GenIR: Generative Visual Feedback for Mental Image Retrieval
Diji Yang, Minghao Liu, Chung-Hsiang Lo et al.
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
Pengyue Jia, Seongheon Park, Song Gao et al.
GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
Haoqiang Kang, Enna Sachdeva, Piyush Gupta et al.
Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression
Juan Chen, Honglin Liu, Yingying Ao et al.
Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Wei Deng, Mengshi Qi, Huadong Ma
GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
Seongheon Park, Sharon Li
GoalLadder: Incremental Goal Discovery with Vision-Language Models
Alexey Zakharov, Shimon Whiteson
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
He Zhu, Quyu Kong, Kechun Xu et al.
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang, Changle Zhou, Jiawei Kong et al.
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Rui Hu, Yuxuan Zhang, Lianghui Zhu et al.
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
Yongshuo Zong, Qin Zhang, Dongsheng An et al.