Poster "vision-language models" Papers
157 papers found • Page 1 of 4
$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
Vlad Sobal, Mark Ibrahim, Randall Balestriero et al.
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Hongyuan Dong, Dingkang Yang, Xiao Liang et al.
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen et al.
Aligning Visual Contrastive learning models via Preference Optimization
Amirabbas Afzali, Borna khodabandeh, Ali Rasekh et al.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang et al.
Attention! Your Vision Language Model Could Be Maliciously Manipulated
Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.
Attribute-based Visual Reprogramming for Vision-Language Models
Chengyi Cai, Zesheng Ye, Lei Feng et al.
A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
Zexi Jia, Chuanwei Huang, Yeshuang Zhu et al.
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Hairui Ren, Fan Tang, He Zhao et al.
Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Yupeng Hu, Changxing Ding, Chang Sun et al.
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Shuming Liu, Chen Zhao, Tianqi Xu et al.
CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning
Ji Qi, Ming Ding, Weihan Wang et al.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Jingdi Lei, Junxian Li et al.
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci et al.
DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Kaishen Wang, Hengrui Gu, Meijun Gao et al.
Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman et al.
Divergence-enhanced Knowledge-guided Context Optimization for Visual-Language Prompt Tuning
Yilun Li, Miaomiao Cheng, Xu Han et al.
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee et al.
DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models
Fayi Le, Wenwu He, Chentao Cao et al.
EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.
Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
Yucheng Shi, Quanzheng Li, Jin Sun et al.
Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions
Jihoon Kwon, Kyle Min, Jy-yong Sohn
Enhancing Vision-Language Model with Unmasked Token Alignment
Hongsheng Li, Jihao Liu, Boxiao Liu et al.
Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models
Xudong Li, Zihao Huang, Yan Zhang et al.
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs
Mothilal Asokan, Kebin wu, Fatima Albreiki
FlySearch: Exploring how vision-language models explore
Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz et al.
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Xinli Xu, Wenhang Ge, Dicong Qiu et al.
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge et al.
GenIR: Generative Visual Feedback for Mental Image Retrieval
Diji Yang, Minghao Liu, Chung-Hsiang Lo et al.
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
Pengyue Jia, Seongheon Park, Song Gao et al.
Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression
Juan Chen, Honglin liu, Yingying Ao et al.
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
He Zhu, Quyu Kong, Kechun Xu et al.
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang, Changle Zhou, Jiawei Kong et al.
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Tong Wei, Yijun Yang, Junliang Xing et al.
GTR-Loc: Geospatial Text Regularization Assisted Outdoor LiDAR Localization
Shangshu Yu, Wen Li, Xiaotian Sun et al.
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil, Raiymbek Akshulakov, YASSER ABDELAZIZ DAHOU DJILALI et al.
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
Xin Dong, Shichao Dong, Jin Wang et al.
Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization
Hao Zheng, Jingjun Yi, Qi Bi et al.
LMFusion: Adapting Pretrained Language Models for Multimodal Generation
Weijia Shi, Xiaochuang Han, Chunting Zhou et al.
Locality Alignment Improves Vision-Language Models
Ian Covert, Tony Sun, James Y Zou et al.
Locality-Aware Zero-Shot Human-Object Interaction Detection
Sanghyun Kim, Deunsol Jung, Minsu Cho
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Zhenpeng Huang, Jiaqi Li, zihan jia et al.
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong et al.
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Xi Chen, Mingkang Zhu, Shaoteng Liu et al.
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Lukas Aichberger, Alasdair Paren, Guohao Li et al.
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Kedi Ying, Ruiping Liu, Chongyan Chen et al.
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.