Poster Papers matching "vision-language models"
475 papers found • Page 2 of 10
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Fanding Huang, Jingyan Jiang, Qinting Jiang et al.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Jingdi Lei, Junxian Li et al.
Cropper: Vision-Language Model for Image Cropping through In-Context Learning
Seung Hyun Lee, Jijun Jiang, Yiran Xu et al.
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.
Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci et al.
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems
Aniket Rege, Zinnia Nie, Unmesh Raskar et al.
CURV: Coherent Uncertainty-Aware Reasoning in Vision-Language Models for X-Ray Report Generation
Ziao Wang, Sixing Yan, Kejing Yin et al.
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
Chuhan Zhang, Chaoyang Zhu, Pingcheng Dong et al.
DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Kaishen Wang, Hengrui Gu, Meijun Gao et al.
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Junjie Wang, Bin Chen, Yulin Li et al.
Describe Anything: Detailed Localized Image and Video Captioning
Long Lian, Yifan Ding, Yunhao Ge et al.
DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
Zhen Qu, Xian Tao, Xinyi Gong et al.
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Tao Zhang, Cheng Da, Kun Ding et al.
Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman et al.
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
Chanyoung Kim, Dayun Ju, Woojung Han et al.
Divergence-enhanced Knowledge-guided Context Optimization for Visual-Language Prompt Tuning
Yilun Li, Miaomiao Cheng, Xu Han et al.
DocVLM: Make Your VLM an Efficient Reader
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz et al.
Do LVLMs Truly Understand Video Anomalies? Revealing Hallucination via Co-Occurrence Patterns
Menghao Zhang, Huazheng Wang, Pengfei Ren et al.
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Letitia Parcalabescu, Anette Frank
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee et al.
DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
Tianhong Zhou, Xu Yin, Yingtao Zhu et al.
DS-VLM: Diffusion Supervision Vision Language Model
Zhen Sun, Yunhang Shen, Jie Li et al.
DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models
Fayi Le, Wenwu He, Chentao Cao et al.
Dual-Process Image Generation
Grace Luo, Jonathan Granskog, Aleksander Holynski et al.
DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong et al.
Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang, Shuibo Zhang, Kaipeng Zhang et al.
EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.
Each Complexity Deserves a Pruning Policy
Hanshi Wang, Yuhao Xu, Zekun Xu et al.
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
Ronghao Dang, Yuqian Yuan, Wenqi Zhang et al.
EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby et al.
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
Jeonghyeon Kim, Sangheum Hwang
Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
Yucheng Shi, Quanzheng Li, Jin Sun et al.
Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions
Jihoon Kwon, Kyle Min, Jy-yong Sohn
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
Yudi Shi, Shangzhe Di, Qirui Chen et al.
Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding
Yixiong Fang, Ziran Yang, Zhaorun Chen et al.
Enhancing Vision-Language Model with Unmasked Token Alignment
Hongsheng Li, Jihao Liu, Boxiao Liu et al.
Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
Yiwei Yang, Chung Peng Lee, Shangbin Feng et al.
Evaluating Model Perception of Color Illusions in Photorealistic Scenes
Lingjun Mao, Zineng Tang, Alane Suhr
Evaluating Vision-Language Models as Evaluators in Path Planning
Mohamed Aghzal, Xiang Yue, Erion Plaku et al.
EvolvedGRPO: Unlocking Reasoning in LVLMs via Progressive Instruction Evolution
Zhebei Shen, Qifan Yu, Juncheng Li et al.
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
Xiao Yu, Baolin Peng, Vineeth Vajipey et al.
Explaining Domain Shifts in Language: Concept Erasing for Interpretable Image Classification
Zequn Zeng, Yudi Su, Jianqiao Sun et al.
Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
Seogkyu Jeon, Kibeom Hong, Hyeran Byun
Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere
Li Ju, Max Andersson, Stina Fredriksson et al.
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao, Bryan Hooi, Jun Liu et al.
FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
Xinhua Lu, Runhe Lai, Yanqi Wu et al.
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey et al.