Poster "vision-language models" Papers

157 papers found • Page 1 of 4

𝕏-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

Vlad Sobal, Mark Ibrahim, Randall Balestriero et al.

ICLR 2025 • poster • arXiv:2407.18134 • 12 citations

AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

Hongyuan Dong, Dingkang Yang, Xiao Liang et al.

NeurIPS 2025 • poster • arXiv:2506.13274 • 3 citations

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Hua Ye, Hang Ding, Siyuan Chen et al.

NeurIPS 2025 • poster • arXiv:2511.08399

Aligning Visual Contrastive learning models via Preference Optimization

Amirabbas Afzali, Borna Khodabandeh, Ali Rasekh et al.

ICLR 2025 • poster • arXiv:2411.08923 • 3 citations

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.

NeurIPS 2025 • poster • arXiv:2502.01341

AmorLIP: Efficient Language-Image Pretraining via Amortization

Haotian Sun, Yitong Li, Yuchen Zhuang et al.

NeurIPS 2025 • poster • arXiv:2505.18983 • 2 citations

Attention! Your Vision Language Model Could Be Maliciously Manipulated

Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.

NeurIPS 2025 • poster • arXiv:2505.19911 • 3 citations

Attribute-based Visual Reprogramming for Vision-Language Models

Chengyi Cai, Zesheng Ye, Lei Feng et al.

ICLR 2025 • poster • arXiv:2501.13982 • 4 citations

A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

Zexi Jia, Chuanwei Huang, Yeshuang Zhu et al.

ICCV 2025 • poster • arXiv:2507.04699 • 3 citations

Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning

Hairui Ren, Fan Tang, He Zhao et al.

CVPR 2025 • poster • arXiv:2504.11930

Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

Yupeng Hu, Changxing Ding, Chang Sun et al.

ICCV 2025 • poster • arXiv:2507.06510 • 2 citations

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Shuming Liu, Chen Zhao, Tianqi Xu et al.

CVPR 2025 • poster • arXiv:2503.21483 • 26 citations

CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning

Ji Qi, Ming Ding, Weihan Wang et al.

ICLR 2025 • poster • arXiv:2402.04236 • 33 citations

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Di Zhang, Jingdi Lei, Junxian Li et al.

CVPR 2025 • poster • arXiv:2411.18203 • 30 citations

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.

CVPR 2025 • poster • arXiv:2503.16707 • 7 citations

Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci et al.

ICLR 2025 • poster • arXiv:2502.04263 • 15 citations

DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models

Kaishen Wang, Hengrui Gu, Meijun Gao et al.

ICLR 2025 • poster • 7 citations

Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations

Tal Barami, Nimrod Berman, Ilan Naiman et al.

NeurIPS 2025 • poster • arXiv:2510.17313 • 2 citations

Divergence-enhanced Knowledge-guided Context Optimization for Visual-Language Prompt Tuning

Yilun Li, Miaomiao Cheng, Xu Han et al.

ICLR 2025 • poster • 6 citations

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities

Zheyuan Zhang, Fengyuan Hu, Jayjun Lee et al.

ICLR 2025 • poster • arXiv:2410.17385 • 40 citations

DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models

Fayi Le, Wenwu He, Chentao Cao et al.

NeurIPS 2025 • poster

EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.

NeurIPS 2025 • poster • arXiv:2510.25146 • 1 citation

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang et al.

CVPR 2025 • poster • arXiv:2409.18042 • 44 citations

Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Yucheng Shi, Quanzheng Li, Jin Sun et al.

ICLR 2025 • poster • arXiv:2502.14044 • 6 citations

Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

Jihoon Kwon, Kyle Min, Jy-yong Sohn

NeurIPS 2025 • poster • arXiv:2510.16540

Enhancing Vision-Language Model with Unmasked Token Alignment

Hongsheng Li, Jihao Liu, Boxiao Liu et al.

ICLR 2025 • poster • arXiv:2405.19009

Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models

Xudong Li, Zihao Huang, Yan Zhang et al.

ICCV 2025 • poster • arXiv:2409.05381 • 2 citations

FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs

Mothilal Asokan, Kebin Wu, Fatima Albreiki

CVPR 2025 • poster • arXiv:2504.01916 • 14 citations

FlySearch: Exploring how vision-language models explore

Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz et al.

NeurIPS 2025 • poster • arXiv:2506.02896 • 3 citations

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

Xinli Xu, Wenhang Ge, Dicong Qiu et al.

ICCV 2025 • poster • arXiv:2412.11258 • 7 citations

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge et al.

NeurIPS 2025 • poster • arXiv:2504.13169 • 10 citations

GenIR: Generative Visual Feedback for Mental Image Retrieval

Diji Yang, Minghao Liu, Chung-Hsiang Lo et al.

NeurIPS 2025 • poster • arXiv:2506.06220

GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Pengyue Jia, Seongheon Park, Song Gao et al.

NeurIPS 2025 • poster • arXiv:2505.13731 • 3 citations

Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression

Juan Chen, Honglin Liu, Yingying Ao et al.

NeurIPS 2025 • poster

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

He Zhu, Quyu Kong, Kechun Xu et al.

CVPR 2025 • poster • arXiv:2504.04744 • 6 citations

Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Hao Fang, Changle Zhou, Jiawei Kong et al.

NeurIPS 2025 • poster • arXiv:2505.19678 • 6 citations

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Tong Wei, Yijun Yang, Junliang Xing et al.

ICCV 2025 • poster • arXiv:2503.08525 • 8 citations

GTR-Loc: Geospatial Text Regularization Assisted Outdoor LiDAR Localization

Shangshu Yu, Wen Li, Xiaotian Sun et al.

NeurIPS 2025 • poster

Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali et al.

CVPR 2025 • poster • arXiv:2409.19425 • 2 citations

INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Xin Dong, Shichao Dong, Jin Wang et al.

ICCV 2025 • poster • arXiv:2507.05056 • 3 citations

Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization

Hao Zheng, Jingjun Yi, Qi Bi et al.

NeurIPS 2025 • poster

LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Weijia Shi, Xiaochuang Han, Chunting Zhou et al.

NeurIPS 2025 • poster • arXiv:2412.15188 • 79 citations

Locality Alignment Improves Vision-Language Models

Ian Covert, Tony Sun, James Y Zou et al.

ICLR 2025 • poster • arXiv:2410.11087

Locality-Aware Zero-Shot Human-Object Interaction Detection

Sanghyun Kim, Deunsol Jung, Minsu Cho

CVPR 2025 • poster • arXiv:2505.19503

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

Zhenpeng Huang, Jiaqi Li, Zihan Jia et al.

NeurIPS 2025 • poster

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Ziyu Liu, Yuhang Zang, Xiaoyi Dong et al.

ICLR 2025 • poster • arXiv:2410.17637 • 19 citations

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Xi Chen, Mingkang Zhu, Shaoteng Liu et al.

NeurIPS 2025 • poster • arXiv:2506.22434

MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents

Lukas Aichberger, Alasdair Paren, Guohao Li et al.

NeurIPS 2025 • poster • arXiv:2503.10809 • 10 citations

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying, Ruiping Liu, Chongyan Chen et al.

NeurIPS 2025 • poster • arXiv:2510.11520

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.

ICLR 2025 • poster • arXiv:2410.08182 • 29 citations