"vision-language models" Papers

125 papers found • Page 1 of 3

$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

Vlad Sobal, Mark Ibrahim, Randall Balestriero et al.

ICLR 2025 • poster • arXiv:2407.18134 • 12 citations

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Hua Ye, Hang Ding, Siyuan Chen et al.

NeurIPS 2025 • poster • arXiv:2511.08399

Aligning Visual Contrastive learning models via Preference Optimization

Amirabbas Afzali, Borna Khodabandeh, Ali Rasekh et al.

ICLR 2025 • poster • arXiv:2411.08923 • 3 citations

AmorLIP: Efficient Language-Image Pretraining via Amortization

Haotian Sun, Yitong Li, Yuchen Zhuang et al.

NeurIPS 2025 • poster • arXiv:2505.18983 • 2 citations

Attention! Your Vision Language Model Could Be Maliciously Manipulated

Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.

NeurIPS 2025 • poster • arXiv:2505.19911 • 3 citations

A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

Zexi Jia, Chuanwei Huang, Yeshuang Zhu et al.

ICCV 2025 • poster • arXiv:2507.04699 • 3 citations

Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning

Hairui Ren, Fan Tang, He Zhao et al.

CVPR 2025 • poster • arXiv:2504.11930

CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning

Ji Qi, Ming Ding, Weihan Wang et al.

ICLR 2025 • poster • arXiv:2402.04236 • 33 citations

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Di Zhang, Jingdi Lei, Junxian Li et al.

CVPR 2025 • poster • arXiv:2411.18203 • 30 citations

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.

CVPR 2025 • poster • arXiv:2503.16707 • 7 citations

CrypticBio: A Large Multimodal Dataset for Visually Confusing Species

Georgiana Manolache, Gerard Schouten, Joaquin Vanschoren

NeurIPS 2025 • oral

DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models

Kaishen Wang, Hengrui Gu, Meijun Gao et al.

ICLR 2025 • poster • 7 citations

Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations

Tal Barami, Nimrod Berman, Ilan Naiman et al.

NeurIPS 2025 • poster • arXiv:2510.17313 • 2 citations

DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models

Fayi Le, Wenwu He, Chentao Cao et al.

NeurIPS 2025 • poster

EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.

NeurIPS 2025 • poster • arXiv:2510.25146 • 1 citation

Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models

Xudong Li, Zihao Huang, Yan Zhang et al.

ICCV 2025 • poster • arXiv:2409.05381 • 2 citations

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

Xinli Xu, Wenhang Ge, Dicong Qiu et al.

ICCV 2025 • poster • arXiv:2412.11258 • 7 citations

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge et al.

NeurIPS 2025 • poster • arXiv:2504.13169 • 10 citations

GenIR: Generative Visual Feedback for Mental Image Retrieval

Diji Yang, Minghao Liu, Chung-Hsiang Lo et al.

NeurIPS 2025 • poster • arXiv:2506.06220

GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Pengyue Jia, Seongheon Park, Song Gao et al.

NeurIPS 2025 • poster • arXiv:2505.13731 • 3 citations

Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression

Juan Chen, Honglin Liu, Yingying Ao et al.

NeurIPS 2025 • poster

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

He Zhu, Quyu Kong, Kechun Xu et al.

CVPR 2025 • poster • arXiv:2504.04744 • 6 citations

Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Hao Fang, Changle Zhou, Jiawei Kong et al.

NeurIPS 2025 • poster • arXiv:2505.19678 • 6 citations

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Tong Wei, Yijun Yang, Junliang Xing et al.

ICCV 2025 • poster • arXiv:2503.08525 • 8 citations

GTR-Loc: Geospatial Text Regularization Assisted Outdoor LiDAR Localization

Shangshu Yu, Wen Li, Xiaotian Sun et al.

NeurIPS 2025 • poster

INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Xin Dong, Shichao Dong, Jin Wang et al.

ICCV 2025 • poster • arXiv:2507.05056 • 3 citations

Locality-Aware Zero-Shot Human-Object Interaction Detection

Sanghyun Kim, Deunsol Jung, Minsu Cho

CVPR 2025 • poster • arXiv:2505.19503

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

Zhenpeng Huang, Jiaqi Li, Zihan Jia et al.

NeurIPS 2025 • poster

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Xi Chen, Mingkang Zhu, Shaoteng Liu et al.

NeurIPS 2025 • poster • arXiv:2506.22434

MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents

Lukas Aichberger, Alasdair Paren, Guohao Li et al.

NeurIPS 2025 • poster • arXiv:2503.10809 • 10 citations

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.

ICLR 2025 • poster • arXiv:2410.08182 • 29 citations

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Zheyu Zhang, Ziqi Pang, Shixing Chen et al.

NeurIPS 2025 • oral

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.

ICCV 2025 • poster • arXiv:2503.19755 • 62 citations

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Navve Wasserman, Noam Rotstein, Roy Ganz et al.

CVPR 2025 • poster • arXiv:2404.18212 • 29 citations

RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models

Youngjun Lee, Doyoung Kim, Junhyeok Kang et al.

ICLR 2025 • poster • 5 citations

Realistic Test-Time Adaptation of Vision-Language Models

Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer et al.

CVPR 2025 • highlight • arXiv:2501.03729

RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills

Chunru Lin, Haotian Yuan, Yian Wang et al.

NeurIPS 2025 • poster • arXiv:2506.14763 • 2 citations

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang et al.

NeurIPS 2025 • oral • arXiv:2509.01907 • 2 citations

SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yang Yuhuan, Chaofan Ma et al.

NeurIPS 2025 • poster • arXiv:2510.10160

SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency

Yangyang Guo, Mohan Kankanhalli

ICCV 2025 • poster • arXiv:2411.09126 • 3 citations

Semantic Temporal Abstraction via Vision-Language Model Guidance for Efficient Reinforcement Learning

Tian-Shuo Liu, Xu-Hui Liu, Ruifeng Chen et al.

ICLR 2025 • oral

Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation

Yong Liu, Song-Li Wu, Sule Bai et al.

ICCV 2025 • poster • arXiv:2506.16058 • 2 citations

TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models

Hsin Yi Hsieh, Shang-Wei Liu, Chang-Chih Meng et al.

NeurIPS 2025 • poster

Teaching Human Behavior Improves Content Understanding Abilities Of VLMs

Somesh Singh, Harini S I, Yaman Singla et al.

ICLR 2025 • poster • 2 citations

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Anurag Arnab, Ahmet Iscen, Mathilde Caron et al.

NeurIPS 2025 • oral • arXiv:2507.02001 • 8 citations

Text to Sketch Generation with Multi-Styles

Tengjie Li, Shikui Tu, Lei Xu

NeurIPS 2025 • poster • arXiv:2511.04123

Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Young Kyun Jang, Ser-Nam Lim

ICCV 2025 • poster • arXiv:2405.14715 • 2 citations

Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation

Jusheng Zhang, Yijia Fan, Zimo Wen et al.

NeurIPS 2025 • poster

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Xiao Zhang, Fei Wei, Yong Wang et al.

ICCV 2025 • poster • arXiv:2507.00721

Vision-centric Token Compression in Large Language Model

Ling Xing, Alex Jinpeng Wang, Rui Yan et al.

NeurIPS 2025 • spotlight • arXiv:2502.00791 • 7 citations