"vision-language models" Papers

570 papers found • Page 5 of 12

MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi et al.

ICLR 2025 • arXiv:2409.15477 • 20 citations

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Jiacheng Chen, Tianhao Liang, Sherman Siu et al.

ICLR 2025 • arXiv:2410.10563 • 30 citations

MemEIC: A Step Toward Continual and Compositional Knowledge Editing

Jin Seong, Jiyun Park, Wencke Liermann et al.

NEURIPS 2025 • arXiv:2510.25798

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Ziyu Liu, Yuhang Zang, Xiaoyi Dong et al.

ICLR 2025 • arXiv:2410.17637 • 22 citations

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Xi Chen, Mingkang Zhu, Shaoteng Liu et al.

NEURIPS 2025 • arXiv:2506.22434

Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies Between Model Predictions and Human Responses in VQA

Jian Lan, Diego Frassinelli, Barbara Plank

AAAI 2025 • arXiv:2410.02773 • 3 citations

Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

Wenxuan Bao, Ruxi Deng, Jingrui He

NEURIPS 2025 • arXiv:2510.22127 • 1 citation

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen, Renrui Zhang, Dongzhi JIANG et al.

NEURIPS 2025 • arXiv:2506.05331 • 24 citations

MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents

Lukas Aichberger, Alasdair Paren, Guohao Li et al.

NEURIPS 2025 • arXiv:2503.10809 • 10 citations

MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers

Yang Tian, Zheng Lu, Mingqi Gao et al.

ICCV 2025 • arXiv:2503.16856 • 2 citations

MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes

Jin Zhang, Ruiheng Zhang, Zhe Cao et al.

NEURIPS 2025

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Fanqing Meng, Jin Wang, Chuanhao Li et al.

ICLR 2025 • arXiv:2408.02718 • 48 citations

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.

NEURIPS 2025 • arXiv:2510.26937

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying, Ruiping Liu, Chongyan Chen et al.

NEURIPS 2025 • arXiv:2510.11520

Modality-Specialized Synergizers for Interleaved Vision-Language Generalists

Zhiyang Xu, Minqian Liu, Ying Shen et al.

ICLR 2025 • arXiv:2407.03604 • 8 citations

Modeling dynamic social vision highlights gaps between deep learning and humans

Kathy Garcia, Emalie McMahon, Colin Conwell et al.

ICLR 2025

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.

ICLR 2025 • arXiv:2410.08182 • 30 citations

Multi-Label Test-Time Adaptation with Bound Entropy Minimization

Xiangyu Wu, Feng Yu, Yang Yang et al.

ICLR 2025 • arXiv:2502.03777 • 5 citations

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Zhi Gao, Bofei Zhang, Pengxiang Li et al.

ICLR 2025 • arXiv:2412.15606 • 38 citations

Multimodal Causal Reasoning for UAV Object Detection

Nianxin Li, Mao Ye, Lihua Zhou et al.

NEURIPS 2025

Multi-modal Knowledge Distillation-based Human Trajectory Forecasting

Jaewoo Jeong, Seohee Lee, Daehee Park et al.

CVPR 2025 • arXiv:2503.22201 • 8 citations

Multimodal Prompt Alignment for Facial Expression Recognition

Fuyan Ma, Yiran He, Bin Sun et al.

ICCV 2025 • arXiv:2506.21017 • 2 citations

Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap

Christopher Liao, Christian So, Theodoros Tsiligkaridis et al.

ICLR 2025 • arXiv:2402.04416 • 1 citation

Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Zhitao Zeng, Guojian Yuan, Junyuan Mao et al.

NEURIPS 2025 • oral • arXiv:2509.17429

Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing

Jeongmin Yu, Susang Kim, Kisu Lee et al.

ICCV 2025 • arXiv:2509.06336

MUNBa: Machine Unlearning via Nash Bargaining

Jing Wu, Mehrtash Harandi

ICCV 2025 • arXiv:2411.15537 • 8 citations

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Rongchang Xie, Chen Du, Ping Song et al.

ICCV 2025 • arXiv:2411.17762 • 27 citations

MVREC: A General Few-shot Defect Classification Model Using Multi-View Region-Context

Shuai Lyu, Rongchen Zhang, Zeqi Ma et al.

AAAI 2025 • arXiv:2412.16897 • 9 citations

NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection

Amirhossein Ansari, Ke Wang, Pulei Xiong

ICCV 2025 • arXiv:2507.09795

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

Boming Miao, Chunxiao Li, Xiaoxiao Wang et al.

CVPR 2025 • arXiv:2411.16503 • 3 citations

Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

Yanghao Wang, Long Chen

NEURIPS 2025 • arXiv:2508.11330 • 2 citations

Noisy Test-Time Adaptation in Vision-Language Models

Chentao Cao, Zhun Zhong, (Andrew) Zhanke Zhou et al.

ICLR 2025 • arXiv:2502.14604 • 4 citations

Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models

Davide Berasi, Matteo Farina, Massimiliano Mancini et al.

CVPR 2025 • highlight • arXiv:2503.17142 • 4 citations

NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin Bercea, Jun Li, Philipp Raffler et al.

NEURIPS 2025 • oral • 7 citations

Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection

Le Yang, Ziwei Zheng, Boxu Chen et al.

CVPR 2025 • arXiv:2412.13817 • 26 citations

Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

Wei Suo, Lijun Zhang, Mengyang Sun et al.

CVPR 2025 • highlight • arXiv:2503.00361 • 16 citations

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou et al.

CVPR 2025 • arXiv:2412.07626 • 47 citations

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang et al.

CVPR 2025 • arXiv:2504.04348 • 86 citations

OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

Mingjie Pan, Jiyao Zhang, Tianshu Wu et al.

CVPR 2025 • highlight • arXiv:2501.03841 • 47 citations

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

Junhao Xia, Haotian Zhu, Shuchao Pang et al.

NEURIPS 2025

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Zheyu Zhang, Ziqi Pang, Shixing Chen et al.

NEURIPS 2025 • oral

OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang, Bowen Wang, Dunjie Lu et al.

NEURIPS 2025 • spotlight • arXiv:2508.09123 • 37 citations

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei, Liang Zhao, Jianjian Sun et al.

NEURIPS 2025 • arXiv:2507.05255 • 14 citations

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Yihe Deng, Hritik Bansal, Fan Yin et al.

NEURIPS 2025 • arXiv:2503.17352 • 16 citations

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Shiting (Ginny) Xiao, Rishabh Kabra, Yuhang Li et al.

NEURIPS 2025 • spotlight • arXiv:2507.05427 • 2 citations

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.

ICCV 2025 • arXiv:2503.19755 • 66 citations

OS-ATLAS: Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu et al.

ICLR 2025 • 9 citations

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Atharva Gundawar, Som Sagar, Ransalu Senanayake

NEURIPS 2025 • arXiv:2506.23725 • 3 citations

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Navve Wasserman, Noam Rotstein, Roy Ganz et al.

CVPR 2025 • arXiv:2404.18212 • 29 citations

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Qinghao Ye, Xianhan Zeng, Fu Li et al.

ICLR 2025 • arXiv:2503.07906 • 17 citations