2025 "vision-language models" Papers
239 papers found • Page 3 of 5
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
Xin Dong, Shichao Dong, Jin Wang et al.
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
Hongxu Chen, Zhen Wang, Runshi Li et al.
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Junsung Park, Jungbeom Lee, Jongyoon Song et al.
Language-Assisted Feature Transformation for Anomaly Detection
EungGu Yun, Heonjin Ha, Yeongwoo Nam et al.
Large (Vision) Language Models are Unsupervised In-Context Learners
Artyom Gadetsky, Andrei Atanov, Yulun Jiang et al.
LaViDa: A Large Diffusion Model for Vision-Language Understanding
Shufan Li, Konstantinos Kallidromitis, Hritik Bansal et al.
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
Fan-Yun Sun, Weiyu Liu, Siyi Gu et al.
Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization
Hao Zheng, Jingjun Yi, Qi Bi et al.
Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Chenyu Zhou, Mengdan Zhang, Peixian Chen et al.
Lightweight Neural App Control
Filippos Christianos, Georgios Papoudakis, Thomas Coste et al.
LMFusion: Adapting Pretrained Language Models for Multimodal Generation
Weijia Shi, Xiaochuang Han, Chunting Zhou et al.
Locality Alignment Improves Vision-Language Models
Ian Covert, Tony Sun, James Y. Zou et al.
Locality-Aware Zero-Shot Human-Object Interaction Detection
Sanghyun Kim, Deunsol Jung, Minsu Cho
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Zhenpeng Huang, Jiaqi Li, Zihan Jia et al.
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation
Yongkang Li, Tianheng Cheng, Bin Feng et al.
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi et al.
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Jiacheng Chen, Tianhao Liang, Sherman Siu et al.
MemEIC: A Step Toward Continual and Compositional Knowledge Editing
Jin Seong, Jiyun Park, Wencke Liermann et al.
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong et al.
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Xi Chen, Mingkang Zhu, Shaoteng Liu et al.
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Xinyan Chen, Renrui Zhang, Dongzhi Jiang et al.
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Lukas Aichberger, Alasdair Paren, Guohao Li et al.
MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
Yang Tian, Zheng Lu, Mingqi Gao et al.
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Kedi Ying, Ruiping Liu, Chongyan Chen et al.
Modeling dynamic social vision highlights gaps between deep learning and humans
Kathy Garcia, Emalie McMahon, Colin Conwell et al.
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.
Multi-Label Test-Time Adaptation with Bound Entropy Minimization
Xiangyu Wu, Feng Yu, Yang Yang et al.
Multimodal Causal Reasoning for UAV Object Detection
Nianxin Li, Mao Ye, Lihua Zhou et al.
Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Christopher Liao, Christian So, Theodoros Tsiligkaridis et al.
Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
Zhitao Zeng, Guojian Yuan, Junyuan Mao et al.
MUNBa: Machine Unlearning via Nash Bargaining
Jing Wu, Mehrtash Harandi
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
Rongchang Xie, Chen Du, Ping Song et al.
Noisy Test-Time Adaptation in Vision-Language Models
Chentao Cao, Zhun Zhong, Zhanke Zhou et al.
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Davide Berasi, Matteo Farina, Massimiliano Mancini et al.
NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI
Cosmin Bercea, Jun Li, Philipp Raffler et al.
Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Wei Suo, Lijun Zhang, Mengyang Sun et al.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou et al.
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang et al.
One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head
Junhao Xia, Haotian Zhu, Shuchao Pang et al.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Zheyu Zhang, Ziqi Pang, Shixing Chen et al.
OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang, Bowen Wang, Dunjie Lu et al.
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Yana Wei, Liang Zhao, Jianjian Sun et al.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Yihe Deng, Hritik Bansal, Fan Yin et al.
OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
Shiting Xiao, Rishabh Kabra, Yuhang Li et al.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.
OS-ATLAS: Foundation Action Model for Generalist GUI Agents
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu et al.
PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
Atharva Gundawar, Som Sagar, Ransalu Senanayake
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Navve Wasserman, Noam Rotstein, Roy Ganz et al.
PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
Zhihao Zhu, Yifan Zheng, Siyu Pan et al.