"vision-language models" Papers
570 papers found • Page 5 of 12
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi et al.
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Jiacheng Chen, Tianhao Liang, Sherman Siu et al.
MemEIC: A Step Toward Continual and Compositional Knowledge Editing
Jin Seong, Jiyun Park, Wencke Liermann et al.
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong et al.
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Xi Chen, Mingkang Zhu, Shaoteng Liu et al.
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies Between Model Predictions and Human Responses in VQA
Jian Lan, Diego Frassinelli, Barbara Plank
Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions
Wenxuan Bao, Ruxi Deng, Jingrui He
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Xinyan Chen, Renrui Zhang, Dongzhi Jiang et al.
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Lukas Aichberger, Alasdair Paren, Guohao Li et al.
MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
Yang Tian, Zheng Lu, Mingqi Gao et al.
MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes
Jin Zhang, Ruiheng Zhang, Zhe Cao et al.
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng, Jin Wang, Chuanhao Li et al.
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Kedi Ying, Ruiping Liu, Chongyan Chen et al.
Modality-Specialized Synergizers for Interleaved Vision-Language Generalists
Zhiyang Xu, Minqian Liu, Ying Shen et al.
Modeling dynamic social vision highlights gaps between deep learning and humans
Kathy Garcia, Emalie McMahon, Colin Conwell et al.
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.
Multi-Label Test-Time Adaptation with Bound Entropy Minimization
Xiangyu Wu, Feng Yu, Yang Yang et al.
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Zhi Gao, Bofei Zhang, Pengxiang Li et al.
Multimodal Causal Reasoning for UAV Object Detection
Nianxin Li, Mao Ye, Lihua Zhou et al.
Multi-modal Knowledge Distillation-based Human Trajectory Forecasting
Jaewoo Jeong, Seohee Lee, Daehee Park et al.
Multimodal Prompt Alignment for Facial Expression Recognition
Fuyan Ma, Yiran He, Bin Sun et al.
Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Christopher Liao, Christian So, Theodoros Tsiligkaridis et al.
Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
Zhitao Zeng, Guojian Yuan, Junyuan Mao et al.
Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing
Jeongmin Yu, Susang Kim, Kisu Lee et al.
MUNBa: Machine Unlearning via Nash Bargaining
Jing Wu, Mehrtash Harandi
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
Rongchang Xie, Chen Du, Ping Song et al.
MVREC: A General Few-shot Defect Classification Model Using Multi-View Region-Context
Shuai Lyu, Rongchen Zhang, Zeqi Ma et al.
NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
Amirhossein Ansari, Ke Wang, Pulei Xiong
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
Boming Miao, Chunxiao Li, Xiaoxiao Wang et al.
Noise Matters: Optimizing Matching Noise for Diffusion Classifiers
Yanghao Wang, Long Chen
Noisy Test-Time Adaptation in Vision-Language Models
Chentao Cao, Zhun Zhong, Zhanke (Andrew) Zhou et al.
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Davide Berasi, Matteo Farina, Massimiliano Mancini et al.
NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI
Cosmin Bercea, Jun Li, Philipp Raffler et al.
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Le Yang, Ziwei Zheng, Boxu Chen et al.
Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Wei Suo, Lijun Zhang, Mengyang Sun et al.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou et al.
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang et al.
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Mingjie Pan, Jiyao Zhang, Tianshu Wu et al.
One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head
Junhao Xia, Haotian Zhu, Shuchao Pang et al.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Zheyu Zhang, Ziqi Pang, Shixing Chen et al.
OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang, Bowen Wang, Dunjie Lu et al.
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Yana Wei, Liang Zhao, Jianjian Sun et al.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Yihe Deng, Hritik Bansal, Fan Yin et al.
OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
Shiting (Ginny) Xiao, Rishabh Kabra, Yuhang Li et al.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.
OS-ATLAS: Foundation Action Model for Generalist GUI Agents
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu et al.
PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
Atharva Gundawar, Som Sagar, Ransalu Senanayake
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Navve Wasserman, Noam Rotstein, Roy Ganz et al.
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Qinghao Ye, Xianhan Zeng, Fu Li et al.