"vision-language models" Papers

570 papers found • Page 6 of 12

PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation

Zhihao Zhu, Yifan Zheng, Siyu Pan et al.

ICCV 2025 • arXiv:2508.05976

PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

Seunggwan Lee, Hwanhee Jung, ByoungSoo Koh et al.

ICCV 2025 • arXiv:2503.12834
2 citations

Perception in Reflection

Yana Wei, Liang Zhao, Kangheng Lin et al.

ICML 2025 • arXiv:2504.07165
8 citations

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NEURIPS 2025 (oral) • arXiv:2504.13180
47 citations

Personalized Preference Fine-tuning of Diffusion Models

Meihua Dang, Anikait Singh, Linqi Zhou et al.

CVPR 2025 • arXiv:2501.06655
15 citations

Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong et al.

ICCV 2025 • arXiv:2412.08619
2 citations

PoisonedEye: Knowledge Poisoning Attack on Retrieval-Augmented Generation based Large Vision-Language Models

Chenyang Zhang, Xiaoyu Zhang, Jian Lou et al.

ICML 2025
3 citations

Position-Aware Guided Point Cloud Completion with CLIP Model

Feng Zhou, Qi Zhang, Ju Dai et al.

AAAI 2025 (paper) • arXiv:2412.08271
2 citations

Post-pre-training for Modality Alignment in Vision-Language Foundation Models

Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai et al.

CVPR 2025 • arXiv:2504.12717
12 citations

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

Ao Wang, Hui Chen, Jianchao Tan et al.

NEURIPS 2025 • arXiv:2412.03409
6 citations

PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models

Ruiqi Wang, Dezhong Zhao, Ziqin Yuan et al.

NEURIPS 2025 (oral) • arXiv:2509.15607

PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection

Mahdiyar Molahasani, Azadeh Motamedi, Michael Greenspan et al.

ICCV 2025 • arXiv:2507.08979
2 citations

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Linh Tran, Wei Sun, Stacy Patterson et al.

ICLR 2025 • arXiv:2501.13904
5 citations

Probabilistic Prototype Calibration of Vision-language Models for Generalized Few-shot Semantic Segmentation

Jie Liu, Jiayi Shen, Pan Zhou et al.

ICCV 2025 • arXiv:2506.22979
1 citation

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Quang-Hung Le, Long Hoang Dang, Ngan Hoang Le et al.

AAAI 2025 (paper) • arXiv:2412.08125
3 citations

ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models

Yassir Bendou, Amine Ouasfi, Vincent Gripon et al.

CVPR 2025 • arXiv:2501.11175
8 citations

Prompt as Knowledge Bank: Boost Vision-language model via Structural Representation for zero-shot medical detection

Yuguang Yang, Tongfei Chen, Haoyu Huang et al.

ICLR 2025 • arXiv:2502.16223
2 citations

Proxy Denoising for Source-Free Domain Adaptation

Song Tang, Wenxin Su, Yan Gan et al.

ICLR 2025 • arXiv:2406.01658
12 citations

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

Joey Hong, Anca Dragan, Sergey Levine

ICLR 2025 • arXiv:2411.05193
8 citations

QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

Yutong Wang, Haiyu Wang, Sai Qian Zhang

NEURIPS 2025 (spotlight) • arXiv:2510.16292
1 citation

Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li et al.

NEURIPS 2025 • arXiv:2503.00743
10 citations

QuARI: Query Adaptive Retrieval Improvement

Eric Xing, Abby Stylianou, Robert Pless et al.

NEURIPS 2025 • arXiv:2505.21647

Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification

Jiaxiang Gou, Luping Ji, Pei Liu et al.

AAAI 2025 (paper) • arXiv:2410.10573
9 citations

RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

Dongming Wu, Yanping Fu, Saike Huang et al.

ICCV 2025 • arXiv:2507.23734
5 citations

RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models

Youngjun Lee, Doyoung Kim, Junhyeok Kang et al.

ICLR 2025
5 citations

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NEURIPS 2025 • arXiv:2506.01300
29 citations

ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving

Yuhang Lu, Jiadong Tu, Yuexin Ma et al.

ICCV 2025 • arXiv:2507.12499
6 citations

Re-Aligning Language to Visual Objects with an Agentic Workflow

Yuming Chen, Jiangyan Feng, Haodong Zhang et al.

ICLR 2025 • arXiv:2503.23508
3 citations

Realistic Test-Time Adaptation of Vision-Language Models

Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer et al.

CVPR 2025 (highlight) • arXiv:2501.03729
6 citations

Reducing Hallucinations in Large Vision-Language Models via Latent Space Steering

Sheng Liu, Haotian Ye, James Y Zou

ICLR 2025
29 citations

Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation

Jihyo Kim, Seulbi Lee, Sangheum Hwang

ICLR 2025 • arXiv:2410.14975
3 citations

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Juan Rodriguez, Haotian Zhang, Abhay Puri et al.

NEURIPS 2025 • arXiv:2505.20793
9 citations

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Jinhong Deng, Yuhang Yang, Wen Li et al.

CVPR 2025 • arXiv:2411.15851
11 citations

Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector

Xiao Guo, Xiufeng Song, Yue Zhang et al.

CVPR 2025 • arXiv:2503.20188
26 citations

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu et al.

CVPR 2025 • arXiv:2411.14901
9 citations

Revisiting Logit Distributions for Reliable Out-of-Distribution Detection

Jiachen Liang, RuiBing Hou, Minyang Hu et al.

NEURIPS 2025 • arXiv:2510.20134
1 citation

ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones

Anurag Ghosh, Shen Zheng, Robert Tamburo et al.

ICCV 2025 • arXiv:2406.07661
11 citations

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Matvei Popov, Peter Robicheaux, Anish Madan et al.

NEURIPS 2025 • arXiv:2505.20612
17 citations

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Haifeng Huang, Xinyi Chen, Yilun Chen et al.

CVPR 2025 • arXiv:2504.21530
17 citations

RoboPearls: Editable Video Simulation for Robot Manipulation

Tao Tang, Likui Zhang, Youpeng Wen et al.

ICCV 2025 • arXiv:2506.22756
2 citations

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou, Jingkun An, Cheng Chi et al.

NEURIPS 2025 • arXiv:2506.04308
58 citations

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay et al.

CVPR 2025 • arXiv:2411.16537
90 citations

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Dongyoung Kim, Huiwon Jang, Sumin Park et al.

NEURIPS 2025 • arXiv:2506.00070
10 citations

RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills

Chunru Lin, Haotian Yuan, Yian Wang et al.

NEURIPS 2025 • arXiv:2506.14763
3 citations

Robust SuperAlignment: Weak-to-Strong Robustness Generalization for Vision-Language Models

Junhao Dong, Cong Zhang, Xinghua Qu et al.

NEURIPS 2025 (spotlight)

ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Philip Schroeder, Ondrej Biza, Thomas Weng et al.

NEURIPS 2025 (oral) • arXiv:2508.01943

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang et al.

NEURIPS 2025 (oral) • arXiv:2509.01907
2 citations

R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

Lijun Sheng, Jian Liang, Zilei Wang et al.

CVPR 2025 • arXiv:2504.11195
15 citations

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Jiawei Wang, Yushen Zuo, Yuanjun Chai et al.

ICCV 2025 • arXiv:2504.01308

SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yang Yuhuan, Chaofan Ma et al.

NEURIPS 2025 • arXiv:2510.10160