NeurIPS "vision-language models" Papers
131 papers found • Page 2 of 3
HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models
Zelin Peng, Zhengqin Xu, Qingyang Liu et al.
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Manyi Yao, Bingbing Zhuang, Sparsh Garg et al.
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI Coordination
Rakshit Trivedi, Kartik Sharma, David Parkes
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Jiaye Qian, Ge Zheng, Yuchen Zhu et al.
LaViDa: A Large Diffusion Model for Vision-Language Understanding
Shufan Li, Konstantinos Kallidromitis, Hritik Bansal et al.
Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization
Hao Zheng, Jingjun Yi, Qi Bi et al.
LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery
Jerome Quenum, Wen-Han Hsieh, Tsung-Han (Patrick) Wu et al.
LMFusion: Adapting Pretrained Language Models for Multimodal Generation
Weijia Shi, Xiaochuang Han, Chunting Zhou et al.
LOMIA: Label-Only Membership Inference Attacks against Pre-trained Large Vision-Language Models
Yihao Liu, Xinqi Lyu, Dong Wang et al.
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Zhenpeng Huang, Jiaqi Li, Zihan Jia et al.
MemEIC: A Step Toward Continual and Compositional Knowledge Editing
Jin Seong, Jiyun Park, Wencke Liermann et al.
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Xi Chen, Mingkang Zhu, Shaoteng Liu et al.
Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions
Wenxuan Bao, Ruxi Deng, Jingrui He
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Xinyan Chen, Renrui Zhang, Dongzhi Jiang et al.
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Lukas Aichberger, Alasdair Paren, Guohao Li et al.
MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes
Jin Zhang, Ruiheng Zhang, Zhe Cao et al.
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Kedi Ying, Ruiping Liu, Chongyan Chen et al.
Multimodal Causal Reasoning for UAV Object Detection
Nianxin Li, Mao Ye, Lihua Zhou et al.
Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
Zhitao Zeng, Guojian Yuan, Junyuan Mao et al.
Noise Matters: Optimizing Matching Noise for Diffusion Classifiers
Yanghao Wang, Long Chen
NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI
Cosmin Bercea, Jun Li, Philipp Raffler et al.
One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head
Junhao Xia, Haotian Zhu, Shuchao Pang et al.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Zheyu Zhang, Ziqi Pang, Shixing Chen et al.
OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang, Bowen Wang, Dunjie Lu et al.
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Yana Wei, Liang Zhao, Jianjian Sun et al.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Yihe Deng, Hritik Bansal, Fan Yin et al.
OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
Shiting (Ginny) Xiao, Rishabh Kabra, Yuhang Li et al.
PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
Atharva Gundawar, Som Sagar, Ransalu Senanayake
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang, Hui Chen, Jianchao Tan et al.
PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models
Ruiqi Wang, Dezhong Zhao, Ziqin Yuan et al.
QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models
Yutong Wang, Haiyu Wang, Sai Qian Zhang
Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models
Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li et al.
QuARI: Query Adaptive Retrieval Improvement
Eric Xing, Abby Stylianou, Robert Pless et al.
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
Yiyang Zhou, Yangfan He, Yaofeng Su et al.
Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Juan Rodriguez, Haotian Zhang, Abhay Puri et al.
Revisiting Logit Distributions for Reliable Out-of-Distribution Detection
Jiachen Liang, RuiBing Hou, Minyang Hu et al.
Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Matvei Popov, Peter Robicheaux, Anish Madan et al.
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Enshen Zhou, Jingkun An, Cheng Chi et al.
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
Dongyoung Kim, Huiwon Jang, Sumin Park et al.
RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills
Chunru Lin, Haotian Yuan, Yian Wang et al.
Robust SuperAlignment: Weak-to-Strong Robustness Generalization for Vision-Language Models
Junhao Dong, Cong Zhang, Xinghua Qu et al.
ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
Philip Schroeder, Ondrej Biza, Thomas Weng et al.
RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
Zhenyuan Chen, Chenxi Wang, Ningyu Zhang et al.
SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
Zhenjie Mao, Yuhuan Yang, Chaofan Ma et al.
Scaling RL to Long Videos
Yukang Chen, Wei Huang, Baifeng Shi et al.
Selftok-Zero: Reinforcement Learning for Visual Generation via Discrete and Autoregressive Visual Tokens
Bohan Wang, Mingze Zhou, Zhongqi Yue et al.
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Yi Ding, Ruqi Zhang
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
Hongbo Liu, Jingwen He, Yi Jin et al.