NEURIPS "vision-language models" Papers

131 papers found • Page 2 of 3

HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

Zelin Peng, Zhengqin Xu, Qingyang Liu et al.

NEURIPS 2025 oral · arXiv:2510.20322
1 citation

iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

Manyi Yao, Bingbing Zhuang, Sparsh Garg et al.

NEURIPS 2025 poster · arXiv:2509.19552
1 citation

Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination

Rakshit Trivedi, Kartik Sharma, David Parkes

NEURIPS 2025 oral

Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Jiaye Qian, Ge Zheng, Yuchen Zhu et al.

NEURIPS 2025 poster · arXiv:2511.17254
2 citations

LaViDa: A Large Diffusion Model for Vision-Language Understanding

Shufan Li, Konstantinos Kallidromitis, Hritik Bansal et al.

NEURIPS 2025 spotlight

Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization

Hao Zheng, Jingjun Yi, Qi Bi et al.

NEURIPS 2025 poster

LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery

Jerome Quenum, Wen-Han Hsieh, Tsung-Han (Patrick) Wu et al.

NEURIPS 2025 poster · arXiv:2505.02829
4 citations

LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Weijia Shi, Xiaochuang Han, Chunting Zhou et al.

NEURIPS 2025 poster · arXiv:2412.15188
79 citations

LOMIA: Label-Only Membership Inference Attacks against Pre-trained Large Vision-Language Models

Yihao Liu, Xinqi Lyu, Dong Wang et al.

NEURIPS 2025 poster

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

Zhenpeng Huang, Jiaqi Li, Zihan Jia et al.

NEURIPS 2025 poster · arXiv:2602.02341

MemEIC: A Step Toward Continual and Compositional Knowledge Editing

Jin Seong, Jiyun Park, Wencke Liermann et al.

NEURIPS 2025 poster · arXiv:2510.25798

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Xi Chen, Mingkang Zhu, Shaoteng Liu et al.

NEURIPS 2025 poster · arXiv:2506.22434

Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

Wenxuan Bao, Ruxi Deng, Jingrui He

NEURIPS 2025 poster · arXiv:2510.22127
1 citation

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen, Renrui Zhang, Dongzhi Jiang et al.

NEURIPS 2025 poster · arXiv:2506.05331
22 citations

MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents

Lukas Aichberger, Alasdair Paren, Guohao Li et al.

NEURIPS 2025 poster · arXiv:2503.10809
10 citations

MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes

Jin Zhang, Ruiheng Zhang, Zhe Cao et al.

NEURIPS 2025 poster

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.

NEURIPS 2025 poster · arXiv:2510.26937

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying, Ruiping Liu, Chongyan Chen et al.

NEURIPS 2025 poster · arXiv:2510.11520

Multimodal Causal Reasoning for UAV Object Detection

Nianxin Li, Mao Ye, Lihua Zhou et al.

NEURIPS 2025 poster

Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Zhitao Zeng, Guojian Yuan, Junyuan Mao et al.

NEURIPS 2025 oral · arXiv:2509.17429

Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

Yanghao Wang, Long Chen

NEURIPS 2025 poster · arXiv:2508.11330
2 citations

NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin Bercea, Jun Li, Philipp Raffler et al.

NEURIPS 2025 oral

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

Junhao Xia, Haotian Zhu, Shuchao Pang et al.

NEURIPS 2025 poster

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Zheyu Zhang, Ziqi Pang, Shixing Chen et al.

NEURIPS 2025 oral

OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang, Bowen Wang, Dunjie Lu et al.

NEURIPS 2025 spotlight · arXiv:2508.09123
31 citations

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei, Liang Zhao, Jianjian Sun et al.

NEURIPS 2025 poster · arXiv:2507.05255
14 citations

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Yihe Deng, Hritik Bansal, Fan Yin et al.

NEURIPS 2025 poster · arXiv:2503.17352
15 citations

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Shiting (Ginny) Xiao, Rishabh Kabra, Yuhang Li et al.

NEURIPS 2025 spotlight · arXiv:2507.05427
2 citations

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Atharva Gundawar, Som Sagar, Ransalu Senanayake

NEURIPS 2025 poster · arXiv:2506.23725
3 citations

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NEURIPS 2025 oral · arXiv:2504.13180
40 citations

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

Ao Wang, Hui Chen, Jianchao Tan et al.

NEURIPS 2025 poster · arXiv:2412.03409
5 citations

PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models

Ruiqi Wang, Dezhong Zhao, Ziqin Yuan et al.

NEURIPS 2025 oral · arXiv:2509.15607

QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

Yutong Wang, Haiyu Wang, Sai Qian Zhang

NEURIPS 2025 spotlight · arXiv:2510.16292
1 citation

Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li et al.

NEURIPS 2025 poster · arXiv:2503.00743
9 citations

QuARI: Query Adaptive Retrieval Improvement

Eric Xing, Abby Stylianou, Robert Pless et al.

NEURIPS 2025 poster · arXiv:2505.21647

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NEURIPS 2025 poster · arXiv:2506.01300
28 citations

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Juan Rodriguez, Haotian Zhang, Abhay Puri et al.

NEURIPS 2025 poster · arXiv:2505.20793
6 citations

Revisiting Logit Distributions for Reliable Out-of-Distribution Detection

Jiachen Liang, RuiBing Hou, Minyang Hu et al.

NEURIPS 2025 poster · arXiv:2510.20134
1 citation

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Matvei Popov, Peter Robicheaux, Anish Madan et al.

NEURIPS 2025 poster · arXiv:2505.20612
16 citations

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou, Jingkun An, Cheng Chi et al.

NEURIPS 2025 poster · arXiv:2506.04308
51 citations

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Dongyoung Kim, Huiwon Jang, Sumin Park et al.

NEURIPS 2025 poster · arXiv:2506.00070
9 citations

RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills

Chunru Lin, Haotian Yuan, Yian Wang et al.

NEURIPS 2025 poster · arXiv:2506.14763
2 citations

Robust SuperAlignment: Weak-to-Strong Robustness Generalization for Vision-Language Models

Junhao Dong, Cong Zhang, Xinghua Qu et al.

NEURIPS 2025 spotlight

ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Philip Schroeder, Ondrej Biza, Thomas Weng et al.

NEURIPS 2025 oral · arXiv:2508.01943

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang et al.

NEURIPS 2025 oral · arXiv:2509.01907
2 citations

SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yuhuan Yang, Chaofan Ma et al.

NEURIPS 2025 poster · arXiv:2510.10160

Scaling RL to Long Videos

Yukang Chen, Wei Huang, Baifeng Shi et al.

NEURIPS 2025 poster · arXiv:2507.07966
38 citations

Selftok-Zero: Reinforcement Learning for Visual Generation via Discrete and Autoregressive Visual Tokens

Bohan Wang, Mingze Zhou, Zhongqi Yue et al.

NEURIPS 2025 poster

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Yi Ding, Ruqi Zhang

NEURIPS 2025 poster · arXiv:2505.22651
7 citations

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Hongbo Liu, Jingwen He, Yi Jin et al.

NEURIPS 2025 poster · arXiv:2506.21356
7 citations