NeurIPS "vision-language models" Papers

42 papers found

AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

Hongyuan Dong, Dingkang Yang, Xiao Liang et al.

NeurIPS 2025posterarXiv:2506.13274
3
citations

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Hua Ye, Hang Ding, Siyuan Chen et al.

NeurIPS 2025posterarXiv:2511.08399

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.

NeurIPS 2025posterarXiv:2502.01341

AmorLIP: Efficient Language-Image Pretraining via Amortization

Haotian Sun, Yitong Li, Yuchen Zhuang et al.

NeurIPS 2025posterarXiv:2505.18983
2
citations

Attention! Your Vision Language Model Could Be Maliciously Manipulated

Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.

NeurIPS 2025posterarXiv:2505.19911
3
citations

CrypticBio: A Large Multimodal Dataset for Visually Confusing Species

Georgiana Manolache, Gerard Schouten, Joaquin Vanschoren

NeurIPS 2025oral

Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations

Tal Barami, Nimrod Berman, Ilan Naiman et al.

NeurIPS 2025posterarXiv:2510.17313
2
citations

DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models

Fayi Le, Wenwu He, Chentao Cao et al.

NeurIPS 2025poster

EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.

NeurIPS 2025posterarXiv:2510.25146
1
citations

Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

Jihoon Kwon, Kyle Min, Jy-yong Sohn

NeurIPS 2025posterarXiv:2510.16540

FlySearch: Exploring how vision-language models explore

Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz et al.

NeurIPS 2025posterarXiv:2506.02896
3
citations

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge et al.

NeurIPS 2025posterarXiv:2504.13169
10
citations

GenIR: Generative Visual Feedback for Mental Image Retrieval

Diji Yang, Minghao Liu, Chung-Hsiang Lo et al.

NeurIPS 2025posterarXiv:2506.06220

GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Pengyue Jia, Seongheon Park, Song Gao et al.

NeurIPS 2025posterarXiv:2505.13731
3
citations

Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression

Juan Chen, Honglin liu, Yingying Ao et al.

NeurIPS 2025poster

Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Hao Fang, Changle Zhou, Jiawei Kong et al.

NeurIPS 2025posterarXiv:2505.19678
6
citations

GTR-Loc: Geospatial Text Regularization Assisted Outdoor LiDAR Localization

Shangshu Yu, Wen Li, Xiaotian Sun et al.

NeurIPS 2025poster

HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

Zelin Peng, Zhengqin Xu, Qingyang Liu et al.

NeurIPS 2025oralarXiv:2510.20322
1
citations

Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization

Hao Zheng, Jingjun Yi, Qi Bi et al.

NeurIPS 2025poster

LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Weijia Shi, Xiaochuang Han, Chunting Zhou et al.

NeurIPS 2025posterarXiv:2412.15188
79
citations

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

Zhenpeng Huang, Jiaqi Li, zihan jia et al.

NeurIPS 2025poster

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Xi Chen, Mingkang Zhu, Shaoteng Liu et al.

NeurIPS 2025posterarXiv:2506.22434

MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents

Lukas Aichberger, Alasdair Paren, Guohao Li et al.

NeurIPS 2025posterarXiv:2503.10809
10
citations

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying, Ruiping Liu, Chongyan Chen et al.

NeurIPS 2025posterarXiv:2510.11520

Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Zhitao Zeng, Guojian Yuan, Junyuan Mao et al.

NeurIPS 2025oralarXiv:2509.17429

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

Junhao Xia, Haotian Zhu, Shuchao Pang et al.

NeurIPS 2025poster

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Zheyu Zhang, Ziqi Pang, Shixing Chen et al.

NeurIPS 2025oral

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NeurIPS 2025oralarXiv:2504.13180
40
citations

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Dongyoung Kim, Huiwon Jang, Sumin Park et al.

NeurIPS 2025posterarXiv:2506.00070
9
citations

RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills

Chunru Lin, Haotian Yuan, Yian Wang et al.

NeurIPS 2025posterarXiv:2506.14763
2
citations

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang et al.

NeurIPS 2025oralarXiv:2509.01907
2
citations

SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yang Yuhuan, Chaofan Ma et al.

NeurIPS 2025posterarXiv:2510.10160

TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models

Hsin Yi Hsieh, Shang-Wei Liu, Chang-Chih Meng et al.

NeurIPS 2025poster

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Anurag Arnab, Ahmet Iscen, Mathilde Caron et al.

NeurIPS 2025oralarXiv:2507.02001
8
citations

Text to Sketch Generation with Multi-Styles

Tengjie Li, Shikui Tu, Lei Xu

NeurIPS 2025posterarXiv:2511.04123

Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation

jusheng zhang, Yijia Fan, Zimo Wen et al.

NeurIPS 2025poster

VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

Run Luo, Renke Shan, Longze Chen et al.

NeurIPS 2025poster

Vision-centric Token Compression in Large Language Model

Ling Xing, Alex Jinpeng Wang, Rui Yan et al.

NeurIPS 2025spotlightarXiv:2502.00791
7
citations

Vision‑Language‑Vision Auto‑Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.

NeurIPS 2025posterarXiv:2507.07104
2
citations

Vision Transformers Don't Need Trained Registers

Nicholas Jiang, Amil Dravid, Alexei Efros et al.

NeurIPS 2025spotlightarXiv:2506.08010
12
citations

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NeurIPS 2025posterarXiv:2509.15235
2
citations

Vocabulary-Guided Gait Recognition

Panjian Huang, Saihui Hou, Chunshui Cao et al.

NeurIPS 2025poster