NeurIPS Poster "vision-language models" Papers
102 papers found • Page 1 of 3
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Hongyuan Dong, Dingkang Yang, Xiao Liang et al.
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg, Naman Deep Singh, Matthias Hein
AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark
Aruna Gauba, Irene Pi, Yunze Man et al.
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen et al.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang et al.
A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection
Gaku Morio, Harri Rowlands, Dominik Stammbach et al.
An Information-theoretical Framework for Understanding Out-of-distribution Detection with Pretrained Vision-Language Models
Bo Peng, Jie Lu, Guangquan Zhang et al.
Attention! Your Vision Language Model Could Be Maliciously Manipulated
Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.
Automated Model Discovery via Multi-modal & Multi-step Pipeline
Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin et al.
BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation
Zibo Zhou, Yue Hu, Lingkai Zhang et al.
Bridging the gap to real-world language-grounded visual concept learning
Whie Jung, Semin Kim, Junee Kim et al.
CF-VLM: CounterFactual Vision-Language Fine-tuning
Jusheng Zhang, Kaitong Cai, Yijia Fan et al.
Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
CURV: Coherent Uncertainty-Aware Reasoning in Vision-Language Models for X-Ray Report Generation
Ziao Wang, Sixing Yan, Kejing Yin et al.
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Tao Zhang, Cheng Da, Kun Ding et al.
Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman et al.
Do LVLMs Truly Understand Video Anomalies? Revealing Hallucination via Co-Occurrence Patterns
Menghao Zhang, Huazheng Wang, Pengfei Ren et al.
DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
Tianhong Zhou, Xu Yin, Yingtao Zhu et al.
DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models
Fayi Le, Wenwu He, Chentao Cao et al.
DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong et al.
EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.
Each Complexity Deserves a Pruning Policy
Hanshi Wang, Yuhao Xu, Zekun Xu et al.
EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby et al.
Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions
Jihoon Kwon, Kyle Min, Jy-yong Sohn
Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding
Yixiong Fang, Ziran Yang, Zhaorun Chen et al.
Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
Yiwei Yang, Chung Peng Lee, Shangbin Feng et al.
EvolvedGRPO: Unlocking Reasoning in LVLMs via Progressive Instruction Evolution
Zhebei Shen, Qifan Yu, Juncheng Li et al.
Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere
Li Ju, Max Andersson, Stina Fredriksson et al.
FedMGP: Personalized Federated Learning with Multi-Group Text-Visual Prompts
Weihao Bo, Yanpeng Sun, Yu Wang et al.
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Yifan Shen, Yuanzhe Liu, Jingyuan Zhu et al.
FlySearch: Exploring how vision-language models explore
Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz et al.
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep S Lubana et al.
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge et al.
GenIR: Generative Visual Feedback for Mental Image Retrieval
Diji Yang, Minghao Liu, Chung-Hsiang Lo et al.
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
Pengyue Jia, Seongheon Park, Song Gao et al.
Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression
Juan Chen, Honglin Liu, Yingying Ao et al.
GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
Seongheon Park, Sharon Li
GoalLadder: Incremental Goal Discovery with Vision-Language Models
Alexey Zakharov, Shimon Whiteson
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang, Changle Zhou, Jiawei Kong et al.
GTR-Loc: Geospatial Text Regularization Assisted Outdoor LiDAR Localization
Shangshu Yu, Wen Li, Xiaotian Sun et al.
HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
Han Liu, Jiaqi Li, Zhi Xu et al.
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Manyi Yao, Bingbing Zhuang, Sparsh Garg et al.
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Jiaye Qian, Ge Zheng, Yuchen Zhu et al.
Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization
Hao Zheng, Jingjun Yi, Qi Bi et al.
LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery
Jerome Quenum, Wen-Han Hsieh, Tsung-Han (Patrick) Wu et al.
LMFusion: Adapting Pretrained Language Models for Multimodal Generation
Weijia Shi, Xiaochuang Han, Chunting Zhou et al.
LOMIA: Label-Only Membership Inference Attacks against Pre-trained Large Vision-Language Models
Yihao Liu, Xinqi Lyu, Dong Wang et al.
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Zhenpeng Huang, Jiaqi Li, Zihan Jia et al.
MemEIC: A Step Toward Continual and Compositional Knowledge Editing
Jin Seong, Jiyun Park, Wencke Liermann et al.