Reproducible Vision-Language Models Meet Concepts Out of Pre-Training

Abstract

As a milestone of modern multimodal intelligence, Contrastive Language-Image Pre-training (CLIP) has attracted massive research interest in its generalization mechanism. Existing studies, however, remain limited to the scope of pre-training knowledge and hardly explain generalization to the countless open-world concepts absent from the pre-training regime. This paper studies this Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose the LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, with regard to OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis shows that although image features of OOP concepts are born with significant category margins, their zero-shot transfer fails markedly due to poor image-text alignment. To address this, we elaborate the "name-tuning" methodology and its theoretical merits for OOP generalization, then propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms to achieve OOP generalization in a data-efficient manner. LAION-Beyond dataset and codes: http://m-huangx.github.io/laion_beyond/.
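For context, below is a minimal sketch of the zero-shot transfer setting the abstract evaluates, using the OpenCLIP library with a LAION-trained checkpoint. The model/checkpoint names, image path, and class names are illustrative placeholders, not the paper's actual benchmark protocol; for OOP concepts, the text branch of this pipeline is where the abstract locates the alignment failure.

```python
# Sketch of CLIP zero-shot transfer with OpenCLIP (assumed setup, not the
# paper's exact evaluation code). Checkpoint and class names are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate concept names; for concepts absent from pre-training (OOP),
# the text embeddings below may align poorly with the image features.
class_names = ["golden retriever", "some-oop-concept"]  # hypothetical
texts = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # Zero-shot prediction = nearest class-name embedding in the joint space.
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

The abstract's "name-tuning" intervenes on exactly the text side of this pipeline: rather than fine-tuning the encoders, it learns better names (text embeddings) for OOP concepts so that the nearest-text decision rule above recovers the category margins already present in the image features.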
