Mitigate the Gap: Improving Cross-Modal Alignment in CLIP

14 citations · #813 of 3,827 papers in ICLR 2025

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to exhibit a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with the two modalities densely concentrated in distinct subregions of the hypersphere. In this work, we propose AlignCLIP to improve the alignment between text and image embeddings and thereby reduce the modality gap. By sharing the learnable parameters between the modality encoders and applying a semantically regularized separation objective to the uni-modal embeddings, AlignCLIP increases cross-modal alignment and yields gains across several zero-shot and fine-tuning downstream evaluations. The source code and model checkpoints for reproducing our experiments are available at https://github.com/sarahESL/AlignCLIP.
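To make the two ingredients named in the abstract concrete, the sketch below shows a minimal PyTorch interpretation: a single transformer encoder shared by both modalities (only the input projections stay modality-specific), and a CLIP-style contrastive loss augmented with an intra-modal separation term that is down-weighted for semantically similar captions. This is an illustrative assumption of how such an objective could look, not the authors' implementation; names such as `SharedTowerCLIP`, `alignclip_style_loss`, and `sep_weight` are hypothetical, and the released code at the repository above should be consulted for the exact method.

```python
import torch
import torch.nn.functional as F
from torch import nn


class SharedTowerCLIP(nn.Module):
    """Hypothetical dual-tower model in which both modalities pass through the
    same transformer encoder; only the input projections differ."""

    def __init__(self, dim=256, vocab=49408, patches=49, seq_len=32, layers=4):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)      # precomputed ViT patch features -> dim
        self.tok_emb = nn.Embedding(vocab, dim)    # text token embedding
        self.img_pos = nn.Parameter(torch.randn(1, patches, dim) * 0.02)
        self.txt_pos = nn.Parameter(torch.randn(1, seq_len, dim) * 0.02)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def encode_image(self, patch_feats):
        x = self.shared_encoder(self.patch_proj(patch_feats) + self.img_pos)
        return F.normalize(x.mean(dim=1), dim=-1)  # pooled, unit-norm embedding

    def encode_text(self, token_ids):
        x = self.shared_encoder(self.tok_emb(token_ids) + self.txt_pos)
        return F.normalize(x.mean(dim=1), dim=-1)


def alignclip_style_loss(img, txt, sep_weight=0.5):
    """CLIP contrastive loss plus a semantically regularized separation term:
    image embeddings of different samples are pushed apart, but less strongly
    when their captions are similar. `sep_weight` is an assumed hyperparameter."""
    logits = img @ txt.t()                          # cosine similarities (unit-norm inputs)
    labels = torch.arange(img.size(0), device=img.device)
    scale = 100.0                                   # fixed temperature for this sketch
    contrastive = (F.cross_entropy(scale * logits, labels)
                   + F.cross_entropy(scale * logits.t(), labels)) / 2

    # Intra-modal separation on image embeddings, weighted by caption dissimilarity.
    txt_sim = (txt @ txt.t()).clamp(min=0)          # proxy for semantic caption similarity
    img_sim = img @ img.t()
    off_diag = ~torch.eye(img.size(0), dtype=torch.bool, device=img.device)
    separation = ((1 - txt_sim) * img_sim)[off_diag].mean()
    return contrastive + sep_weight * separation


if __name__ == "__main__":
    model = SharedTowerCLIP()
    patch_feats = torch.randn(8, 49, 768)           # stand-in for ViT patch features
    token_ids = torch.randint(0, 49408, (8, 32))
    loss = alignclip_style_loss(model.encode_image(patch_feats),
                                model.encode_text(token_ids))
    loss.backward()
    print(float(loss))
```

The separation term illustrates the intuition of a "semantically regularized" uni-modal objective: pairs with dissimilar captions are penalized for having similar image embeddings, while pairs with near-identical captions are left largely untouched, which spreads embeddings over the hypersphere without tearing apart semantically related samples.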

Citation History

Jan 26, 2026: 0
Jan 27, 2026: 0
Feb 1, 2026: 14