Tag2Text: Guiding Vision-Language Model via Image Tagging

0citations

arXiv:2303.05657 Project

Citations

#607

in ICLR 2024

of 2297 papers

Authors

Data Points

Authors

Xinyu Huang Youcai Zhang Jinyu Ma Weiwei Tian Rui Feng Yuejie Zhang Yaqian Li Yandong Guo Lei Zhang

Abstract

This paper presents Tag2Text, a vision language pre-training (VLP) framework, which introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works which utilize object tags either manually labeled or automatically detected with a limited detector, our approach utilizes tags parsed from its paired text to learn an image tagger and meanwhile provides guidance to vision-language models. Given that, Tag2Text can utilize large-scale annotation-free image tags in accordance with image-text pairs, and provides more diverse tag categories beyond objects. Strikingly, Tag2Text showcases the ability of a foundational image tagging model, with superior zero-shot performance even comparable to full supervision manner. Moreover, by leveraging tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art results with similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance.

Citation History

Jan 28, 2026