TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models

0 Citations · Ranked #2007 of 3827 papers in ICLR 2025 · 7 Authors · 4 Data Points

Abstract

How humans can effectively and efficiently acquire images is a perennial question. A classic solution is text-to-image retrieval from an existing database; however, a fixed database typically lacks creativity. By contrast, recent breakthroughs in text-to-image generation have made it possible to produce attractive and counterfactual visual content, but generation struggles to synthesize knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval and propose a unified framework for both tasks with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient, training-free generative retrieval method for text-to-image retrieval. We then unify generation and retrieval autoregressively and propose an autonomous decision mechanism that chooses the better match between the generated and retrieved images as the response to the text prompt. To standardize the evaluation of unified text-to-image generation and retrieval, we construct TIGeR-Bench, a benchmark spanning both creative and knowledge-intensive domains. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority of the proposed framework.
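The decision mechanism described above can be illustrated with a minimal control-flow sketch: produce one candidate by generation and one by retrieval, then return whichever the model scores as the better match for the prompt. The callables below (`generate_image`, `retrieve_image`, `match_score`) are hypothetical scaffolding, not the paper's API; per the abstract, the actual framework derives all three roles from a single LMM, whereas this sketch only fixes the selection logic.

```python
from typing import Any, Callable

def answer_prompt(
    prompt: str,
    generate_image: Callable[[str], Any],      # hypothetical: text-to-image generator
    retrieve_image: Callable[[str], Any],      # hypothetical: top-1 retrieval from a database
    match_score: Callable[[str, Any], float],  # hypothetical: text-image matching score
) -> Any:
    """Autonomous decision step as sketched from the abstract:
    return the better-matched of the generated and retrieved images."""
    generated = generate_image(prompt)
    retrieved = retrieve_image(prompt)
    # Keep whichever candidate scores as the better match for the prompt.
    return max((generated, retrieved), key=lambda img: match_score(prompt, img))
```

In the paper's setting one would expect the retrieved candidate to win on knowledge-intensive prompts and the generated one on creative prompts; the stubs here are left injectable so any scoring function can be substituted.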

Citation History

Jan 25, 2026: 0
Jan 26, 2026: 0
Jan 26, 2026: 0
Jan 28, 2026: 0