TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models

0 Citations · Ranked #2007 of 3827 papers in ICLR 2025 · 7 Authors · 4 Data Points

Abstract

How humans can effectively and efficiently acquire images is a perennial question. A classic solution is text-to-image retrieval from an existing database; however, a fixed database typically lacks creativity. By contrast, recent breakthroughs in text-to-image generation have made it possible to produce attractive and counterfactual visual content, but generation struggles to synthesize knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval and propose a unified framework for both tasks with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient, training-free generative retrieval method for text-to-image retrieval. We then unify generation and retrieval autoregressively and propose an autonomous decision mechanism that chooses the better match between the generated and retrieved images as the response to the text prompt. To standardize the evaluation of unified text-to-image generation and retrieval, we construct TIGeR-Bench, a benchmark spanning both creative and knowledge-intensive domains. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority of the proposed framework.
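The decision mechanism described above can be illustrated with a minimal control-flow sketch: produce one candidate by generation and one by retrieval, then return whichever the model scores as the better match for the prompt. The callables below (`generate_image`, `retrieve_image`, `match_score`) are hypothetical scaffolding, not the paper's API; per the abstract, the actual framework derives all three roles from a single LMM, whereas this sketch only fixes the selection logic.

```python
from typing import Any, Callable

def answer_prompt(
    prompt: str,
    generate_image: Callable[[str], Any],      # hypothetical: text-to-image generator
    retrieve_image: Callable[[str], Any],      # hypothetical: top-1 retrieval from a database
    match_score: Callable[[str, Any], float],  # hypothetical: text-image matching score
) -> Any:
    """Autonomous decision step as sketched from the abstract:
    return the better-matched of the generated and retrieved images."""
    generated = generate_image(prompt)
    retrieved = retrieve_image(prompt)
    # Keep whichever candidate scores as the better match for the prompt.
    return max((generated, retrieved), key=lambda img: match_score(prompt, img))
```

In the paper's setting one would expect the retrieved candidate to win on knowledge-intensive prompts and the generated one on creative prompts; the stubs here are left injectable so any scoring function can be substituted.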

Citation History

Jan 25, 2026: 0
Jan 26, 2026: 0
Jan 26, 2026: 0
Jan 28, 2026: 0