Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

1 citation

arXiv:2512.20174

Citations: #1086 in CVPR 2025 of 2873 papers

Authors

Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng, Yubo Li, Hailun Lin

Topics

document image retrieval, vision-language models, visual document understanding, natural language queries, contrastive learning, zero-shot evaluation, OCR-free models, semantic retrieval

Abstract

Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. The dataset and code will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
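The abstract does not say how the two stages are built, so the following is only an illustrative sketch of one common coarse-to-fine pattern, not the authors' method. It assumes a cheap first stage that ranks the gallery by cosine similarity of single global embeddings, followed by a costlier token-level rerank of the top-k candidates (a late-interaction score, chosen here purely as an assumption). The function names, the toy vectors, and the Recall@K metric are all hypothetical; the benchmark's own evaluation metrics are not given in this excerpt.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_recall(query_vec, gallery_vecs, k):
    """Stage 1 (assumed): rank the gallery by global-embedding similarity, keep top-k indices."""
    order = sorted(range(len(gallery_vecs)),
                   key=lambda i: cosine(query_vec, gallery_vecs[i]),
                   reverse=True)
    return order[:k]

def late_interaction(query_tokens, doc_tokens):
    """Stage 2 score (assumed): for each query token, take its best-matching document token."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

def two_stage_retrieve(query_vec, query_tokens, gallery_vecs, gallery_tokens, k):
    """Rerank only the k coarse candidates with the finer token-level score."""
    candidates = coarse_recall(query_vec, gallery_vecs, k)
    return sorted(candidates,
                  key=lambda i: late_interaction(query_tokens, gallery_tokens[i]),
                  reverse=True)

def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target document index appears in the top-k ranking."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)

# Toy data (hypothetical): three documents with 2-d global and token embeddings.
gallery_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
gallery_tokens = [[[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]
query_vec = [1.0, 0.05]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

coarse = coarse_recall(query_vec, gallery_vecs, k=2)
final = two_stage_retrieve(query_vec, query_tokens, gallery_vecs, gallery_tokens, k=2)
score = recall_at_k([final], targets=[1], k=1)
```

In this toy setup the coarse stage prefers document 0, while the token-level rerank promotes document 1 because it matches both query tokens. Only the k shortlisted documents pay the token-level cost, which is the usual motivation for a two-stage design.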
