Phased Training for LLM-powered Text Retrieval Models Beyond Data Scaling
Abstract
Current efforts to build general-purpose text retrieval models based on large language models (LLMs) focus primarily on architectural design and training data scaling. However, significant challenges remain in effectively modeling diverse retrieval tasks and domains, including multi-task conflict, data imbalance, and training efficiency. To address these challenges, we propose a novel phased training framework for text retrieval, featuring: (1) robust foundation modeling with core relevance data, (2) progressive specialization through modular task adaptation, and (3) knowledge fusion via weight-interpolation-based model merging. This framework simultaneously optimizes both embedding and reranking models through a unified architecture. We also present an efficient and scalable data synthesis pipeline, built on open-source LLMs, to expand the training data. The synthetic data can be efficiently incorporated into the phased training framework, further enhancing model performance. We identify five distinct types of retrieval tasks, i.e., basic relevance retrieval, code retrieval, tool retrieval, complex instruction-based retrieval, and reasoning-intensive retrieval, and conduct extensive experiments on them. Our method achieves the best performance on MTEB and on retrieval benchmarks covering all five task types. Further analysis demonstrates the effectiveness and efficiency of the proposed training framework and data synthesis pipeline.
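To make the knowledge-fusion step concrete, the sketch below shows one common form of weight-interpolation-based model merging: linearly averaging the parameters of task-specialized checkpoints that share the same architecture. This is a minimal illustration, not the paper's exact procedure; the checkpoint file names and the equal interpolation coefficients are hypothetical placeholders.

```python
import torch

def merge_by_weight_interpolation(state_dicts, coefficients):
    """Merge task-specialized checkpoints of the same architecture by
    linearly interpolating their parameters (coefficients should sum to 1)."""
    assert len(state_dicts) == len(coefficients)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(c * sd[name].float()
                           for sd, c in zip(state_dicts, coefficients))
    return merged

# Illustrative usage: fuse two hypothetical specialists (e.g., one adapted
# for code retrieval and one for tool retrieval) with equal weights.
specialists = [torch.load("code_retrieval.pt"), torch.load("tool_retrieval.pt")]
merged_state = merge_by_weight_interpolation(specialists, coefficients=[0.5, 0.5])
```

In practice, the interpolation coefficients can be tuned on a held-out validation set to balance the strengths of each specialist.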