SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

0 citations · #2434 in ICLR 2025 of 3,827 papers · 9 top authors · 4 data points

Abstract

Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs is inefficient, especially for lengthy documents. In this work, we present a novel framework named Self-Visual Retrieval-Augmented Generation (SV-RAG), which can broaden the horizons of any MLLM to support long-document understanding. We demonstrate that MLLMs themselves can serve as effective multimodal retrievers, fetching relevant pages and then answering user questions based on those pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.
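The retrieve-then-answer design described in the abstract can be sketched in a few lines of Python. The sketch below is illustrative only: the Page type, function names, and adapter interfaces are assumptions rather than the authors' implementation. It assumes the same MLLM is invoked through two LoRA adapters, one tuned to score page relevance and one tuned to answer from the retrieved evidence pages.

```python
# Minimal sketch of a two-stage SV-RAG-style pipeline (illustrative assumptions,
# not the authors' actual API).

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """One page of a long document, represented here by its rendered image path."""
    image_path: str


def sv_rag_answer(
    question: str,
    pages: List[Page],
    score_page: Callable[[str, Page], float],          # MLLM + retrieval LoRA adapter
    answer_from_pages: Callable[[str, List[Page]], str],  # MLLM + QA LoRA adapter
    top_k: int = 3,
) -> str:
    """Retrieve the top-k evidence pages with one adapter, then answer with the other."""
    # Stage 1: the MLLM, with a retrieval-tuned LoRA adapter, scores how relevant
    # each page image is to the question.
    ranked = sorted(pages, key=lambda p: score_page(question, p), reverse=True)
    evidence = ranked[:top_k]

    # Stage 2: swap to the QA-tuned LoRA adapter and answer using only the
    # retrieved evidence pages instead of the full document.
    return answer_from_pages(question, evidence)
```

The key point of the design is that no external document parser or separate retriever model is needed: both stages reuse the same MLLM backbone, with lightweight LoRA adapters specializing it for retrieval and for question answering.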

Citation History

Jan 25, 2026: 0
Jan 26, 2026: 0
Jan 26, 2026: 0
Jan 28, 2026: 0