SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

0 citations · #2434 in ICLR 2025 of 3,827 papers · 9 top authors · 4 data points

Abstract

Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs is inefficient, especially for lengthy documents. In this work, we present a novel framework named Self-Visual Retrieval-Augmented Generation (SV-RAG), which can broaden the horizons of any MLLM to support long-document understanding. We demonstrate that MLLMs themselves can serve as effective multimodal retrievers, fetching relevant pages and then answering user questions based on those pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.
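The retrieve-then-answer design described in the abstract can be sketched in a few lines of Python. The sketch below is illustrative only: the Page type, function names, and adapter interfaces are assumptions rather than the authors' implementation. It assumes the same MLLM is invoked through two LoRA adapters, one tuned to score page relevance and one tuned to answer from the retrieved evidence pages.

```python
# Minimal sketch of a two-stage SV-RAG-style pipeline (illustrative assumptions,
# not the authors' actual API).

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """One page of a long document, represented here by its rendered image path."""
    image_path: str


def sv_rag_answer(
    question: str,
    pages: List[Page],
    score_page: Callable[[str, Page], float],          # MLLM + retrieval LoRA adapter
    answer_from_pages: Callable[[str, List[Page]], str],  # MLLM + QA LoRA adapter
    top_k: int = 3,
) -> str:
    """Retrieve the top-k evidence pages with one adapter, then answer with the other."""
    # Stage 1: the MLLM, with a retrieval-tuned LoRA adapter, scores how relevant
    # each page image is to the question.
    ranked = sorted(pages, key=lambda p: score_page(question, p), reverse=True)
    evidence = ranked[:top_k]

    # Stage 2: swap to the QA-tuned LoRA adapter and answer using only the
    # retrieved evidence pages instead of the full document.
    return answer_from_pages(question, evidence)
```

The key point of the design is that no external document parser or separate retriever model is needed: both stages reuse the same MLLM backbone, with lightweight LoRA adapters specializing it for retrieval and for question answering.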

Citation History

Jan 25, 2026: 0
Jan 26, 2026: 0
Jan 26, 2026: 0
Jan 28, 2026: 0