ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability


arXiv:2410.11414


#2434 of 3827 papers in ICLR 2025

Top Authors

Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, Han Li

Abstract

Retrieval-Augmented Generation (RAG) models are designed to incorporate external knowledge, reducing hallucinations caused by insufficient parametric (internal) knowledge. However, even with accurate and relevant retrieved content, RAG models can still hallucinate by generating outputs that conflict with the retrieved information. Detecting such hallucinations requires disentangling how Large Language Models (LLMs) balance external and parametric knowledge. Current detection methods often focus on only one of these mechanisms or fail to decouple their intertwined effects, making accurate detection difficult. In this paper, we investigate the internal mechanisms behind hallucinations in RAG scenarios. We discover that hallucinations occur when the Knowledge FFNs in LLMs overemphasize parametric knowledge in the residual stream, while Copying Heads fail to effectively retain or integrate external knowledge from retrieved content. Based on these findings, we propose ReDeEP, a novel method that detects hallucinations by decoupling the LLM’s utilization of external context and parametric knowledge. Our experiments show that ReDeEP significantly improves RAG hallucination detection accuracy. Additionally, we introduce AARF, which mitigates hallucinations by modulating the contributions of Knowledge FFNs and Copying Heads.
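
To make the idea of "decoupling" more concrete, the following is a minimal sketch of how two separately measured signals might be combined into a single hallucination score. It is not the ReDeEP implementation: the `TokenScores` type, the `hallucination_score` function, the linear combination, and the weights `alpha` and `beta` are all assumptions introduced for illustration, and the sketch takes the per-token scores as given rather than computing them from model internals.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TokenScores:
    """Hypothetical per-token signals (how they are measured is assumed, not shown)."""
    external_context: float      # assumed: how much the token draws on retrieved content
    parametric_knowledge: float  # assumed: how much FFN layers inject internal knowledge


def hallucination_score(
    tokens: Sequence[TokenScores],
    alpha: float = 1.0,  # assumed weight on parametric knowledge
    beta: float = 1.0,   # assumed weight on external context
) -> float:
    """Average a simple linear combination of the two signals over a response.

    Higher values mean the response leans on parametric knowledge more than
    on the retrieved context, which the abstract associates with hallucination.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = sum(
        alpha * t.parametric_knowledge - beta * t.external_context
        for t in tokens
    )
    return total / len(tokens)


# Toy inputs: one response that mostly copies from context, one that does not.
grounded = [TokenScores(0.9, 0.1), TokenScores(0.8, 0.2)]
ungrounded = [TokenScores(0.1, 0.9), TokenScores(0.2, 0.8)]

print(hallucination_score(grounded) < hallucination_score(ungrounded))
```

Keeping the two signals as separate fields, rather than collapsing them early, mirrors the abstract's point that methods which do not separate the two mechanisms find accurate detection difficult. In practice the weights would have to be fit on labeled data.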
