Unlearning-based Neural Interpretations

0 citations · #1954 of 3827 papers in ICLR 2025 · 3 authors · 3 data points

Abstract

Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions—constant mapping, averaging or blurring—inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution maps that are biased, fragile and manipulable. Departing from the static approach, we propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent. Our method discovers reliable baselines and succeeds in erasing salient features, which in turn locally smooths the high-curvature decision boundaries. Our analyses point to unlearning as a promising avenue for generating faithful, efficient and robust interpretations.
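To make the idea concrete, below is a minimal PyTorch sketch of the kind of procedure the abstract describes: instead of a static black, mean or blurred image, the baseline is obtained by ascending the loss surface from the input, erasing the features the model relies on, and is then plugged into a path-integral attribution such as integrated gradients. This is an illustration under assumptions, not the paper's implementation: the function names (`uni_baseline`, `attribute`), the sign-of-gradient ascent step, the choice of integrated gradients as the downstream attribution, and all hyperparameters (`steps`, `step_size`, `n_steps`) are hypothetical stand-ins for details specified in the paper.

```python
import torch
import torch.nn.functional as F

def uni_baseline(model, x, target, steps=20, step_size=1e-2):
    """Sketch of an unlearning-style baseline: perturb the input x
    (shape (1, C, H, W)) along the direction of steepest loss *ascent*
    so the model's evidence for `target` is erased. Hyperparameters
    are illustrative, not taken from the paper."""
    label = torch.tensor([target], device=x.device)
    baseline = x.clone().detach()
    for _ in range(steps):
        baseline.requires_grad_(True)
        loss = F.cross_entropy(model(baseline), label)
        (grad,) = torch.autograd.grad(loss, baseline)
        # Ascend the loss: each step "unlearns" a bit of the salient signal.
        baseline = (baseline + step_size * grad.sign()).detach()
    return baseline

def attribute(model, x, target, baseline, n_steps=32):
    """Integrated gradients along the straight-line path from the
    adaptive baseline to the input."""
    alphas = torch.linspace(0.0, 1.0, n_steps, device=x.device)
    alphas = alphas.view(-1, *([1] * (x.dim() - 1)))
    # Path of interpolated inputs, shape (n_steps, C, H, W).
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)
    score = model(path)[:, target].sum()
    (grads,) = torch.autograd.grad(score, path)
    # Riemann approximation of the path integral of gradients.
    return (x - baseline) * grads.mean(dim=0, keepdim=True)
```

Usage would follow the usual attribution pattern: `attr = attribute(model, x, y, uni_baseline(model, x, y))`. The point of the sketch is the contrast with static baselines: because the anchor is computed per input by following the model's own gradients, it adapts to model behaviour rather than injecting fixed colour or frequency assumptions.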
