Decomposing and Editing Predictions by Modeling Model Computation

ICML 2024

Authors

Harshay Shah, Andrew Ilyas, Aleksander Madry

Topics

component attribution, model editing, attention heads, convolution filters, counterfactual impact, model decomposition, internal computation analysis

Abstract

How does the internal computation of a machine learning model transform inputs into predictions? To tackle this question, we introduce a framework called component modeling for decomposing a model prediction in terms of its components: architectural "building blocks" such as convolution filters or attention heads. We focus on a special case of this framework, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions, and demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that COAR directly enables effective model editing. Our code is available at github.com/MadryLab/modelcomponents.
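The abstract does not spell out how component attributions are estimated. As we understand it, COAR (Component Attribution via Regression) ablates random subsets of components, records the model's output on a fixed example, and fits a linear model from the ablation mask to that output. The learned coefficients serve as per-component attributions. The sketch below illustrates that idea in pure Python on a toy, exactly linear "model" so the fit can be checked. The functions `toy_model_output`, `sample_masks`, `fit_component_attributions`, and `pick_components_to_ablate`, the ablation fraction, and the top-k editing rule are all hypothetical illustrations. They are not taken from github.com/MadryLab/modelcomponents.

```python
import random


def toy_model_output(mask, weights, bias):
    # Hypothetical stand-in for "run the model with some components ablated and
    # read off its output (e.g. correct-class margin) on one fixed input".
    # mask[i] == 1 keeps component i, 0 ablates it. The toy output is exactly linear.
    return bias + sum(w for w, keep in zip(weights, mask) if keep)


def sample_masks(num_components, num_samples, ablate_frac, rng):
    # Each component is independently ablated (entry 0) with probability ablate_frac.
    return [[0 if rng.random() < ablate_frac else 1 for _ in range(num_components)]
            for _ in range(num_samples)]


def solve_linear_system(a, b):
    # Gaussian elimination with partial pivoting for a square system a @ x = b.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_component_attributions(masks, outputs):
    # Ordinary least squares of outputs on [1, mask] via the normal equations.
    # Returns (intercept, per-component coefficients = attributions).
    rows = [[1.0] + [float(v) for v in mask] for mask in masks]
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, outputs)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return beta[0], beta[1:]


def pick_components_to_ablate(attributions, k):
    # Hypothetical editing rule: under the linear fit, ablating component i removes
    # attributions[i] from the predicted output, so the k most positive attributions
    # give the largest predicted decrease.
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return sorted(order[:k])


# Usage on the toy model: recover known weights, then choose components to ablate.
rng = random.Random(0)
true_weights = [2.0, -1.5, 0.0, 0.5, 3.0]
masks = sample_masks(len(true_weights), num_samples=200, ablate_frac=0.3, rng=rng)
outputs = [toy_model_output(m, true_weights, bias=1.0) for m in masks]
intercept, attributions = fit_component_attributions(masks, outputs)
to_ablate = pick_components_to_ablate(attributions, k=2)
```

On a real network, each call to `toy_model_output` would instead be a forward pass with the selected components zeroed out, and the output would no longer be exactly linear. The regression then gives only an approximation of each component's counterfactual effect.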
