Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models

1 citation

arXiv:2507.09185

#178 of 263 papers in COLM 2025

Top Authors

Ameen Ali Ali, Shahar Katz, Lior Wolf, Ivan Titov

Topics

LLMs, spurious correlations, Integrated Gradients, generalization, model adaptation

Abstract

Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
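The attribution-then-prune idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single hidden layer feeding a softmax classifier, an all-zero activation baseline, a midpoint Riemann sum for the path integral, and a simple "remove the top-k neurons by score" rule. All names (`integrated_gradients`, `prune_top_k`, `W2`, `h`) are hypothetical, and the fine-tuning step that follows pruning in the paper is omitted.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W2, h):
    # W2 has one row per class and one column per hidden neuron.
    return [sum(w * a for w, a in zip(row, h)) for row in W2]

def class_prob(W2, h, c):
    return softmax(logits(W2, h))[c]

def grad_prob_wrt_hidden(W2, h, c):
    # Analytic gradient of the class-c probability w.r.t. each hidden activation:
    # d p_c / d h_j = p_c * (W2[c][j] - sum_k p_k * W2[k][j])
    p = softmax(logits(W2, h))
    return [
        p[c] * (W2[c][j] - sum(p[k] * W2[k][j] for k in range(len(W2))))
        for j in range(len(h))
    ]

def integrated_gradients(W2, h, c, steps=200):
    # Integrate the gradient along the straight path from the zero baseline
    # to the observed activations h (midpoint rule), then scale by (h - 0).
    total = [0.0] * len(h)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        g = grad_prob_wrt_hidden(W2, [alpha * a for a in h], c)
        total = [t + gi for t, gi in zip(total, g)]
    return [a * t / steps for a, t in zip(h, total)]

def prune_top_k(W2, scores, k):
    # Assumed selection rule: zero the outgoing weights of the k neurons
    # with the highest attribution scores.
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    pruned = [[0.0 if j in top else w for j, w in enumerate(row)] for row in W2]
    return pruned, sorted(top)

# Toy data: 3 classes, 4 hidden neurons; neuron 0 dominates class 0.
W2 = [[2.0, 0.1, -0.3, 0.5],
      [-1.0, 0.2, 0.4, 0.1],
      [0.0, -0.5, 0.3, 0.2]]
h = [1.5, 0.8, 0.0, 1.0]
c = 0  # the confidently predicted class

attr = integrated_gradients(W2, h, c)
W2_pruned, removed = prune_top_k(W2, attr, k=1)
print("attributions:", [round(a, 4) for a in attr])
print("pruned neurons:", removed)
print("confidence before/after:",
      round(class_prob(W2, h, c), 4), round(class_prob(W2_pruned, h, c), 4))
```

In this sketch the attributions sum (approximately) to the change in class probability between the baseline and the actual activations, which is the completeness property of Integrated Gradients. How the paper selects examples, layers, and the number of neurons to prune is not stated in the abstract, so those choices here are placeholders.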
