Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers
Abstract
Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications---a common pattern across various architectures---we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
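To make the spectral-gap phenomenon concrete, the following NumPy sketch builds a softmax attention matrix from randomly initialised queries and keys, prints its two largest singular values, and then subtracts the uniform rank-one component to remove the outlier direction. The dimensions, the Gaussian initialisation, and the centring step are illustrative assumptions chosen for this sketch; they are not necessarily the exact construction or correction analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64  # context length and head dimension (illustrative values)

# Queries and keys with i.i.d. Gaussian entries, mimicking random initialisation.
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))

# Row-wise softmax of the scaled score matrix: A is row-stochastic.
logits = Q @ K.T / np.sqrt(d)
logits -= logits.max(axis=1, keepdims=True)  # numerical stability
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)

s = np.linalg.svd(A, compute_uv=False)
print(f"sigma_1 = {s[0]:.3f}, sigma_2 = {s[1]:.3f}")  # typically a clear gap for large T

# One simple way to remove the outlier direction: subtract the uniform
# rank-one component (an illustrative correction, not necessarily the
# paper's exact proposal).
A_centred = A - np.ones((T, T)) / T
s_c = np.linalg.svd(A_centred, compute_uv=False)
print(f"after centring: sigma_1 = {s_c[0]:.3f}, sigma_2 = {s_c[1]:.3f}")
```

Running this for increasing context lengths T shows the gap between the top two singular values widening, while the centred matrix has no comparable outlier, which is the width-direction effect the abstract refers to.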