Transformers, parallel computation, and logarithmic depth

0 citations


Citations: #10 in ICML 2024 of 2635 papers

Authors

Clayton Sanford, Daniel Hsu, Matus Telgarsky

Topics

self-attention layers, parallel computation, logarithmic-depth transformers, neural sequence models, sub-quadratic approximations, computational complexity

Abstract

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.

Citation History

Jan 28, 2026