From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

#10 of 2635 papers in ICML 2024

Authors

Muhammed Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak

Topics

self-attention mechanism, Markov models, generative transformers, teacher-student setting, sample complexity, context-conditioned Markov chain, text generation, repetitive text generation

Abstract

Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and the associated outputs sampled from the model. We first establish a formal link between the self-attention mechanism and Markov models under suitable conditions: Inputting a prompt to the self-attention model samples the output token according to a context-conditioned Markov chain (CCMC). CCMC is obtained by weighing the transition matrix of a standard Markov chain according to the sufficient statistics of the prompt/context. Building on this formalism, we develop identifiability/coverage conditions for the data distribution that guarantee consistent estimation of the latent model under a teacher-student setting and establish sample complexity guarantees under IID data. Finally, we study the problem of learning from a single output trajectory generated in response to an initial prompt. We characterize a winner-takes-all phenomenon where the generative process of self-attention evolves to sampling from a small set of winner tokens that dominate the context window. This provides a mathematical explanation to the tendency of modern LLMs to generate repetitive text.
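The CCMC idea in the abstract can be illustrated with a small sketch. This is one possible reading, not the authors' implementation: it assumes the "sufficient statistics of the prompt" are the token counts in the context, and that the next-token distribution is the base Markov row for the last token, reweighted by those counts and renormalized. The function names `ccmc_next_token_probs` and `generate` are hypothetical.

```python
import numpy as np

def ccmc_next_token_probs(P, prompt):
    """Hypothetical CCMC step (assumption: sufficient statistic = token counts).

    P      : (K, K) row-stochastic base transition matrix with positive entries.
    prompt : 1-D integer array of token ids in [0, K).
    Returns the next-token distribution: the row of P for the last prompt
    token, reweighted by how often each token occurs in the prompt.
    """
    K = P.shape[0]
    counts = np.bincount(prompt, minlength=K).astype(float)
    weighted = P[prompt[-1]] * counts  # tokens absent from the prompt get weight 0
    return weighted / weighted.sum()

def generate(P, prompt, steps, rng):
    """Sample a single trajectory; each new token joins the context."""
    tokens = list(prompt)
    for _ in range(steps):
        probs = ccmc_next_token_probs(P, np.array(tokens))
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = 4
    P = np.full((K, K), 1.0 / K)  # uniform base chain, chosen only for illustration
    trajectory = generate(P, [0, 0, 1, 2], steps=200, rng=rng)
    # Inspect how the context composition drifts as generation proceeds.
    print(np.bincount(trajectory, minlength=K))
```

Under this assumed form, a token that never appears in the prompt can never be sampled, and each sampled token raises its own future weight. That self-reinforcement is the kind of dynamic the abstract's winner-takes-all result concerns, although the sketch does not reproduce the paper's analysis.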
