"gradient descent analysis" Papers
4 papers found
Benign Overfitting in Single-Head Attention
Roey Magen, Shuning Shang, Zhiwei Xu et al.
NeurIPS 2025 poster, arXiv:2410.07746
6 citations
How do Transformers Perform In-Context Autoregressive Learning?
Michael Sander, Raja Giryes, Taiji Suzuki et al.
ICML 2024 poster
How Transformers Learn Causal Structure with Gradient Descent
Eshaan Nichani, Alex Damian, Jason Lee
ICML 2024 poster
In-context Convergence of Transformers
Yu Huang, Yuan Cheng, Yingbin Liang
ICML 2024 poster