ICML "transformers architecture" Papers
2 papers found
Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context
Xiang Cheng, Yuxin Chen, Suvrit Sra
ICML 2024posterarXiv:2312.06528
Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
Zixuan Wang, Stanley Wei, Daniel Hsu et al.
ICML 2024posterarXiv:2406.06893