ICML 2024 "length generalization" Papers
3 papers found
Case-Based or Rule-Based: How Do Transformers Do the Math?
Yi Hu, Xiaojuan Tang, Haotong Yang et al.
ICML 2024 Poster
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen et al.
ICML 2024 Poster
Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
Zixuan Wang, Stanley Wei, Daniel Hsu et al.
ICML 2024 Poster