"sparse activation" Papers
3 papers found
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
Ziteng Wang, Jun Zhu, Jianfei Chen
ICLR 2025 (poster) · arXiv:2412.14711 · 28 citations
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Xiaoming Shi, Shiyu Wang, Yuqi Nie et al.
ICLR 2025 (poster) · arXiv:2409.16040 · 178 citations
Exploring the Benefit of Activation Sparsity in Pre-training
Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin et al.
ICML 2024 (poster)