Poster "language modeling" Papers
34 papers found
Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking
Heli Ben-Hamu, Itai Gat, Daniel Severo et al.
AdaFisher: Adaptive Second Order Optimization via Fisher Information
Damien Gomes, Yanlei Zhang, Eugene Belilovsky et al.
Chunk-Distilled Language Modeling
Yanhong Li, Karen Livescu, Jiawei Zhou
Continuous Diffusion Model for Language Modeling
Jaehyeong Jo, Sung Ju Hwang
Differential Transformer
Tianzhu Ye, Li Dong, Yuqing Xia et al.
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite et al.
Glauber Generative Model: Discrete Diffusion Models via Binary Classification
Harshit Varma, Dheeraj Nagaraj, Karthikeyan Shanmugam
Language Models Are Implicitly Continuous
Samuele Marro, Davide Evangelista, X. Huang et al.
MIND over Body: Adaptive Thinking using Dynamic Computation
Mrinal Mathur, Barak Pearlmutter, Sergey Plis
Nested Learning: The Illusion of Deep Learning Architectures
Ali Behrouz, Meisam Razaviyayn, Peilin Zhong et al.
Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Cai Zhou, Chenyu Wang, Dinghuai Zhang et al.
Selective Attention Improves Transformer
Yaniv Leviathan, Matan Kalman, Yossi Matias
ShortListing Model: A Streamlined Simplex Diffusion for Discrete Variable Generation
Yuxuan Song, Zhe Zhang, Yu Pei et al.
The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini, Pierre Ablin, David Grangier
Tight Clusters Make Specialized Experts
Stefan Nielsen, Rachel Teo, Laziz Abdullaev et al.
AMPA: Adaptive Mixed Precision Allocation for Low-Bit Integer Training
Li Ding, Wen Fei, Yuyang Huang et al.
An Independence-promoting Loss for Music Generation with Language Models
Jean-Marie Lemercier, Simon Rouard, Jade Copet et al.
Can Mamba Learn How To Learn? A Comparative Study on In-Context Learning Tasks
Jong Ho Park, Jaden Park, Zheyang Xiong et al.
Differentiable Model Scaling using Differentiable Topk
Kai Liu, Ruohui Wang, Jianfei Gao et al.
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, Stefano Ermon
Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
David T. Hoffmann, Simon Schrodi, Jelena Bratulić et al.
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen et al.
Improving Transformers with Dynamically Composable Multi-Head Attention
Da Xiao, Qingye Meng, Shengping Li et al.
In-Context Language Learning: Architectures and Algorithms
Ekin Akyürek, Bailin Wang, Yoon Kim et al.
Matrix Information Theory for Self-Supervised Learning
Yifan Zhang, Zhiquan Tan, Jingqin Yang et al.
Modeling Language Tokens as Functionals of Semantic Fields
Zhengqi Pei, Anran Zhang, Shuhui Wang et al.
MultiMax: Sparse and Multi-Modal Attention Learning
Yuxuan Zhou, Mario Fritz, Margret Keuper
PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels
Praneeth Kacham, Vahab Mirrokni, Peilin Zhong
Positive Concave Deep Equilibrium Models
Mateusz Gabor, Tomasz Piotrowski, Renato L. G. Cavalcante
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
Jialong Guo, Xinghao Chen, Yehui Tang et al.
SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms
Xingrun Xing, Zheng Zhang, Ziyi Ni et al.
StableMask: Refining Causal Masking in Decoder-only Transformer
Qingyu Yin, Xuzheng He, Xiang Zhuang et al.
Trainable Transformer in Transformer
Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia et al.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao, Albert Gu