"language modeling" Papers
31 papers found
Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking
Heli Ben-Hamu, Itai Gat, Daniel Severo et al.
Chunk-Distilled Language Modeling
Yanhong Li, Karen Livescu, Jiawei Zhou
Continuous Diffusion Model for Language Modeling
Jaehyeong Jo, Sung Ju Hwang
Differential Transformer
Tianzhu Ye, Li Dong, Yuqing Xia et al.
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite et al.
Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Cai Zhou, Chenyu Wang, Dinghuai Zhang et al.
Scaling up Masked Diffusion Models on Text
Shen Nie, Fengqi Zhu, Chao Du et al.
ShortListing Model: A Streamlined Simplex Diffusion for Discrete Variable Generation
Yuxuan Song, Zhe Zhang, Yu Pei et al.
Tight Clusters Make Specialized Experts
Stefan Nielsen, Rachel Teo, Laziz Abdullaev et al.
AMPA: Adaptive Mixed Precision Allocation for Low-Bit Integer Training
Li Ding, Wen Fei, Yuyang Huang et al.
An Independence-promoting Loss for Music Generation with Language Models
Jean-Marie Lemercier, Simon Rouard, Jade Copet et al.
Cached Transformers: Improving Transformers with Differentiable Memory Cache
Zhaoyang Zhang, Wenqi Shao, Yixiao Ge et al.
Can Mamba Learn How To Learn? A Comparative Study on In-Context Learning Tasks
Jong Ho Park, Jaden Park, Zheyang Xiong et al.
Differentiable Model Scaling using Differentiable Topk
Kai Liu, Ruohui Wang, Jianfei Gao et al.
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, Stefano Ermon
Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
David T. Hoffmann, Simon Schrodi, Jelena Bratulić et al.
Exploring Transformer Extrapolation
Zhen Qin, Yiran Zhong, Hui Deng
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen et al.
Improving Transformers with Dynamically Composable Multi-Head Attention
Da Xiao, Qingye Meng, Shengping Li et al.
In-Context Language Learning: Architectures and Algorithms
Ekin Akyürek, Bailin Wang, Yoon Kim et al.
Matrix Information Theory for Self-Supervised Learning
Yifan Zhang, Zhiquan Tan, Jingqin Yang et al.
Modeling Language Tokens as Functionals of Semantic Fields
Zhengqi Pei, Anran Zhang, Shuhui Wang et al.
MultiMax: Sparse and Multi-Modal Attention Learning
Yuxuan Zhou, Mario Fritz, Margret Keuper
PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels
Praneeth Kacham, Vahab Mirrokni, Peilin Zhong
Positive Concave Deep Equilibrium Models
Mateusz Gabor, Tomasz Piotrowski, Renato L. G. Cavalcante
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
Jialong Guo, Xinghao Chen, Yehui Tang et al.
SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms
Xingrun Xing, Zheng Zhang, Ziyi Ni et al.
StableMask: Refining Causal Masking in Decoder-only Transformer
Qingyu Yin, Xuzheng He, Xiang Zhuang et al.
Stay on Topic with Classifier-Free Guidance
Guillaume Sanchez, Alexander Spangher, Honglu Fan et al.
Trainable Transformer in Transformer
Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia et al.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao, Albert Gu