"speculative decoding" Papers

29 papers found

Approximately Aligned Decoding

Daniel Melcer, Sujan Kumar Gonugondla, Pramuditha Perera et al.

NEURIPS 2025 · arXiv:2410.01103 · 2 citations

Block Verification Accelerates Speculative Decoding

Ziteng Sun, Uri Mendlovic, Yaniv Leviathan et al.

ICLR 2025 · arXiv:2403.10444 · 19 citations

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Jinwei Yao, Kaiqi Chen, Kexun Zhang et al.

ICLR 2025 · arXiv:2404.00242 · 9 citations

EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

Yize Wu, Ke Gao, Ling Li et al.

NEURIPS 2025 · arXiv:2502.02493 · 1 citation

Faster Cascades via Speculative Decoding

Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat et al.

ICLR 2025 · arXiv:2405.19261 · 23 citations

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

Shijing Hu, Jingyang Li, Xingyu Xie et al.

NEURIPS 2025 · arXiv:2502.11018 · 3 citations

Grouped Speculative Decoding for Autoregressive Image Generation

Junhyuk So, Juncheol Shin, Hyunho Kook et al.

ICCV 2025 · arXiv:2508.07747 · 3 citations

LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding

Doohyuk Jang, Sihwan Park, June Yong Yang et al.

ICLR 2025 · arXiv:2410.03355 · 30 citations

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.

ICLR 2025 · arXiv:2408.11049 · 64 citations

Mixture of Attentions For Speculative Decoding

Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras et al.

ICLR 2025 · arXiv:2410.03804 · 14 citations

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Kaixuan Huang, Xudong Guo, Mengdi Wang

COLM 2025 · 46 citations

SpecEM: Training-Free LLM Ensembling via Iterative Drafting, Verification, and Online Feedback

Bo Lv, Nayu Liu, Chen Tang et al.

NEURIPS 2025

SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding

Thomas Walton, Darin Tsui, Aryan Musharaf et al.

NEURIPS 2025 (spotlight) · arXiv:2509.21689 · 1 citation

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Rui Pan, Yinwei Dai, Zhihao Zhang et al.

NEURIPS 2025 · arXiv:2504.07891 · 37 citations

Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

Yao Teng, Fu-Yun Wang, Xian Liu et al.

NEURIPS 2025 · arXiv:2510.08994

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Zilong (Ryan) Wang, Zifeng Wang, Long Le et al.

ICLR 2025 · arXiv:2407.08223 · 78 citations

SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Heming Xia, Yongqi Li, Jun Zhang et al.

ICLR 2025 · arXiv:2410.06916 · 41 citations

Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy

Xiaoxiao Ma, Feng Zhao, Pengyang Ling et al.

NEURIPS 2025 · arXiv:2510.09012 · 3 citations

Towards Optimal Multi-draft Speculative Decoding

Zhengmian Hu, Tong Zheng, Vignesh Viswanathan et al.

ICLR 2025 · arXiv:2502.18779 · 11 citations

TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding

Shukai Gong, Yiyang Fu, Fengyuan Ran et al.

NEURIPS 2025 (oral) · arXiv:2507.09252

Traversal Verification for Speculative Tree Decoding

Yepeng Weng, Qiao Hu, Xujie Chen et al.

NEURIPS 2025 · arXiv:2505.12398 · 3 citations

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NEURIPS 2025 · arXiv:2509.15235 · 2 citations

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Yichao Fu, Peter Bailis, Ion Stoica et al.

ICML 2024 · arXiv:2402.02057 · 257 citations

GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding

Cunxiao Du, Jing Jiang, Xu Yuanchen et al.

ICML 2024 · arXiv:2402.02082 · 65 citations

Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation

Luca Beurer-Kellner, Marc Fischer, Martin Vechev

ICML 2024 · arXiv:2403.06988 · 82 citations

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng et al.

ICML 2024 · arXiv:2401.10774 · 549 citations

Online Speculative Decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis et al.

ICML 2024 · arXiv:2310.07177 · 92 citations

Tandem Transformers for Inference Efficient LLMs

Aishwarya P S, Pranav Nair, Yashas Samaga et al.

ICML 2024 · arXiv:2402.08644 · 10 citations

When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Haoran You, Yichao Fu, Zheng Wang et al.

ICML 2024 · arXiv:2406.07368 · 9 citations