"speculative decoding" Papers

17 papers found

Approximately Aligned Decoding

Daniel Melcer, Sujan Kumar Gonugondla, Pramuditha Perera et al.

NeurIPS 2025 · poster · arXiv:2410.01103 · 2 citations

Block Verification Accelerates Speculative Decoding

Ziteng Sun, Uri Mendlovic, Yaniv Leviathan et al.

ICLR 2025 · poster · arXiv:2403.10444 · 18 citations

EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

Yize Wu, Ke Gao, Ling Li et al.

NeurIPS 2025 · poster · arXiv:2502.02493 · 1 citation

Grouped Speculative Decoding for Autoregressive Image Generation

Junhyuk So, Juncheol Shin, Hyunho Kook et al.

ICCV 2025 · poster · arXiv:2508.07747 · 3 citations

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.

ICLR 2025 · poster · arXiv:2408.11049 · 61 citations

SpecEM: Training-Free LLM Ensembling via Iterative Drafting, Verification, and Online Feedback

Bo Lv, Nayu Liu, Chen Tang et al.

NeurIPS 2025 · poster

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Rui Pan, Yinwei Dai, Zhihao Zhang et al.

NeurIPS 2025 · poster · arXiv:2504.07891 · 35 citations

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Zilong (Ryan) Wang, Zifeng Wang, Long Le et al.

ICLR 2025 · poster · arXiv:2407.08223 · 75 citations

SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Heming Xia, Yongqi Li, Jun Zhang et al.

ICLR 2025 · poster · arXiv:2410.06916 · 39 citations

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NeurIPS 2025 · poster · arXiv:2509.15235 · 2 citations

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Yichao Fu, Peter Bailis, Ion Stoica et al.

ICML 2024 · poster

GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding

Cunxiao Du, Jing Jiang, Xu Yuanchen et al.

ICML 2024 · poster

Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation

Luca Beurer-Kellner, Marc Fischer, Martin Vechev

ICML 2024 · poster

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng et al.

ICML 2024 · poster

Online Speculative Decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis et al.

ICML 2024 · poster

Tandem Transformers for Inference Efficient LLMs

Aishwarya P S, Pranav Nair, Yashas Samaga et al.

ICML 2024 · poster

When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Haoran You, Yichao Fu, Zheng Wang et al.

ICML 2024 · poster