Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding

Citations: 0

#1943 of 5858 papers in NeurIPS 2025

Authors

Yue Guan, Changming Yu, Shihan Fang, Weiming Hu, Zaifeng Pan, Zheng Wang, Zihan Liu, Yangjie Zhou, Yufei Ding, Minyi Guo, Jingwen Leng

Abstract

Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
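For readers unfamiliar with the draft-then-verify loop the abstract refers to, the following is a minimal sketch of one generic speculative decoding step with greedy acceptance. It is not Yggdrasil's algorithm: it drafts a single linear chain rather than a tree, and it does not model the equal-growth structure, the latency-aware objective, or the stage-based scheduling. The names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy integer functions standing in for real LLMs.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Verify phase: compare each drafted token with the target's greedy choice.
    # A real system scores all positions in one batched forward pass;
    # this sketch loops position by position for clarity.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except after token 3, where it guesses wrong.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, k=4))
    print(speculative_step([5], toy_draft, toy_target, k=2))
```

With greedy acceptance, the accepted tokens are always the ones the target model would have produced on its own, so the draft affects only speed, not output. Each step yields between 1 and k + 1 tokens depending on how many drafted tokens match.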
