"inference acceleration" Papers

32 papers found

Beyond Autoregression: Fast LLMs via Self-Distillation Through Time

Justin Deschenaux, Caglar Gulcehre

ICLR 2025 poster · arXiv:2410.21035 · 25 citations

Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference

Nadav Timor, Jonathan Mamou, Daniel Korat et al.

ICLR 2025 poster · arXiv:2405.14105 · 7 citations

Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

Marianne Arriola, Yair Schiff, Hao Phung et al.

NeurIPS 2025 poster · arXiv:2510.22852 · 1 citation

FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

Tianyun Zhong, Chao Liang, Jianwen Jiang et al.

CVPR 2025 poster · arXiv:2412.16915 · 5 citations

Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Costin-Andrei Oncescu, Sanket Jayant Purandare, Stratos Idreos et al.

ICLR 2025 poster · arXiv:2410.12982 · 2 citations

Grouped Speculative Decoding for Autoregressive Image Generation

Junhyuk So, Juncheol Shin, Hyunho Kook et al.

ICCV 2025 poster · arXiv:2508.07747 · 3 citations

KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Jingbo Yang, Bairu Hou, Wei Wei et al.

NeurIPS 2025 poster · arXiv:2502.16002 · 24 citations

Language Models Can Predict Their Own Behavior

Dhananjay Ashok, Jonathan May

NeurIPS 2025 poster · arXiv:2502.13329 · 5 citations

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

Xiaohao Liu, Xiaobo Xia, Weixiang Zhao et al.

NeurIPS 2025 poster · arXiv:2505.17505 · 5 citations

Quantization without Tears

Minghao Fu, Hao Yu, Jie Shao et al.

CVPR 2025 poster · arXiv:2411.13918 · 14 citations

SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models

Jaerin Lee, Daniel Jung, Kanggeon Lee et al.

CVPR 2025 poster · arXiv:2403.09055 · 3 citations

Simple ReFlow: Improved Techniques for Fast Flow Models

Beomsu Kim, Yu-Guan Hsieh, Michal Klein et al.

ICLR 2025 poster · arXiv:2410.07815 · 28 citations

SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models

Muyang Li, Yujun Lin, Zhekai Zhang et al.

ICLR 2025 poster · arXiv:2411.05007 · 90 citations

TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance

Minghao Fu, Guo-Hua Wang, Xiaohao Chen et al.

ICCV 2025 poster · arXiv:2507.18192

TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup

Fanxu Meng, Pingzhi Tang, Zengwei Yao et al.

NeurIPS 2025 spotlight

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NeurIPS 2025 poster · arXiv:2509.15235 · 2 citations

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu et al.

ECCV 2024 poster · arXiv:2403.06764 · 343 citations

Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere et al.

ICML 2024 poster

Data-free Distillation of Diffusion Models with Bootstrapping

Jiatao Gu, Chen Wang, Shuangfei Zhai et al.

ICML 2024 poster

DiJiang: Efficient Large Language Models through Compact Kernelization

Hanting Chen, Liuzhicheng Liuzhicheng, Xutao Wang et al.

ICML 2024 poster

Distilling Diffusion Models into Conditional GANs

Minguk Kang, Richard Zhang, Connelly Barnes et al.

ECCV 2024 poster · arXiv:2405.05967 · 75 citations

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang et al.

ICML 2024 poster

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

Yanxi Chen, Xuchen Pan, Yaliang Li et al.

ICML 2024 poster

Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders

Bumsoo Kim, Jinhyung Kim, Yeonsik Jo et al.

AAAI 2024 paper · arXiv:2312.12659 · 5 citations

Fluctuation-Based Adaptive Structured Pruning for Large Language Models

Yongqi An, Xu Zhao, Tao Yu et al.

AAAI 2024 paper · arXiv:2312.11983 · 96 citations

How Deep Do We Need: Accelerating Training and Inference of Neural ODEs via Control Perspective

Keyan Miao, Konstantinos Gatsis

ICML 2024 oral

Online Speculative Decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis et al.

ICML 2024 poster

OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization

Xiang Meng, Shibal Ibrahim, Kayhan Behdin et al.

ICML 2024 poster

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Lu Yin, You Wu, Zhenyu Zhang et al.

ICML 2024 poster

REST: Efficient and Accelerated EEG Seizure Analysis through Residual State Updates

Arshia Afzal, Grigorios Chrysos, Volkan Cevher et al.

ICML 2024 oral

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Jiwon Song, Kyungseok Oh, Taesu Kim et al.

ICML 2024 poster

Switchable Decision: Dynamic Neural Generation Networks

Shujian Zhang, Korawat Tanwisuth, Chengyue Gong et al.

ICML 2024 poster