"inference acceleration" Papers

53 papers found • Page 1 of 2

AdaDiff: Adaptive Step Selection for Fast Diffusion Models

Hui Zhang, Zuxuan Wu, Zhen Xing et al.

AAAI 2025 • arXiv:2311.14768 • 20 citations

Beyond Autoregression: Fast LLMs via Self-Distillation Through Time

Justin Deschenaux, Caglar Gulcehre

ICLR 2025 • arXiv:2410.21035 • 29 citations

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Qizhe Zhang, Aosong Cheng, Ming Lu et al.

ICCV 2025 • arXiv:2412.01818 • 45 citations

BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Chenyang Song, Weilin Zhao, Xu Han et al.

COLM 2025 • arXiv:2507.08771

Block Verification Accelerates Speculative Decoding

Ziteng Sun, Uri Mendlovic, Yaniv Leviathan et al.

ICLR 2025 • arXiv:2403.10444 • 19 citations

Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

Kunyun Wang, Bohan Li, Kai Yu et al.

NeurIPS 2025 • arXiv:2505.14741 • 1 citation

Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference

Nadav Timor, Jonathan Mamou, Daniel Korat et al.

ICLR 2025 • arXiv:2405.14105 • 7 citations

dKV-Cache: The Cache for Diffusion Language Models

Xinyin Ma, Runpeng Yu, Gongfan Fang et al.

NeurIPS 2025 • arXiv:2505.15781 • 72 citations

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang et al.

NeurIPS 2025 • arXiv:2503.01840 • 115 citations

Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

Marianne Arriola, Yair Schiff, Hao Phung et al.

NeurIPS 2025 • arXiv:2510.22852 • 6 citations

FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

Tianyun Zhong, Chao Liang, Jianwen Jiang et al.

CVPR 2025 • arXiv:2412.16915 • 5 citations

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Zhengyao Lyu, Chenyang Si, Junhao Song et al.

ICLR 2025 (oral) • arXiv:2410.19355 • 58 citations

Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Costin-Andrei Oncescu, Sanket Jayant Purandare, Stratos Idreos et al.

ICLR 2025 • arXiv:2410.12982 • 2 citations

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

Shijing Hu, Jingyang Li, Xingyu Xie et al.

NeurIPS 2025 • arXiv:2502.11018 • 3 citations

Grouped Speculative Decoding for Autoregressive Image Generation

Junhyuk So, Juncheol Shin, Hyunho Kook et al.

ICCV 2025 • arXiv:2508.07747 • 3 citations

HoliTom: Holistic Token Merging for Fast Video Large Language Models

Kele Shao, Keda Tao, Can Qin et al.

NeurIPS 2025 (oral) • arXiv:2505.21334 • 20 citations

KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Jingbo Yang, Bairu Hou, Wei Wei et al.

NeurIPS 2025 • arXiv:2502.16002 • 29 citations

Language Models Can Predict Their Own Behavior

Dhananjay Ashok, Jonathan May

NeurIPS 2025 • arXiv:2502.13329 • 5 citations

LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

Xuan Shen, Zhao Song, Yufa Zhou et al.

AAAI 2025 • arXiv:2412.12444 • 38 citations

LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

Huanlin Gao, Ping Chen, Fuyuan Shi et al.

NeurIPS 2025 (spotlight) • arXiv:2511.00090 • 1 citation

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

Xiaohao Liu, Xiaobo Xia, Weixiang Zhao et al.

NeurIPS 2025 • arXiv:2505.17505 • 5 citations

ParaSolver: A Hierarchical Parallel Integral Solver for Diffusion Models

Jianrong Lu, Zhiyu Zhu, Junhui Hou

ICLR 2025 • 4 citations

PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution

Yong Liu, Hang Dong, Jinshan Pan et al.

ICCV 2025 • arXiv:2405.17158 • 1 citation

Presto! Distilling Steps and Layers for Accelerating Music Generation

Zachary Novack, Ge Zhu, Jonah Casebeer et al.

ICLR 2025 • arXiv:2410.05167 • 16 citations

Quantization without Tears

Minghao Fu, Hao Yu, Jie Shao et al.

CVPR 2025 • arXiv:2411.13918 • 16 citations

Reward Guided Latent Consistency Distillation

William Wang, Jiachen Li, Weixi Feng et al.

ICLR 2025 • arXiv:2403.11027 • 27 citations

SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models

Jaerin Lee, Daniel Jung, Kanggeon Lee et al.

CVPR 2025 • arXiv:2403.09055 • 3 citations

Simple ReFlow: Improved Techniques for Fast Flow Models

Beomsu Kim, Yu-Guan Hsieh, Michal Klein et al.

ICLR 2025 • arXiv:2410.07815 • 28 citations

SLMRec: Distilling Large Language Models into Small for Sequential Recommendation

Wujiang Xu, Qitian Wu, Zujie Liang et al.

ICLR 2025 (oral) • arXiv:2405.17890 • 18 citations

Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

Yao Teng, Fu-Yun Wang, Xian Liu et al.

NeurIPS 2025 • arXiv:2510.08994

SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models

Muyang Li, Yujun Lin, Zhekai Zhang et al.

ICLR 2025 • arXiv:2411.05007 • 98 citations

TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance

Minghao Fu, Guo-Hua Wang, Xiaohao Chen et al.

ICCV 2025 • arXiv:2507.18192

TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup

Fanxu Meng, Pingzhi Tang, Zengwei Yao et al.

NeurIPS 2025 (spotlight)

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li et al.

NeurIPS 2025 • arXiv:2509.15235 • 2 citations

VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching

Siyu Xu, Yunke Wang, Chenghao Xia et al.

NeurIPS 2025 (oral) • arXiv:2502.02175 • 27 citations

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu et al.

ECCV 2024 • arXiv:2403.06764 • 368 citations

Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere et al.

ICML 2024 • arXiv:2404.19737 • 232 citations

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

Yuxi Ren, Jie Wu, Yanzuo Lu et al.

ECCV 2024 • arXiv:2404.04860 • 10 citations

Data-free Distillation of Diffusion Models with Bootstrapping

Jiatao Gu, Chen Wang, Shuangfei Zhai et al.

ICML 2024

DiJiang: Efficient Large Language Models through Compact Kernelization

Hanting Chen, Liuzhicheng Liuzhicheng, Xutao Wang et al.

ICML 2024 • arXiv:2403.19928 • 11 citations

Distilling Diffusion Models into Conditional GANs

Minguk Kang, Richard Zhang, Connelly Barnes et al.

ECCV 2024 • arXiv:2405.05967 • 77 citations

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang et al.

ICML 2024 • arXiv:2401.15077 • 338 citations

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

Yanxi Chen, Xuchen Pan, Yaliang Li et al.

ICML 2024 • arXiv:2312.04916 • 60 citations

Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders

Bumsoo Kim, Jinhyung Kim, Yeonsik Jo et al.

AAAI 2024 • arXiv:2312.12659 • 5 citations

Fluctuation-Based Adaptive Structured Pruning for Large Language Models

Yongqi An, Xu Zhao, Tao Yu et al.

AAAI 2024 • arXiv:2312.11983 • 106 citations

How Deep Do We Need: Accelerating Training and Inference of Neural ODEs via Control Perspective

Keyan Miao, Konstantinos Gatsis

ICML 2024 (oral)

Online Speculative Decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis et al.

ICML 2024 • arXiv:2310.07177 • 92 citations

OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization

Xiang Meng, Shibal Ibrahim, Kayhan Behdin et al.

ICML 2024 • arXiv:2403.12983 • 13 citations

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Lu Yin, You Wu, Zhenyu Zhang et al.

ICML 2024 • arXiv:2310.05175 • 152 citations

PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference

Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu et al.

ECCV 2024 • arXiv:2403.16020 • 7 citations