"inference acceleration" Papers
53 papers found • Page 1 of 2
AdaDiff: Adaptive Step Selection for Fast Diffusion Models
Hui Zhang, Zuxuan Wu, Zhen Xing et al.
Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
Justin Deschenaux, Caglar Gulcehre
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu et al.
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
Chenyang Song, Weilin Zhao, Xu Han et al.
Block Verification Accelerates Speculative Decoding
Ziteng Sun, Uri Mendlovic, Yaniv Leviathan et al.
Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism
Kunyun Wang, Bohan Li, Kai Yu et al.
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
Nadav Timor, Jonathan Mamou, Daniel Korat et al.
dKV-Cache: The Cache for Diffusion Language Models
Xinyin Ma, Runpeng Yu, Gongfan Fang et al.
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li, Fangyun Wei, Chao Zhang et al.
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
Marianne Arriola, Yair Schiff, Hao Phung et al.
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
Tianyun Zhong, Chao Liang, Jianwen Jiang et al.
FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
Zhengyao Lyu, Chenyang Si, Junhao Song et al.
Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
Costin-Andrei Oncescu, Sanket Jayant Purandare, Stratos Idreos et al.
GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
Shijing Hu, Jingyang Li, Xingyu Xie et al.
Grouped Speculative Decoding for Autoregressive Image Generation
Junhyuk So, Juncheol Shin, Hyunho Kook et al.
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao, Keda Tao, Can Qin et al.
KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
Jingbo Yang, Bairu Hou, Wei Wei et al.
Language Models Can Predict Their Own Behavior
Dhananjay Ashok, Jonathan May
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers
Xuan Shen, Zhao Song, Yufa Zhou et al.
LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation
Huanlin Gao, Ping Chen, Fuyuan Shi et al.
L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Xiaohao Liu, Xiaobo Xia, Weixiang Zhao et al.
ParaSolver: A Hierarchical Parallel Integral Solver for Diffusion Models
Jianrong Lu, Zhiyu Zhu, Junhui Hou
PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution
Yong Liu, Hang Dong, Jinshan Pan et al.
Presto! Distilling Steps and Layers for Accelerating Music Generation
Zachary Novack, Ge Zhu, Jonah Casebeer et al.
Quantization without Tears
Minghao Fu, Hao Yu, Jie Shao et al.
Reward Guided Latent Consistency Distillation
William Wang, Jiachen Li, Weixi Feng et al.
SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models
Jaerin Lee, Daniel Jung, Kanggeon Lee et al.
Simple ReFlow: Improved Techniques for Fast Flow Models
Beomsu Kim, Yu-Guan Hsieh, Michal Klein et al.
SLMRec: Distilling Large Language Models into Small for Sequential Recommendation
Wujiang Xu, Qitian Wu, Zujie Liang et al.
Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
Yao Teng, Fu-Yun Wang, Xian Liu et al.
SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models
Muyang Li, Yujun Lin, Zhekai Zhang et al.
TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance
Minghao Fu, Guo-Hua Wang, Xiaohao Chen et al.
TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup
Fanxu Meng, Pingzhi Tang, Zengwei Yao et al.
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Jialiang Kang, Han Shu, Wenshuo Li et al.
VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching
Siyu Xu, Yunke Wang, Chenghao Xia et al.
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu et al.
Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere et al.
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
Yuxi Ren, Jie Wu, Yanzuo Lu et al.
Data-free Distillation of Diffusion Models with Bootstrapping
Jiatao Gu, Chen Wang, Shuangfei Zhai et al.
DiJiang: Efficient Large Language Models through Compact Kernelization
Hanting Chen, Zhicheng Liu, Xutao Wang et al.
Distilling Diffusion Models into Conditional GANs
Minguk Kang, Richard Zhang, Connelly Barnes et al.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Yuhui Li, Fangyun Wei, Chao Zhang et al.
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
Yanxi Chen, Xuchen Pan, Yaliang Li et al.
Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders
Bumsoo Kim, Jinhyung Kim, Yeonsik Jo et al.
Fluctuation-Based Adaptive Structured Pruning for Large Language Models
Yongqi An, Xu Zhao, Tao Yu et al.
How Deep Do We Need: Accelerating Training and Inference of Neural ODEs via Control Perspective
Keyan Miao, Konstantinos Gatsis
Online Speculative Decoding
Xiaoxuan Liu, Lanxiang Hu, Peter Bailis et al.
OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization
Xiang Meng, Shibal Ibrahim, Kayhan Behdin et al.
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Lu Yin, You Wu, Zhenyu Zhang et al.
PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu et al.