🧬 Efficiency

Efficient Inference

Fast and efficient model inference

100 papers · 2,421 total citations
[Chart: 410 papers over time, Feb '24 – Jan '26]
Also includes: efficient inference, inference optimization, computational efficiency, fast inference

Top Papers

#1

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu et al.

ICLR 2024
372 citations
#2

Training Language Models to Reason Efficiently

Daman Arora, Andrea Zanette

NeurIPS 2025
155 citations
#3

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving

Yangzhen Wu, Zhiqing Sun, Shanda Li et al.

ICLR 2025
Keywords: inference scaling laws, compute-optimal inference, large language models, test-time scaling (+4)
146 citations
#4

Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

Yifan Wang, Xingyi He, Sida Peng et al.

CVPR 2024
142 citations
#5

Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Han Zhao, Min Zhang, Wei Zhao et al.

AAAI 2025
106 citations
#6

DEIM: DETR with Improved Matching for Fast Convergence

Shihua Huang, Zhichao Lu, Xiaodong Cun et al.

CVPR 2025
93 citations
#7

Consistency Models Made Easy

Zhengyang Geng, Ashwini Pokle, Weijian Luo et al.

ICLR 2025
81 citations
#8

dKV-Cache: The Cache for Diffusion Language Models

Xinyin Ma, Runpeng Yu, Gongfan Fang et al.

NeurIPS 2025
64 citations
#9

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

Hritik Bansal, Arian Hosseini, Rishabh Agarwal et al.

ICLR 2025
63 citations
#10

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Hanshi Sun, Li-Wen Chang, Wenlei Bao et al.

ICML 2025
56 citations
#11

Inductive Moment Matching

Linqi (Alex) Zhou, Stefano Ermon, Jiaming Song

ICML 2025
54 citations
#12

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Xunhao Lai, Jianqiao Lu, Yao Luo et al.

ICLR 2025 · arXiv:2502.20766
Keywords: attention mechanism, sparse attention, long-sequence inference, query-aware patterns (+2)
51 citations
#13

Visual Agents as Fast and Slow Thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang et al.

ICLR 2025
44 citations
#14

SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Heming Xia, Yongqi Li, Jun Zhang et al.

ICLR 2025 · arXiv:2410.06916
Keywords: speculative decoding, llm inference acceleration, layer-skipping, self-speculative decoding (+3)
39 citations
#15

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

Yang Zhou, Hao Shao, Letian Wang et al.

CVPR 2024
38 citations
#16

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Yiming Wang, Pei Zhang, Siyuan Huang et al.

NeurIPS 2025
38 citations
#17

TinySAM: Pushing the Envelope for Efficient Segment Anything Model

Han Shu, Wenshuo Li, Yehui Tang et al.

AAAI 2025
37 citations
#18

LION: Implicit Vision Prompt Tuning

Haixin Wang, Jianlong Chang, Yihang Zhai et al.

AAAI 2024 · arXiv:2303.09992
Keywords: vision prompt tuning, implicit layers, vision transformers, computational efficiency (+4)
35 citations
#19

Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference

Barys Liskavets, Maxim Ushakov, Shuvendu Roy et al.

AAAI 2025
31 citations
#20

Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed

Yubin Xiao, Di Wang, Boyang Li et al.

AAAI 2024 · arXiv:2312.12469
Keywords: knowledge distillation, autoregressive models, non-autoregressive models, vehicle routing problems (+2)
31 citations
#21

Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms

Yinuo Ren, Haoxuan Chen, Yuchen Zhu et al.

NeurIPS 2025 · arXiv:2502.00234
Keywords: discrete diffusion models, high-order algorithms, numerical inference schemes, generative modeling (+4)
29 citations
#22

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Le Zhuo, Liangbing Zhao, Sayak Paul et al.

ICCV 2025
28 citations
#23

PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Yizhe Xiong, Hui Chen, Tianxiang Hao et al.

ECCV 2024
26 citations
#24

HyperFast: Instant Classification for Tabular Data

David Bonet, Daniel Mas Montserrat, Xavier Giró-i-Nieto et al.

AAAI 2024 · arXiv:2402.14335
Keywords: tabular data classification, hypernetwork architecture, meta-trained models, instant inference (+4)
26 citations
#25

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Zuyan Liu, Benlin Liu, Jiahui Wang et al.

ECCV 2024
25 citations
#26

A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

Kai Wang, Mingjia Shi, YuKun Zhou et al.

CVPR 2025
24 citations
#27

D₂O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models

Zhongwei Wan, Xinjian Wu, Yu Zhang et al.

ICLR 2025
Keywords: kv cache optimization, attention score analysis, long-context inference, generative inference efficiency (+2)
22 citations
#28

Conditional Information Bottleneck Approach for Time Series Imputation

MinGyu Choi, Changhee Lee

ICLR 2024
21 citations
#29

Towards Fast, Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians

Ishan Amin, Sanjeev Raja, Aditi Krishnapriyan

ICLR 2025
21 citations
#30

TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

Cheng Yang, Yang Sui, Jinqi Xiao et al.

CVPR 2025 · arXiv:2503.18278
Keywords: token pruning, vision-language models, inference time optimization, kv cache reduction (+3)
20 citations
#31

Efficiently Scaling LLM Reasoning Programs with Certaindex

Yichao Fu, Junda Chen, Siqi Zhu et al.

NeurIPS 2025
19 citations
#32

Falcon: Faster and Parallel Inference of Large Language Models Through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

Xiangxiang Gao, Weisheng Xie, Yiwei Xiang et al.

AAAI 2025
15 citations
#33

DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation

Changdae Oh, Yixuan Li, Kyungwoo Song et al.

ICLR 2025
15 citations
#34

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Xiang Liu, Zhenheng Tang, Peijie Dong et al.

NeurIPS 2025
14 citations
#35

The Need for Speed: Pruning Transformers with One Recipe

Samir Khaki, Konstantinos Plataniotis

ICLR 2024
14 citations
#36

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Xing Li, Zeyu Xing, Yiming Li et al.

ICML 2025
14 citations
#37

MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks

Zhiyu Zhu, Huaming Chen, Jiayu Zhang et al.

AAAI 2024 · arXiv:2312.13630
Keywords: attribution methods, model interpretability, boundary-based attribution, sensitivity axiom (+3)
14 citations
#38

MERGE: Fast Private Text Generation

Zi Liang, Pinghui Wang, Ruofei Zhang et al.

AAAI 2024 · arXiv:2305.15769
Keywords: private inference, transformer-based models, natural language generation, cloud model deployment (+4)
14 citations
#39

Efficient Inference for Large Language Model-based Generative Recommendation

Xinyu Lin, Chaoqun Yang, Wenjie Wang et al.

ICLR 2025
13 citations
#40

Scaling Inference Time Compute for Diffusion Models

Nanye Ma, Shangyuan Tong, Haolin Jia et al.

CVPR 2025
13 citations
#41

Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters

Kevin Li, Sachin Goyal, João D Semedo et al.

ICLR 2025
12 citations
#42

Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

Chen Ju, Haicheng Wang, Haozhe Cheng et al.

ECCV 2024
12 citations
#43

Imputation for prediction: beware of diminishing returns

Marine Le Morvan, Gaël Varoquaux

ICLR 2025 · arXiv:2407.19804
Keywords: missing value imputation, predictive modeling, missingness indicators, imputation accuracy (+4)
12 citations
#44

Revisiting In-context Learning Inference Circuit in Large Language Models

Hakaze Cho, Mariko Kato, Yoshihiro Sakai et al.

ICLR 2025
11 citations
#45

Data-Efficient Multimodal Fusion on a Single GPU

Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti et al.

CVPR 2024
10 citations
#46

Understanding and Improving Optimization in Predictive Coding Networks

Nicholas Alonso, Jeffrey Krichmar, Emre Neftci

AAAI 2024 · arXiv:2305.13562
Keywords: predictive coding networks, inference learning algorithm, biological plausibility, optimization methods (+3)
10 citations
#47

Variational Inference for SDEs Driven by Fractional Noise

Rembert Daems, Manfred Opper, Guillaume Crevecoeur et al.

ICLR 2024
10 citations
#48

Colour Passing Revisited: Lifted Model Construction with Commutative Factors

Malte Luttermann, Tanya Braun, Ralf Möller et al.

AAAI 2024 · arXiv:2309.11236
Keywords: lifted probabilistic inference, symmetry detection, probabilistic model compression, colour passing algorithm (+4)
10 citations
#49

Estimating Conditional Mutual Information for Dynamic Feature Selection

Soham Gadgil, Ian Covert, Su-In Lee

ICLR 2024
10 citations
#50

PowerMLP: An Efficient Version of KAN

Ruichen Qiu, Yibo Miao, Shiwen Wang et al.

AAAI 2025
9 citations
#51

InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

Yunhong Lu, Qichao Wang, Hengyuan Cao et al.

CVPR 2025
9 citations
#52

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Nadav Timor, Jonathan Mamou, Daniel Korat et al.

ICML 2025
9 citations
#53

Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner

Mengfei Xia, Yujun Shen, Changsong Lei et al.

CVPR 2024
9 citations
#54

Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models

Reza Shirkavand, Peiran Yu, Shangqian Gao et al.

CVPR 2025
8 citations
#55

K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

Zhikai Li, Xuewen Liu, Dongrong Joe Fu et al.

CVPR 2025
8 citations
#56

Compositional simulation-based inference for time series

Manuel Gloeckler, Shoji Toyota, Kenji Fukumizu et al.

ICLR 2025
8 citations
#57

PELA: Learning Parameter-Efficient Models with Low-Rank Approximation

Yangyang Guo, Guangzhi Wang, Mohan Kankanhalli

CVPR 2024
8 citations
#58

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Xukun Liu, Bowen Lei, Ruqi Zhang et al.

AAAI 2025
7 citations
#59

SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen et al.

AAAI 2025
7 citations
#60

Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Jingyu Liu, Beidi Chen, Ce Zhang

ICML 2025
7 citations
#61

Kinetics: Rethinking Test-Time Scaling Law

Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng et al.

NeurIPS 2025 · arXiv:2506.05333
Keywords: test-time scaling, memory access bottlenecks, sparse attention, inference-time strategies (+3)
7 citations
#62

Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference

Nadav Timor, Jonathan Mamou, Daniel Korat et al.

ICLR 2025 · arXiv:2405.14105
Keywords: speculative inference, lossless language model inference, distributed inference, speculation parallelism (+4)
7 citations
#63

Efficient Multitask Dense Predictor via Binarization

Yuzhang Shang, Dan Xu, Gaowen Liu et al.

CVPR 2024
6 citations
#64

OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models

Huanpeng Chu, Wei Wu, Guanyu Feng et al.

ICCV 2025
6 citations
#65

Solving Robust Markov Decision Processes: Generic, Reliable, Efficient

Tobias Meggendorfer, Maximilian Weininger, Patrick Wienhöft

AAAI 2025
6 citations
#66

EVOS: Efficient Implicit Neural Training via EVOlutionary Selector

Weixiang Zhang, Shuzhao Xie, Chengwei Ren et al.

CVPR 2025
6 citations
#67

Entropy-MCMC: Sampling from Flat Basins with Ease

Bolian Li, Ruqi Zhang

ICLR 2024
6 citations
#68

Prediction-Powered E-Values

Daniel Csillag, Claudio Struchiner, Guilherme Tegoni Goedert

ICML 2025
6 citations
#69

CausalPFN: Amortized Causal Effect Estimation via In-Context Learning

Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas et al.

NeurIPS 2025
6 citations
#70

FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

Tianyun Zhong, Chao Liang, Jianwen Jiang et al.

CVPR 2025 · arXiv:2412.16915
Keywords: diffusion models, audio-driven synthesis, talking avatar generation, model distillation (+4)
5 citations
#71

Accelerating Training with Neuron Interaction and Nowcasting Networks

Boris Knyazev, Abhinav Moudgil, Guillaume Lajoie et al.

ICLR 2025
5 citations
#72

Adaptive Non-Uniform Timestep Sampling for Accelerating Diffusion Model Training

Myunsoo Kim, Donghyeon Ki, Seong-Woong Shim et al.

CVPR 2025
5 citations
#73

In-Context Learning of Stochastic Differential Equations with Foundation Inference Models

Patrick Seifner, Kostadin Cvejoski, David Berghaus et al.

NeurIPS 2025
5 citations
#74

Efficient Parallel Training Methods for Spiking Neural Networks with Constant Time Complexity

Wanjin Feng, Xingyu Gao, Wenqian Du et al.

ICML 2025
4 citations
#75

A Practical Approach to Causal Inference over Time

Martina Cinquini, Isacco Beretta, Salvatore Ruggieri et al.

AAAI 2025
4 citations
#76

ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models

Shadi Hamdan, Chonghao Sima, Zetong Yang et al.

ICCV 2025
4 citations
#77

Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

Jiaxin Deng, Junbiao Pang, Baochang Zhang et al.

AAAI 2025
4 citations
#78

DF-MIA: A Distribution-Free Membership Inference Attack on Fine-Tuned Large Language Models

Zhiheng Huang, Yannan Liu, Daojing He et al.

AAAI 2025
4 citations
#79

Efficient Logit-based Knowledge Distillation of Deep Spiking Neural Networks for Full-Range Timestep Deployment

Chengting Yu, Xiaochen Zhao, Lei Liu et al.

ICML 2025
4 citations
#80

Fast and Low-Cost Genomic Foundation Models via Outlier Removal

Haozheng Luo, Chenghao Qiu, Maojiang Su et al.

ICML 2025
4 citations
#81

DCT-CryptoNets: Scaling Private Inference in the Frequency Domain

Arjun Roy, Kaushik Roy

ICLR 2025
4 citations
#82

Efficient and Accurate Explanation Estimation with Distribution Compression

Hubert Baniecki, Giuseppe Casalicchio, Bernd Bischl et al.

ICLR 2025
4 citations
#83

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

Xi Chen, Chang Gao, Zuowen Wang et al.

AAAI 2024 · arXiv:2312.09391
Keywords: recurrent neural networks, temporal sparsity, backpropagation through time, edge computing training (+4)
4 citations
#84

Flow-based Variational Mutual Information: Fast and Flexible Approximations

Caleb Dahlke, Jason Pacheco

ICLR 2025
4 citations
#85

Better Language Model Inversion by Compactly Representing Next-Token Distributions

Murtaza Nazir, Matthew Finlayson, John Morris et al.

NeurIPS 2025
4 citations
#86

Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

Hao Kang, Qingru Zhang, Han Cai et al.

NeurIPS 2025
4 citations
#87

Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference

Dongyan Huo, Yudong Chen, Qiaomin Xie

AAAI 2024 · arXiv:2312.10894
Keywords: linear stochastic approximation, markovian data, constant stepsize, statistical inference (+4)
4 citations
#88

Conformal Inference of Individual Treatment Effects Using Conditional Density Estimates

Baozhen Wang, Xingye Qiao

AAAI 2025
4 citations
#89

FREE-Merging: Fourier Transform for Efficient Model Merging

Shenghe Zheng, Hongzhi Wang

ICCV 2025
3 citations
#90

Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models

Julius Vetter, Manuel Gloeckler, Daniel Gedon et al.

NeurIPS 2025
3 citations
#91

Improving Generalization with Flat Hilbert Bayesian Inference

Tuan Truong, Quyen Tran, Ngoc Quan Pham et al.

ICML 2025
3 citations
#92

Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models

Jialin Zhao, Yingtao Zhang, Carlo Cannistraci

ICML 2025
3 citations
#93

Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference

Jorge García-Carrasco, Alejandro Maté, Juan Trujillo

AAAI 2025
3 citations
#94

Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks

Emanuel Sommer, Jakob Robnik, Giorgi Nozadze et al.

ICLR 2025
3 citations
#95

Maximum Entropy Model Correction in Reinforcement Learning

Amin Rakhsha, Mete Kemertas, Mohammad Ghavamzadeh et al.

ICLR 2024
3 citations
#96

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Jinwoo Park, Seunggeun Cho, Dongsu Han

NeurIPS 2025
3 citations
#97

How Benchmark Prediction from Fewer Data Misses the Mark

Guanhua Zhang, Florian E. Dorner, Moritz Hardt

NeurIPS 2025
3 citations
#98

Structural Inference with Dynamics Encoding and Partial Correlation Coefficients

Aoran Wang, Jun Pang

ICLR 2024
3 citations
#99

DINGO: Constrained Inference for Diffusion LLMs

Tarun Suresh, Debangshu Banerjee, Shubham Ugare et al.

NeurIPS 2025
3 citations
#100

HShare: Fast LLM Decoding by Hierarchical Key-Value Sharing

Huaijin Wu, Lianqiang Li, Hantao Huang et al.

ICLR 2025
3 citations