Efficient Inference
Fast and efficient model inference
Top Papers
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Suyu Ge, Yunan Zhang, Liyuan Liu et al.
Training Language Models to Reason Efficiently
Daman Arora, Andrea Zanette
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving
Yangzhen Wu, Zhiqing Sun, Shanda Li et al.
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed
Yifan Wang, Xingyi He, Sida Peng et al.
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
Han Zhao, Min Zhang, Wei Zhao et al.
DEIM: DETR with Improved Matching for Fast Convergence
Shihua Huang, Zhichao Lu, Xiaodong Cun et al.
Consistency Models Made Easy
Zhengyang Geng, Ashwini Pokle, Weijian Luo et al.
dKV-Cache: The Cache for Diffusion Language Models
Xinyin Ma, Runpeng Yu, Gongfan Fang et al.
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Hritik Bansal, Arian Hosseini, Rishabh Agarwal et al.
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Wenlei Bao et al.
Inductive Moment Matching
Linqi (Alex) Zhou, Stefano Ermon, Jiaming Song
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Xunhao Lai, Jianqiao Lu, Yao Luo et al.
Visual Agents as Fast and Slow Thinkers
Guangyan Sun, Mingyu Jin, Zhenting Wang et al.
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
Heming Xia, Yongqi Li, Jun Zhang et al.
SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction
Yang Zhou, Hao Shao, Letian Wang et al.
Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
Yiming Wang, Pei Zhang, Siyuan Huang et al.
TinySAM: Pushing the Envelope for Efficient Segment Anything Model
Han Shu, Wenshuo Li, Yehui Tang et al.
LION: Implicit Vision Prompt Tuning
Haixin Wang, Jianlong Chang, Yihang Zhai et al.
Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Barys Liskavets, Maxim Ushakov, Shuvendu Roy et al.
Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed
Yubin Xiao, Di Wang, Boyang Li et al.
Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms
Yinuo Ren, Haoxuan Chen, Yuchen Zhu et al.
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Le Zhuo, Liangbing Zhao, Sayak Paul et al.
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
Yizhe Xiong, Hui Chen, Tianxiang Hao et al.
HyperFast: Instant Classification for Tabular Data
David Bonet, Daniel Mas Montserrat, Xavier Giró-i-Nieto et al.
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Zuyan Liu, Benlin Liu, Jiahui Wang et al.
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
Kai Wang, Mingjia Shi, YuKun Zhou et al.
$\text{D}_{2}\text{O}$: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
Zhongwei Wan, Xinjian Wu, Yu Zhang et al.
Conditional Information Bottleneck Approach for Time Series Imputation
MinGyu Choi, Changhee Lee
Towards Fast, Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians
Ishan Amin, Sanjeev Raja, Aditi Krishnapriyan
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang, Yang Sui, Jinqi Xiao et al.
Efficiently Scaling LLM Reasoning Programs with Certaindex
Yichao Fu, Junda Chen, Siqi Zhu et al.
Falcon: Faster and Parallel Inference of Large Language Models Through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree
Xiangxiang Gao, Weisheng Xie, Yiwei Xiang et al.
DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation
Changdae Oh, Yixuan Li, Kyungwoo Song et al.
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Xiang Liu, Zhenheng Tang, Peijie Dong et al.
The Need for Speed: Pruning Transformers with One Recipe
Samir Khaki, Konstantinos Plataniotis
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Xing Li, Zeyu Xing, Yiming Li et al.
MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks
Zhiyu Zhu, Huaming Chen, Jiayu Zhang et al.
MERGE: Fast Private Text Generation
Zi Liang, Pinghui Wang, Ruofei Zhang et al.
Efficient Inference for Large Language Model-based Generative Recommendation
Xinyu Lin, Chaoqun Yang, Wenjie Wang et al.
Scaling Inference Time Compute for Diffusion Models
Nanye Ma, Shangyuan Tong, Haolin Jia et al.
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
Kevin Li, Sachin Goyal, João D Semedo et al.
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Chen Ju, Haicheng Wang, Haozhe Cheng et al.
Imputation for prediction: beware of diminishing returns.
Marine Le Morvan, Gael Varoquaux
Revisiting In-context Learning Inference Circuit in Large Language Models
Hakaze Cho, Mariko Kato, Yoshihiro Sakai et al.
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti et al.
Understanding and Improving Optimization in Predictive Coding Networks
Nicholas Alonso, Jeffrey Krichmar, Emre Neftci
Variational Inference for SDEs Driven by Fractional Noise
Rembert Daems, Manfred Opper, Guillaume Crevecoeur et al.
Colour Passing Revisited: Lifted Model Construction with Commutative Factors
Malte Luttermann, Tanya Braun, Ralf Möller et al.
Estimating Conditional Mutual Information for Dynamic Feature Selection
Soham Gadgil, Ian Covert, Su-In Lee
PowerMLP: An Efficient Version of KAN
Ruichen Qiu, Yibo Miao, Shiwen Wang et al.
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
Yunhong Lu, Qichao Wang, Hengyuan Cao et al.
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor, Jonathan Mamou, Daniel Korat et al.
Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner
Mengfei Xia, Yujun Shen, Changsong Lei et al.
Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models
Reza Shirkavand, Peiran Yu, Shangqian Gao et al.
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Zhikai Li, Xuewen Liu, Dongrong Joe Fu et al.
Compositional simulation-based inference for time series
Manuel Gloeckler, Shoji Toyota, Kenji Fukumizu et al.
PELA: Learning Parameter-Efficient Models with Low-Rank Approximation
Yangyang Guo, Guangzhi Wang, Mohan Kankanhalli
Adaptive Draft-Verification for Efficient Large Language Model Decoding
Xukun Liu, Bowen Lei, Ruqi Zhang et al.
SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models
Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen et al.
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
Jingyu Liu, Beidi Chen, Ce Zhang
Kinetics: Rethinking Test-Time Scaling Law
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng et al.
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
Nadav Timor, Jonathan Mamou, Daniel Korat et al.
Efficient Multitask Dense Predictor via Binarization
Yuzhang Shang, Dan Xu, Gaowen Liu et al.
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Huanpeng Chu, Wei Wu, Guanyu Feng et al.
Solving Robust Markov Decision Processes: Generic, Reliable, Efficient
Tobias Meggendorfer, Maximilian Weininger, Patrick Wienhöft
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector
Weixiang Zhang, Shuzhao Xie, Chengwei Ren et al.
Entropy-MCMC: Sampling from Flat Basins with Ease
Bolian Li, Ruqi Zhang
Prediction-Powered E-Values
Daniel Csillag, Claudio Struchiner, Guilherme Tegoni Goedert
CausalPFN: Amortized Causal Effect Estimation via In-Context Learning
Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas et al.
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
Tianyun Zhong, Chao Liang, Jianwen Jiang et al.
Accelerating Training with Neuron Interaction and Nowcasting Networks
Boris Knyazev, Abhinav Moudgil, Guillaume Lajoie et al.
Adaptive Non-Uniform Timestep Sampling for Accelerating Diffusion Model Training
Myunsoo Kim, Donghyeon Ki, Seong-Woong Shim et al.
In-Context Learning of Stochastic Differential Equations with Foundation Inference Models
Patrick Seifner, Kostadin Cvejoski, David Berghaus et al.
Efficient Parallel Training Methods for Spiking Neural Networks with Constant Time Complexity
Wanjin Feng, Xingyu Gao, Wenqian Du et al.
A Practical Approach to Causal Inference over Time
Martina Cinquini, Isacco Beretta, Salvatore Ruggieri et al.
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
Shadi Hamdan, Chonghao Sima, Zetong Yang et al.
Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization
Jiaxin Deng, Junbiao Pang, Baochang Zhang et al.
DF-MIA: A Distribution-Free Membership Inference Attack on Fine-Tuned Large Language Models
Zhiheng Huang, Yannan Liu, Daojing He et al.
Efficient Logit-based Knowledge Distillation of Deep Spiking Neural Networks for Full-Range Timestep Deployment
Chengting Yu, Xiaochen Zhao, Lei Liu et al.
Fast and Low-Cost Genomic Foundation Models via Outlier Removal
Haozheng Luo, Chenghao Qiu, Maojiang Su et al.
DCT-CryptoNets: Scaling Private Inference in the Frequency Domain
Arjun Roy, Kaushik Roy
Efficient and Accurate Explanation Estimation with Distribution Compression
Hubert Baniecki, Giuseppe Casalicchio, Bernd Bischl et al.
Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training
Xi Chen, Chang Gao, Zuowen Wang et al.
Flow-based Variational Mutual Information: Fast and Flexible Approximations
Caleb Dahlke, Jason Pacheco
Better Language Model Inversion by Compactly Representing Next-Token Distributions
Murtaza Nazir, Matthew Finlayson, John Morris et al.
Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
Hao Kang, Qingru Zhang, Han Cai et al.
Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference
Dongyan Huo, Yudong Chen, Qiaomin Xie
Conformal Inference of Individual Treatment Effects Using Conditional Density Estimates
Baozhen Wang, Xingye Qiao
FREE-Merging: Fourier Transform for Efficient Model Merging
Shenghe Zheng, Hongzhi Wang
Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models
Julius Vetter, Manuel Gloeckler, Daniel Gedon et al.
Improving Generalization with Flat Hilbert Bayesian Inference
Tuan Truong, Quyen Tran, Ngoc Quan Pham et al.
Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
Jialin Zhao, Yingtao Zhang, Carlo Cannistraci
Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference
Jorge García-Carrasco, Alejandro Maté, Juan Trujillo
Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks
Emanuel Sommer, Jakob Robnik, Giorgi Nozadze et al.
Maximum Entropy Model Correction in Reinforcement Learning
Amin Rakhsha, Mete Kemertas, Mohammad Ghavamzadeh et al.
SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs
Jinwoo Park, Seunggeun Cho, Dongsu Han
How Benchmark Prediction from Fewer Data Misses the Mark
Guanhua Zhang, Florian E. Dorner, Moritz Hardt
Structural Inference with Dynamics Encoding and Partial Correlation Coefficients
Aoran Wang, Jun Pang
DINGO: Constrained Inference for Diffusion LLMs
Tarun Suresh, Debangshu Banerjee, Shubham Ugare et al.
HShare: Fast LLM Decoding by Hierarchical Key-Value Sharing
Huaijin Wu, Lianqiang Li, Hantao Huang et al.