"memory-efficient inference" Papers
6 papers found
Computation and Memory-Efficient Model Compression with Gradient Reweighting
Zhiwei Li, Yuesen Liao, Binrui Wu et al.
NeurIPS 2025 (poster)
Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Kunjun Li, Zigeng Chen, Cheng-Yen Yang et al.
NeurIPS 2025 (poster), arXiv:2505.19602
6 citations
Surprising Effectiveness of pretraining Ternary Language Model at Scale
Ayush Kaushal, Tejas Vaidhya, Arnab Mondal et al.
ICLR 2025 (poster), arXiv:2407.12327
13 citations
UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression
Chenlong Deng, Zhisong Zhang, Kelong Mao et al.
NeurIPS 2025 (poster), arXiv:2509.15763
1 citation
CaM: Cache Merging for Memory-efficient LLMs Inference
Yuxin Zhang, Yuxuan Du, Gen Luo et al.
ICML 2024 (poster)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin et al.
ICML 2024 (poster)