Poster "kv cache compression" Papers
11 papers found
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Xiang Liu, Zhenheng Tang, Peijie Dong et al.
NeurIPS 2025 · arXiv:2502.00299 · 16 citations

Inference-Time Hyper-Scaling with KV Cache Compression
Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot et al.
NeurIPS 2025 · arXiv:2506.05345 · 17 citations

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Kunjun Li, Zigeng Chen, Cheng-Yen Yang et al.
NeurIPS 2025 · arXiv:2505.19602 · 9 citations

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
Siyuan Li, Luyuan Zhang, Zedong Wang et al.
CVPR 2025 · arXiv:2504.00999 · 7 citations

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Hanlin Tang, Yang Lin, Jing Lin et al.
ICLR 2025 · arXiv:2407.15891 · 62 citations

Retrieval Head Mechanistically Explains Long-Context Factuality
Wenhao Wu, Yizhong Wang, Guangxuan Xiao et al.
ICLR 2025 · arXiv:2404.15574 · 150 citations

SALS: Sparse Attention in Latent Space for KV Cache Compression
Junlin Mu, Hantao Huang, Jihang Zhang et al.
NeurIPS 2025 · arXiv:2510.24273

The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
Zhenheng Tang, Xiang Liu, Qian Wang et al.
ICLR 2025 · arXiv:2502.17535 · 11 citations

CaM: Cache Merging for Memory-efficient LLMs Inference
Yuxin Zhang, Yuxuan Du, Gen Luo et al.
ICML 2024

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
Harry Dong, Xinyu Yang, Zhenyu Zhang et al.
ICML 2024 · arXiv:2402.09398 · 79 citations

LoCoCo: Dropping In Convolutions for Long Context Compression
Ruisi Cai, Yuandong Tian, Zhangyang “Atlas” Wang et al.
ICML 2024 · arXiv:2406.05317 · 16 citations