"kv cache compression" Papers
15 papers found
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Xiang Liu, Zhenheng Tang, Peijie Dong et al.
NeurIPS 2025 · arXiv:2502.00299 · 16 citations
Inference-Time Hyper-Scaling with KV Cache Compression
Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot et al.
NeurIPS 2025 · arXiv:2506.05345 · 17 citations
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Minsoo Kim, Kyuhong Shim, Jungwook Choi et al.
NeurIPS 2025 (oral) · arXiv:2506.15745 · 16 citations
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon et al.
NeurIPS 2025 (oral) · arXiv:2505.23416 · 17 citations
Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Kunjun Li, Zigeng Chen, Cheng-Yen Yang et al.
NeurIPS 2025 · arXiv:2505.19602 · 9 citations
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
Siyuan Li, Luyuan Zhang, Zedong Wang et al.
CVPR 2025 · arXiv:2504.00999 · 7 citations
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Hanlin Tang, Yang Lin, Jing Lin et al.
ICLR 2025 · arXiv:2407.15891 · 62 citations
Retrieval Head Mechanistically Explains Long-Context Factuality
Wenhao Wu, Yizhong Wang, Guangxuan Xiao et al.
ICLR 2025 · arXiv:2404.15574 · 150 citations
SALS: Sparse Attention in Latent Space for KV Cache Compression
Junlin Mu, Hantao Huang, Jihang Zhang et al.
NeurIPS 2025 · arXiv:2510.24273
Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan et al.
NeurIPS 2025 (spotlight) · arXiv:2501.06425 · 34 citations
The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
Zhenheng Tang, Xiang Liu, Qian Wang et al.
ICLR 2025 · arXiv:2502.17535 · 11 citations
Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
Chaofan Lin, Jiaming Tang, Shuo Yang et al.
NeurIPS 2025 (spotlight) · arXiv:2502.02770 · 14 citations
CaM: Cache Merging for Memory-efficient LLMs Inference
Yuxin Zhang, Yuxuan Du, Gen Luo et al.
ICML 2024
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
Harry Dong, Xinyu Yang, Zhenyu Zhang et al.
ICML 2024 · arXiv:2402.09398 · 78 citations
LoCoCo: Dropping In Convolutions for Long Context Compression
Ruisi Cai, Yuandong Tian, Zhangyang “Atlas” Wang et al.
ICML 2024 · arXiv:2406.05317 · 16 citations