"kv cache compression" Papers

15 papers found

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Xiang Liu, Zhenheng Tang, Peijie Dong et al.

NeurIPS 2025 · arXiv:2502.00299
16 citations

Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot et al.

NeurIPS 2025 · arXiv:2506.05345
17 citations

InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Minsoo Kim, Kyuhong Shim, Jungwook Choi et al.

NeurIPS 2025 (oral) · arXiv:2506.15745
16 citations

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon et al.

NeurIPS 2025 (oral) · arXiv:2505.23416
17 citations

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

Kunjun Li, Zigeng Chen, Cheng-Yen Yang et al.

NeurIPS 2025 · arXiv:2505.19602
9 citations

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Siyuan Li, Luyuan Zhang, Zedong Wang et al.

CVPR 2025 · arXiv:2504.00999
7 citations

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

Hanlin Tang, Yang Lin, Jing Lin et al.

ICLR 2025 · arXiv:2407.15891
62 citations

Retrieval Head Mechanistically Explains Long-Context Factuality

Wenhao Wu, Yizhong Wang, Guangxuan Xiao et al.

ICLR 2025 · arXiv:2404.15574
150 citations

SALS: Sparse Attention in Latent Space for KV Cache Compression

Junlin Mu, Hantao Huang, Jihang Zhang et al.

NeurIPS 2025 · arXiv:2510.24273

Tensor Product Attention Is All You Need

Yifan Zhang, Yifeng Liu, Huizhuo Yuan et al.

NeurIPS 2025 (spotlight) · arXiv:2501.06425
34 citations

The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?

Zhenheng Tang, Xiang Liu, Qian Wang et al.

ICLR 2025 · arXiv:2502.17535
11 citations

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Chaofan Lin, Jiaming Tang, Shuo Yang et al.

NeurIPS 2025 (spotlight) · arXiv:2502.02770
14 citations

CaM: Cache Merging for Memory-efficient LLMs Inference

Yuxin Zhang, Yuxuan Du, Gen Luo et al.

ICML 2024

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Harry Dong, Xinyu Yang, Zhenyu Zhang et al.

ICML 2024 · arXiv:2402.09398
78 citations

LoCoCo: Dropping In Convolutions for Long Context Compression

Ruisi Cai, Yuandong Tian, Zhangyang “Atlas” Wang et al.

ICML 2024 · arXiv:2406.05317
16 citations