KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method that enables effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of each KV pair by having the underlying LLM reconstruct the original context from the cached KV pairs, then evicts the pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4$\times$ and FlashAttention decoding latency by approximately 2$\times$, with negligible performance loss on question-answering, retrieval, reasoning, and code-comprehension tasks. Evaluations cover various models, including LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer performance degradation even at a 90% cache budget ratio under multi-query scenarios.
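To make the scoring idea concrete, below is a minimal sketch of query-agnostic KV importance scoring as the abstract describes it: run the LLM on a reconstruction pass over the cached context and score each cached KV pair by the maximum attention it receives, then keep only the top-scoring pairs. All tensor shapes, the `score_kv_importance`/`evict` helpers, and the max-attention scoring rule here are illustrative assumptions, not the authors' released implementation.

```python
import torch

def score_kv_importance(attn: torch.Tensor) -> torch.Tensor:
    """attn: [heads, recon_len, ctx_len] attention weights collected while
    the LLM regenerates the original context from its KV cache.
    Returns a per-position importance score of shape [ctx_len]."""
    # Max over heads and reconstruction steps: a KV pair is important if
    # any head attends to it strongly at any point during reconstruction.
    return attn.amax(dim=(0, 1))

def evict(keys, values, attn, keep_ratio=0.3):
    """Keep the top `keep_ratio` fraction of KV pairs by importance."""
    scores = score_kv_importance(attn)
    k = max(1, int(keep_ratio * scores.numel()))
    idx = scores.topk(k).indices.sort().values  # preserve positional order
    return keys[:, idx], values[:, idx]

# Toy usage with random tensors standing in for a real model's cache.
heads, ctx_len, recon_len, d = 8, 128, 64, 32
keys = torch.randn(heads, ctx_len, d)
values = torch.randn(heads, ctx_len, d)
attn = torch.softmax(torch.randn(heads, recon_len, ctx_len), dim=-1)
k_small, v_small = evict(keys, values, attn, keep_ratio=0.3)
print(k_small.shape)  # torch.Size([8, 38, 32]): roughly a 3.4x smaller cache
```

Because the reconstruction prompt depends only on the context, not on any downstream question, the resulting compressed cache can be reused across many queries, which is the property the paper contrasts with query-aware eviction methods.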
Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
Computing Technology; Computer Science
Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction [EB/OL]. (2025-05-29) [2025-07-16]. https://arxiv.org/abs/2505.23416