AI Featured Updates
Smart score: 75
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
AI recommendation rationale
Quantifies KV cache savings (only 10%-33.7% of entries retained) and decoding speedup (2.1-4.6x) with a specific KV pruning methodology, offering actionable insights for efficient LLM deployment.
Key takeaways
This paper introduces a method for reducing LLM memory use by pruning the key-value (KV) cache, using a predictor that scores each token's future usefulness. The model keeps recent tokens, and keeps an older token only when its score exceeds a threshold. Trained via next-token prediction, it retains just 10%-33.7% of KV entries while matching performance, and delivers a 2.1-4.6x decoding speedup in long-context scenarios.
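
The retention rule described above (keep recent tokens, keep older tokens only above a score threshold) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `kv_keep_mask` and `retention_rate`, the `window` and `threshold` parameters, and the example scores are all hypothetical, and the learned predictor is replaced here by a hand-written list of scores.

```python
from typing import List, Sequence


def kv_keep_mask(scores: Sequence[float], window: int, threshold: float) -> List[bool]:
    """Return True for each cached position whose KV entry is kept.

    Assumed rule (hypothetical parameters, not taken from the paper):
      * the last `window` positions are always kept (recent tokens);
      * an older position is kept only if its score exceeds `threshold`.
    """
    n = len(scores)
    recent_start = max(0, n - window)  # first index counted as "recent"
    return [i >= recent_start or scores[i] > threshold for i in range(n)]


def retention_rate(mask: Sequence[bool]) -> float:
    """Fraction of KV entries that survive pruning."""
    return sum(mask) / len(mask) if mask else 0.0


# Made-up utility scores standing in for the predictor's output.
scores = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.4, 0.1]
mask = kv_keep_mask(scores, window=2, threshold=0.5)
print(mask)
print(retention_rate(mask))
```

In this toy example, positions 0 and 3 survive because of their scores and positions 6 and 7 survive because they fall inside the recent window, so half of the entries are kept. The retention range quoted in the summary would correspond to a much sparser mask over a long context.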