AI Featured Updates | Smart score: 75

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

Source: Twitter following list
Author: Rohan Paul (@rohanpaul_ai)
Published: 2026-06-03
Indexed: 2026-06-03
AI recommendation rationale
Quantifies KV cache savings (only 10%-33.7% of entries retained) and decoding speedup (2.1-4.6x), and describes a specific KV pruning methodology, offering actionable insights for efficient LLM deployment.
Key takeaways
This paper introduces a way for LLMs to save memory by pruning the key-value (KV) cache, using a predictor that scores each token's expected future usefulness. The model keeps recent tokens, and keeps an older token only when its score exceeds a threshold. Trained via next-token prediction, the method retains only 10%-33.7% of KV entries while matching performance, and delivers a 2.1-4.6x decoding speedup in long-context scenarios.
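The retention rule described above (keep recent tokens, plus older tokens whose score exceeds a threshold) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function name `select_kv_indices`, the parameters `recent_window` and `threshold`, and the example scores are all hypothetical, and the scores are taken as given rather than produced by a trained predictor.

```python
from typing import List, Sequence


def select_kv_indices(
    scores: Sequence[float],
    recent_window: int,
    threshold: float,
) -> List[int]:
    """Return the token positions whose KV entries stay in the cache.

    Hypothetical rule, mirroring the summary above:
      * the last `recent_window` positions are kept regardless of score;
      * an older position is kept only if its score exceeds `threshold`.
    """
    n = len(scores)
    first_recent = max(0, n - recent_window)  # start of the recent window
    kept = []
    for pos, score in enumerate(scores):
        if pos >= first_recent or score > threshold:
            kept.append(pos)
    return kept


# Made-up utility scores for an 8-token context (illustrative only).
scores = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.4, 0.6]
kept = select_kv_indices(scores, recent_window=2, threshold=0.5)
retention = len(kept) / len(scores)  # fraction of KV entries retained
print(kept, retention)
```

In this toy example half of the entries survive. The 10%-33.7% retention quoted above would correspond to a far more selective predictor and threshold on much longer contexts.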
#ResearchBreakthrough #Technology #Models