AI Curated Updates, Smart Score: 75

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

Source: Twitter following list

Author: Rohan Paul (@rohanpaul_ai)

Published: 2026-06-03

Indexed: 2026-06-03

Why AI Recommends This

Quantifies KV cache savings (only 10%-33.7% of KV entries retained) and decoding speedup (2.1-4.6x), and describes a specific KV pruning methodology, offering actionable insights for efficient LLM deployment.

Key Takeaways

This paper introduces a method that lets LLMs save memory by pruning the key-value (KV) cache, using a predictor that scores each token's future usefulness. The model always keeps recent tokens and keeps older tokens only when their score exceeds a threshold. Trained via next-token prediction, the approach retains only 10%-33.7% of KV entries while matching performance, and it delivers a 2.1-4.6x decoding speedup in long-context scenarios.
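
The retention rule described above (keep recent tokens, keep older ones only above a score threshold) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `window` and `threshold` parameters, and the example scores are all hypothetical, and the scores stand in for whatever the learned predictor would output.

```python
from typing import List


def kv_keep_mask(scores: List[float], window: int, threshold: float) -> List[bool]:
    """Return one flag per cached token: True means its KV entry is kept.

    scores:    hypothetical per-token utility scores, oldest token first
    window:    number of most recent tokens that are always kept (assumed)
    threshold: older tokens are kept only if score > threshold (assumed)
    """
    n = len(scores)
    recent_start = max(0, n - window)  # index where the always-kept window begins
    return [i >= recent_start or s > threshold for i, s in enumerate(scores)]


def retention(mask: List[bool]) -> float:
    """Fraction of KV entries that survive pruning."""
    return sum(mask) / len(mask) if mask else 0.0


# Illustrative scores only; a real system would get these from the predictor.
scores = [0.9, 0.1, 0.2, 0.7, 0.05, 0.3, 0.4, 0.6]
mask = kv_keep_mask(scores, window=2, threshold=0.5)
print(mask)             # [True, False, False, True, False, False, True, True]
print(retention(mask))  # 0.5
```

In this toy run, two older tokens pass the threshold and the last two tokens are kept by the recency window, so half of the entries remain. The summary reports 10%-33.7% retention for the actual method, which would correspond to a much sparser mask over long contexts.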

#ResearchBreakthrough #Technology #Models
