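AI Curated Updates
Smart score: 60
DeepSeek releases DSpark inference optimization method
AI recommendation rationale
DSpark introduces selective verification and confidence scheduling to address the wasted verification work on long draft blocks in speculative decoding. Worth watching for anyone following inference optimization.
Core analysis
DeepSeek has released DSpark, a semi-parallel speculative decoding system that delivers 60% to 85% faster per-user generation on DeepSeek-V4. Its core innovation is selective verification of draft tokens: it uses a parallel draft model adjusted by a Markov head, and adds confidence scheduling that decides how many tokens to verify based on acceptance probability and GPU load.
Full text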
Fantastic, @deepseek_ai just published their new inference optimization method.
The paper proposes DSpark, a semi-parallel speculative decoding system that gave DeepSeek-V4 about 60% to 85% faster per-user generation at matched throughput.
The biggest idea in DSpark is that faster inference is not just about drafting more tokens, but about deciding which drafted tokens are worth checking.
Speculative decoding already had the basic trick: a smaller draft model guesses several next tokens, then the real model checks them all in one pass.
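
To make the basic trick concrete, here is a minimal sketch of greedy draft verification. It is a toy, not DSpark's code: `verify_draft` and `target_next` are hypothetical names, and the loop only simulates what a real target model would compute for all positions in a single forward pass.

```python
from typing import Callable, List


def verify_draft(
    context: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Keep drafted tokens while they match the target model's greedy choice.

    `target_next` is a hypothetical stand-in for the large model: given a
    token sequence, it returns the token the large model would emit next.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token matched; the same pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    """Toy target model (assumption): always counts upward modulo 10."""
    return (seq[-1] + 1) % 10
```

With `toy_target`, a draft of `[3, 4, 9, 9]` after the context `[1, 2]` keeps `3` and `4`, replaces the first wrong guess with `5`, and throws away the last drafted token, which is the wasted work the thread goes on to describe.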
The problem is that long draft blocks often waste work, because later guesses are more likely to be wrong, and checking bad guesses still uses GPU capacity.
DSpark’s breakthrough is to make this process selective: it drafts a block, scores how likely each prefix is to survive, then verifies only the part that is likely to pay off.
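
A rough way to picture "scoring how likely each prefix is to survive": if each drafted token has some estimated acceptance probability, a prefix survives only if every token in it is accepted. The sketch below assumes independent per-token probabilities and a fixed cutoff; both the independence assumption and the `min_survival` threshold are illustrative choices, not details taken from the paper.

```python
from typing import List


def prefix_survival(accept_probs: List[float]) -> List[float]:
    """Probability that the prefix ending at each position is fully accepted.

    Assumes (for illustration only) that per-token acceptance is independent,
    so the prefix probability is a running product.
    """
    survival: List[float] = []
    running = 1.0
    for p in accept_probs:
        running *= p
        survival.append(running)
    return survival


def verify_length(accept_probs: List[float], min_survival: float = 0.5) -> int:
    """Number of leading draft tokens worth sending to the target model.

    `min_survival` is a hypothetical cutoff: stop at the first position whose
    prefix is less likely than this to survive verification.
    """
    count = 0
    for s in prefix_survival(accept_probs):
        if s < min_survival:
            break
        count += 1
    return count
```

For per-token estimates of 0.9, 0.8, 0.5, 0.5, the prefix survival values are roughly 0.9, 0.72, 0.36, 0.18, so with a cutoff of 0.5 only the first two drafted tokens would be verified.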
The mechanism has two linked parts: a strong parallel draft model makes many token guesses quickly, then a tiny Markov head adjusts each guess using the token right before it.
That small sequential piece matters because pure parallel drafters are fast, but the quality of their later tokens decays: each position guesses without knowing which token was actually sampled before it.
In other words, fully parallel drafters guess every position too independently, which can create bad token combinations later in the block.
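
The difference between independent guessing and a previous-token correction can be shown with a toy example. This is only a sketch of the idea: the four-word vocabulary, the additive `bigram_bias` table, and the function names are assumptions made for illustration, and the real Markov head is a learned component whose exact form is not described in this thread.

```python
from typing import Dict, List, Optional, Tuple

VOCAB = ["New", "York", "Los", "Angeles"]  # toy vocabulary (assumption)


def argmax(scores: List[float]) -> int:
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def draft_independent(logits: List[List[float]]) -> List[int]:
    """Fully parallel drafting: each position picks its own best token."""
    return [argmax(row) for row in logits]


def draft_with_markov_head(
    logits: List[List[float]],
    bigram_bias: Dict[Tuple[Optional[int], int], float],
    prev_token: Optional[int] = None,
) -> List[int]:
    """Adjust each position's scores using the token chosen just before it.

    `bigram_bias[(prev, cand)]` is a hypothetical additive correction; pairs
    that are missing from the table get no adjustment.
    """
    tokens: List[int] = []
    for row in logits:
        adjusted = [
            score + bigram_bias.get((prev_token, cand), 0.0)
            for cand, score in enumerate(row)
        ]
        prev_token = argmax(adjusted)
        tokens.append(prev_token)
    return tokens
```

If position 0 slightly prefers "New" and position 1 slightly prefers "Angeles", independent drafting produces "New Angeles". With a bias that favors "York" after "New", the corrected draft becomes "New York", at the cost of one cheap sequential step per position.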
Then the confidence scheduler estimates how many drafted tokens should be checked for each request, based on both acceptance chance and current GPU load.
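
One way such a scheduler could work is sketched below, under stated assumptions: the verification budget shrinks linearly with GPU load, and the budget is handed out greedily to whichever request's next drafted token has the highest prefix survival probability. The linear load rule, the greedy policy, and the names `schedule_verification`, `gpu_load`, and `max_tokens` are all hypothetical; the thread does not specify how DSpark actually combines the two signals.

```python
import heapq
from typing import List, Tuple


def schedule_verification(
    survival: List[List[float]],
    gpu_load: float,
    max_tokens: int,
) -> List[int]:
    """Decide how many drafted tokens to verify for each request.

    survival[r][i] is the estimated probability that request r's draft
    prefix ending at position i is fully accepted. gpu_load is in [0, 1].
    """
    # Hypothetical load rule: a busier GPU leaves less room for verification.
    budget = int(max_tokens * (1.0 - gpu_load))
    counts = [0] * len(survival)

    # Max-heap (via negated keys) over each request's next unverified position.
    heap: List[Tuple[float, int]] = [
        (-probs[0], r) for r, probs in enumerate(survival) if probs
    ]
    heapq.heapify(heap)

    while budget > 0 and heap:
        _, r = heapq.heappop(heap)
        counts[r] += 1
        budget -= 1
        nxt = counts[r]
        if nxt < len(survival[r]):
            # Only the next position becomes eligible, so each request is
            # always verified as a contiguous prefix of its draft.
            heapq.heappush(heap, (-survival[r][nxt], r))
    return counts
```

With two requests whose survival curves are `[0.9, 0.72, 0.36]` and `[0.6, 0.3, 0.1]`, a half-loaded GPU and `max_tokens=8` gives a budget of 4 and the allocation `[3, 1]`; at 75% load the budget drops to 2 and only the confident request gets verified, as `[2, 0]`.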

Rohan Paul (@rohanpaul_ai): Paper: https://t.co/RWChEJCoeV
GitHub: https://t.co/mbcGp0bXW1
HF: https://t.co/JEIN9WDknv