AI Curated Feed · Smart Score: 60

JetSpec Speculative Decoding Technique Released

Source: Twitter following list

Author: Hao AI Lab (@haoailab)

Published: 2026-06-25

Collected: 2026-06-28

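Why the AI Recommends It

What sets this work apart is its causal parallel tree structure and the high-performance numbers it reports from real deployment. The original post is worth reading for the implementation details.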

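Key Takeaways

Hao AI Lab introduces JetSpec, which scales speculative decoding trees by preserving causality and using lightweight causal parallel decoding heads. It reaches up to 1000 tokens per second (TPS) on B200 GPUs for both Qwen3-8B and Qwen3-30B-A3B, with up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat, while remaining lossless.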

Full Text

Hao AI Lab (@haoailab) reposted a post by Lanxiang Hu (@Lanxiang_Hu):

We introduce JetSpec, which pushes speculative decoding acceptance length and TPS to a new frontier. We find that (1) preserving causality and (2) employing lightweight causal parallel decoding heads are key to scaling speculative decoding with trees, thereby converting sparse FLOPs more effectively into real end-to-end speedup.

With serving-engine support, JetSpec achieves up to 1000 TPS on B200 GPUs for both Qwen3-8B and Qwen3-30B-A3B. Check out our project page and blog for more details!

Project page: https://t.co/9ebBhoMWBI

Blog: https://t.co/N0wDKsUBHT

> **Quoted post from Hao AI Lab (@haoailab):**
>
> Introducing JetSpec: we find speculative decoding can push LLM generation latency to extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting.
>
> JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while keeping lossless. With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200. ⚡️
>
> Check out our project page for demos and a blog post on how we built it 👇
>
> https://t.co/M4T8jOBWQ8
>
> https://t.co/h9uipDbTuh
>
> https://x.com/haoailab/status/2070225035403694408
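To make "acceptance length" and "lossless" concrete, here is a minimal sketch of greedy verification over a draft token tree. It illustrates generic tree-based speculative decoding, not JetSpec's implementation: the names `DraftTree`, `verify_tree_greedy`, and `toy_target_next` are hypothetical, and the toy target model is a stand-in for a real LLM. A real serving engine would typically score all tree nodes in one batched target pass; the sketch queries the target one node at a time for readability.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical draft tree: maps a path of already-drafted tokens
# to the candidate next tokens proposed at that node.
DraftTree = Dict[Tuple[int, ...], List[int]]


def verify_tree_greedy(
    prefix: Sequence[int],
    tree: DraftTree,
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted by one verification step.

    At each tree node the target model's greedy token is emitted.
    If a draft child matches it, verification descends into that child;
    otherwise the step ends. Every emitted token is the target's own
    greedy choice, so the output matches plain greedy decoding.
    """
    emitted: List[int] = []
    path: Tuple[int, ...] = ()
    while True:
        expected = target_next(list(prefix) + emitted)
        emitted.append(expected)
        if expected in tree.get(path, []):
            path = path + (expected,)  # draft token accepted, go deeper
        else:
            break  # mismatch or leaf: the last token is the target's correction
    return emitted


def toy_target_next(context: List[int]) -> int:
    """Toy stand-in for a target model (assumption): last token + 1, mod 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    demo_tree: DraftTree = {
        (): [2, 7],      # two candidate first tokens
        (2,): [3, 9],    # continuations after drafting 2
        (2, 3): [8],     # continuation after drafting 2, 3
        (7,): [8],
    }
    tokens = verify_tree_greedy([1], demo_tree, toy_target_next)
    print("emitted tokens:", tokens)
    print("accepted draft tokens:", len(tokens) - 1)
```

A wider or deeper tree raises the chance that some branch matches the target, which lengthens the accepted run per step, but it also costs more drafting and verification compute. That trade-off between drafting cost and drafting quality is the one the post says JetSpec co-optimizes.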

#TechnicalBreakthrough #Innovation
