AI Curated Updates
Smart score: 75
JetSpec speedup
Why AI recommends this
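JetSpec is a new speculative decoding method built on causal parallel tree drafting, with CUDA graph and kernel optimizations, that significantly lowers inference latency while remaining lossless.
Key takeaways
Hao AI Lab introduces JetSpec, which reports up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat, and around 1000 TPS on a single B200, while remaining lossless. In the reposted comment, Teortaxes describes it as smarter and stronger than previous speculative decoding and block diffusion approaches.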
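The post gives no implementation details, so the following is only a minimal sketch of the generic draft-then-verify loop behind speculative decoding, showing why greedy verification is lossless: every emitted token is the one the target model would have chosen anyway. It uses a linear draft chain rather than JetSpec's causal parallel tree drafting, and `target_next`, `draft_next`, `autoregressive`, and `speculative` are hypothetical toy functions, not part of JetSpec or any library.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy deterministic stand-in for the large target model (assumption)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with the target except when len(ctx) % 4 == 0."""
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 50


def autoregressive(prompt: List[Token], n: int, target: NextFn) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative(prompt: List[Token], n: int, target: NextFn,
                draft: NextFn, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding with a linear chain of k drafted tokens."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < n:
        # 1) Draft k tokens cheaply, each conditioned on the previous drafts.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) Verify. A real system scores all positions in one batched target
        #    forward pass; here the target is simply called per position.
        verify_passes += 1
        for tok in proposal:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # first mismatch: emit the target's token
                break
            out.append(tok)           # match: the drafted token is accepted
        else:
            out.append(target(out))   # all k accepted: one bonus target token
    return out[len(prompt):len(prompt) + n], verify_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = autoregressive(prompt, 20, target_next)
    tokens, passes = speculative(prompt, 20, target_next, draft_next, k=4)
    print(tokens == reference, passes)
```

In this toy setup the speedup comes from needing fewer verification passes than generated tokens; how much of that turns into wall-clock gain depends on drafting cost and acceptance rate, which is the trade-off the post says JetSpec co-optimizes.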
Full text
Hao AI Lab (@haoailab) reposted a post by Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex):
I want to bring your attention to JetSpec because it looks strictly smarter and stronger than previous speculative decoding and block diffusion approaches (yes, again).
Avg 1000 t/s single stream with Qwen-8B on B200. Basically, you can better utilize compute at any batch size. https://t.co/OFK1dY8kmX

> **Quoted post by Hao AI Lab (@haoailab):**
> Introducing JetSpec: we find speculative decoding can push LLM generation latency to extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting.
> JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while keeping lossless. With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200. ⚡️
> Check out our project page for demos and a blog post on how we built it 👇
> https://t.co/M4T8jOBWQ8
> https://t.co/h9uipDbTuh
> https://x.com/haoailab/status/2070225035403694408