AI Featured Updates
Smart score: 92
JetSpec reaches around 1000 TPS inference: Qwen3-8B (8 billion parameters) on a single B200
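Why AI recommends this
The reported gains quantitatively exceed current inference solutions; worth reproducing and validating in deployment.
Key takeaways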
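Hao AI Lab's JetSpec project accelerates LLM generation with speculative decoding, using causal parallel tree drafting to co-optimize drafting cost and drafting quality. It reports up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, it averages around 1000 TPS on a single B200 running Qwen3-8B (MATH-500, batch 1, budget 128).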
Full text
Introducing JetSpec: we find speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting.
JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, this further translates to around 1000 TPS on a single B200. ⚡️
Check out our project page for demos and a blog post on how we built it 👇
https://t.co/M4T8jOBWQ8
https://t.co/h9uipDbTuh
https://video.twimg.com/amplify_video/2070225027916812288/vid/avc1/860x544/_tThGGv5v7n2JS6r.mp4?tag=28
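To make the "lossless" claim above concrete, here is a minimal sketch of greedy verification over a draft tree. It is not JetSpec's code: the function name, the dictionary-based tree layout, and the toy `target_next` model are all assumptions made for illustration. The idea is that a drafted token is accepted only when it equals what the target model would have produced anyway, so the emitted sequence matches plain greedy decoding.

```python
from typing import Callable, Dict, List, Sequence


def verify_tree_greedy(
    prefix: Sequence[int],
    children: Dict[int, List[int]],
    tokens: Dict[int, int],
    target_next: Callable[[Sequence[int]], int],
    root: int = -1,
) -> List[int]:
    """Walk a draft tree and return the tokens emitted by one verify step.

    children maps a node id to its child node ids (root is a virtual node, -1).
    tokens maps a node id to the token drafted at that node.
    target_next is a stand-in for the target model's greedy next token.
    A real system would score every tree node in one batched forward pass;
    this sketch calls the target sequentially to keep the logic visible.
    """
    emitted: List[int] = []
    node = root
    while True:
        # The token the target model would emit after prefix + emitted.
        want = target_next(list(prefix) + emitted)
        emitted.append(want)
        # Continue only if some drafted child guessed exactly that token.
        match = next((c for c in children.get(node, []) if tokens[c] == want), None)
        if match is None:
            return emitted
        node = match


# Toy target model (hypothetical): next token is last token + 1, modulo 10.
def target_next(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


# Hypothetical draft tree: root -> {0, 1}, node 0 -> {2, 3}, node 2 -> {4}.
children = {-1: [0, 1], 0: [2, 3], 2: [4]}
tokens = {0: 2, 1: 5, 2: 3, 3: 9, 4: 7}

emitted = verify_tree_greedy([1], children, tokens, target_next)
print(emitted)  # [2, 3, 4]
```

In this toy run two drafted tokens (2 and 3) are accepted and the third emitted token (4) comes from the target itself, so one verify step yields three tokens that are identical to what three sequential greedy steps would give.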
Hao AI Lab (@haoailab): [7/8] JetSpec also comes with a serving integration.
We integrate JetSpec into vLLM and also provide a lightweight serving engine, which organizes candidates as speculative trees, passes tree metadata into verification, and implements paged FlashAttention kernels that apply the tree mask directly without materializing dense masks.
On a single B200 with Qwen3-8B, MATH-500, batch 1, budget 128, JetSpec reaches an average of around 1000 TPS as shown in the demo.
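The tree mask mentioned in the serving integration can be illustrated with a small CPU sketch. It assumes the tree metadata is an array of parent indices, which is a common encoding but not confirmed as JetSpec's; both function names are hypothetical. `tree_attention_mask` builds the dense mask for reference, while `can_attend` answers the same question on the fly from the parent array, which is the general idea behind applying a tree mask without materializing it. The actual kernels run on the GPU inside paged FlashAttention and are not reproduced here.

```python
from typing import List


def can_attend(parents: List[int], i: int, j: int) -> bool:
    """True if draft node i may attend to draft node j.

    A node attends to itself and to its ancestors in the draft tree.
    parents[k] is the parent index of node k, or -1 if k hangs off the prefix.
    """
    node = i
    while node != -1:
        if node == j:
            return True
        node = parents[node]
    return False


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Dense reference mask: mask[i][j] is True when node i may attend to node j."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        node = i
        while node != -1:
            mask[i][node] = True
            node = parents[node]
    return mask


# Hypothetical draft tree: nodes 0 and 1 hang off the prefix,
# nodes 2 and 3 are children of node 0, node 4 is a child of node 2.
parents = [-1, -1, 0, 0, 2]
mask = tree_attention_mask(parents)

for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

The dense mask needs memory quadratic in the number of draft nodes, whereas the parent array is linear, which is one reason a kernel might prefer to derive mask entries from tree metadata.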
Hao AI Lab (@haoailab): [8/8] Our paper, code and checkpoints are now available on GitHub and HF!
We also provide a vLLM integration and a lightweight inference engine if you want to test it in a production-level environment. Give it a try; we would love to hear your feedback! 😎
Project: https://t.co/M4T8jOBWQ8
GitHub: https://t.co/Oj92K5fwAR
Models: https://t.co/jJxWJ8YjYC
Blog: https://t.co/h9uipDbTuh
Paper: https://t.co/r3H8o01870