AI Featured Updates · Smart Score: 92

JetSpec reaches around 1000 TPS: Qwen3-8B inference hits the thousand-TPS range on a single B200

Source: Twitter following list
Author: Hao AI Lab (@haoailab)
Published: 2026-06-25
Collected: 2026-06-25
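Why the AI recommends it
Reports quantified gains over current inference solutions; worth reproducing and validating in deployment.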
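Key takeaways
Hao AI Lab's JetSpec applies speculative decoding with causal parallel tree drafting, co-optimizing drafting cost and drafting quality. It reports up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, it averages around 1000 TPS for Qwen3-8B on a single B200 (MATH-500, batch size 1, budget 128).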
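To make the "lossless" claim concrete, here is a minimal sketch of greedy verification over a draft tree, the generic idea behind tree-based speculative decoding. It is not JetSpec's code: the function name `verify_tree_greedy`, the parent-pointer tree layout, and the precomputed `target_argmax` list are all assumptions made for illustration, and sampling-based verification is not covered.

```python
from typing import Dict, List, Sequence


def verify_tree_greedy(
    tokens: Sequence[int],
    parents: Sequence[int],
    target_argmax: Sequence[int],
) -> List[int]:
    """Return the tokens to commit after one verification pass.

    tokens[i]        -- token at draft node i; node 0 is the root, i.e. the
                        last token that was already committed.
    parents[i]       -- parent index of node i (-1 for the root).
    target_argmax[i] -- hypothetical input: the target model's greedy next
                        token after the path that ends at node i.
    """
    # Index children by parent so each step only scans one node's children.
    children: Dict[int, List[int]] = {}
    for node, parent in enumerate(parents):
        if parent >= 0:
            children.setdefault(parent, []).append(node)

    accepted: List[int] = []
    current = 0
    while True:
        wanted = target_argmax[current]
        match = next(
            (c for c in children.get(current, []) if tokens[c] == wanted),
            None,
        )
        # Every committed token is the target model's own greedy choice,
        # so the result matches plain greedy decoding with the target model.
        accepted.append(wanted)
        if match is None:
            break  # no draft child matched; `wanted` is the correction token
        current = match
    return accepted


# Draft tree: root 7 has children 3 and 5; 3 -> 9, 5 -> 4.
committed = verify_tree_greedy(
    tokens=[7, 3, 5, 9, 4],
    parents=[-1, 0, 0, 1, 2],
    target_argmax=[5, 0, 4, 0, 8],
)
print(committed)  # [5, 4, 8]
```

In this toy run, two drafted tokens are accepted and one extra token comes from the target model, so a single verification pass commits three tokens instead of one.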
Full text
Introducing JetSpec: we find speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting. JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200. ⚡️ Check out our project page for demos and a blog post on how we built it 👇 https://t.co/M4T8jOBWQ8 https://t.co/h9uipDbTuh https://video.twimg.com/amplify_video/2070225027916812288/vid/avc1/860x544/_tThGGv5v7n2JS6r.mp4?tag=28

Hao AI Lab (@haoailab): [7/8] JetSpec also comes with a serving integration. We integrate JetSpec into vLLM and also provide a lightweight serving engine, which organizes candidates as speculative trees, passes tree metadata into verification, and implements paged FlashAttention kernels that apply the tree mask directly without materializing dense masks. On a single B200 with Qwen3-8B, MATH-500, batch 1, budget 128, JetSpec reaches an average of around 1000 TPS, as shown in the demo.

Hao AI Lab (@haoailab): [8/8] Our paper, code, and checkpoints are now available on GitHub and HF! We also provide a vLLM integration and a lightweight inference engine if you want to test it in a production-level environment. Give it a try; we would love to hear your feedback! 😎
Project: https://t.co/M4T8jOBWQ8
GitHub: https://t.co/Oj92K5fwAR
Models: https://t.co/jJxWJ8YjYC
Blog: https://t.co/h9uipDbTuh
Paper: https://t.co/r3H8o01870
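For readers unfamiliar with the "tree mask" mentioned in [7/8], the sketch below shows what such a mask encodes: each draft node may attend only to itself and its ancestors in the speculative tree. This is an illustration only. The function name `tree_attention_mask` and the parent-pointer layout are assumptions, and the sketch deliberately builds the dense mask that, per the thread, JetSpec's kernels avoid materializing.

```python
from typing import List, Sequence


def tree_attention_mask(parents: Sequence[int]) -> List[List[bool]]:
    """Build a dense ancestor mask for a draft tree.

    parents[i] is the parent index of node i, or -1 for the root.
    mask[q][k] is True when node q may attend to node k, which holds
    exactly when k is q itself or one of q's ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for node in range(n):
        ancestor = node
        # Walk up the parent pointers until we step past the root.
        while ancestor != -1:
            mask[node][ancestor] = True
            ancestor = parents[ancestor]
    return mask


# Tree: 0 is the root; 1 and 2 are its children; 3 is under 1; 4 is under 2.
mask = tree_attention_mask([-1, 0, 0, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because sibling branches never see each other, all candidate paths can be scored by the target model in one forward pass. A fused kernel can derive the same visibility from compact tree metadata rather than from an n-by-n mask, which is the optimization the thread describes.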
#ModelRelease #TechBreakthrough #AIIndustry