AI Featured Updates
Smart score: 60
SGLang Optimizes Ling-2.6-1T Performance
Why AI recommends this
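What sets it apart: compared with conventional MoE inference, this optimization uses a Pallas kernel to deeply overlap compute and communication, and the technical approach is worth a close look.
Key takeaways
The SGLang team worked with Ant Ling to optimize inference for Ling-2.6-1T (a 1T-parameter hybrid MoE model) on TPU v7x. Using techniques such as the Fused MoE V2 kernel, they cut MoE prefill latency by 53% and reached up to 1.77x higher decode throughput on a 16-chip TPU v7x slice than on a similar H200 cluster.
Full text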
It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳
The resulting performance gains speak for themselves:
- 53% reduction in MoE pre-fill latency
- Up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster
A significant milestone in efficient MoE scaling and hardware utilization!
> **Quoted post from LMSYS Org (@lmsysorg):**
> 🚀 Our new blog: Optimizing Ling-2.6-1T on TPU with SGLang-JAX: Hiding MoE Data Movement Behind Compute with One Pallas Kernel
> Ling-2.6-1T, a 1T hybrid MoE model, now serves on TPU v7x with SGLang-JAX. The SGLang-JAX team worked together with @inclusionAI on two fronts: upgrading the fused MoE kernel for deeper compute/comms overlap, and bringing up the full hybrid backbone.
> 1️⃣ Fused MoE V2: keeps tokens + accumulators VMEM-resident and double-buffers expert weights, hiding routing & prefetch behind compute → MoE prefill −53%
> 2️⃣ Hybrid memory pools: per-token MLA KV for 10 full-attn layers + per-request recurrent state for 70 GLA layers
> 3️⃣ GLA linear attention via chunk-wise parallel prefill
> 4️⃣ Single-controller DP keeps grouped RMSNorm chip-local, no per-layer cross-chip reduce
> https://x.com/lmsysorg/status/2067293183219003663
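Item 3 of the quoted post mentions chunk-wise parallel prefill for GLA linear attention. The following is a minimal pure-Python sketch of the general idea, not the SGLang-JAX kernel. It assumes a constant scalar `decay` in place of GLA's data-dependent gates, and the function names `recurrent` and `chunkwise` are hypothetical. It shows that processing a prompt chunk by chunk, carrying only a small state matrix `S` between chunks, reproduces the token-by-token recurrence.

```python
def recurrent(q, k, v, decay):
    """Token-by-token reference: S_t = decay * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    dk, dv = len(k[0]), len(v[0])
    S = [[0.0] * dv for _ in range(dk)]
    out = []
    for qt, kt, vt in zip(q, k, v):
        S = [[decay * S[a][b] + kt[a] * vt[b] for b in range(dv)]
             for a in range(dk)]
        out.append([sum(qt[a] * S[a][b] for a in range(dk)) for b in range(dv)])
    return out


def chunkwise(q, k, v, decay, chunk):
    """Same result, computed one chunk at a time (assumed simplification)."""
    dk, dv = len(k[0]), len(v[0])
    S = [[0.0] * dv for _ in range(dk)]  # state carried across chunks
    out = []
    for start in range(0, len(q), chunk):
        qc = q[start:start + chunk]
        kc = k[start:start + chunk]
        vc = v[start:start + chunk]
        n = len(qc)
        # Each position i depends only on S and this chunk's tokens, so the
        # loop over i could run in parallel (as masked matmuls on hardware).
        for i in range(n):
            # Inter-chunk term: contribution of everything before this chunk.
            o = [decay ** (i + 1) * sum(qc[i][a] * S[a][b] for a in range(dk))
                 for b in range(dv)]
            # Intra-chunk term: causal attention over positions j <= i.
            for j in range(i + 1):
                w = decay ** (i - j) * sum(qc[i][a] * kc[j][a] for a in range(dk))
                for b in range(dv):
                    o[b] += w * vc[j][b]
            out.append(o)
        # Fold the whole chunk into the carried state in one step.
        S = [[decay ** n * S[a][b]
              + sum(decay ** (n - 1 - j) * kc[j][a] * vc[j][b] for j in range(n))
              for b in range(dv)]
             for a in range(dk)]
    return out
```

In this sketch the size of `S` is fixed by the key and value dimensions and does not grow with prompt length. That is one plausible reading of why the post pairs the GLA layers with a per-request recurrent state rather than a per-token cache.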
Ant Ling (@AntLingAGI): Through an invaluable exchange of technical insights, we were able to engineer a custom Fused MoE V2 Pallas kernel for TPU v7x that effectively overlaps token routing and HBM weight prefetching with the compute window.
This marks a significant milestone in efficient MoE scaling and hardware utilization. Read our full technical deep-dive into the architecture and optimizations below. 👇
https://t.co/PdOHpY4PeK
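The posts above describe double-buffering expert weights so that HBM weight prefetching overlaps with the compute window. The following toy timing model, written in plain Python, illustrates why that helps. It is only a sketch under stated assumptions: the function names are hypothetical, the per-expert fetch and compute times are made up, and it models one copy engine, one compute unit, and two weight buffers rather than the actual Pallas kernel.

```python
def serial_time(load, compute):
    """Each expert's weights are fetched, then used, with no overlap."""
    return float(sum(l + c for l, c in zip(load, compute)))


def double_buffered_time(load, compute):
    """Two weight buffers: expert i+1 is fetched while expert i computes."""
    n = len(load)
    if n == 0:
        return 0.0
    load_done = [0.0] * n
    compute_done = [0.0] * n
    for i in range(n):
        # The single copy engine is busy until the previous fetch ends.
        copy_free = load_done[i - 1] if i >= 1 else 0.0
        # Expert i reuses the buffer of expert i-2, which must be finished.
        buffer_free = compute_done[i - 2] if i >= 2 else 0.0
        load_done[i] = max(copy_free, buffer_free) + load[i]
        # Compute needs its own weights and the previous expert to be done.
        unit_free = compute_done[i - 1] if i >= 1 else 0.0
        compute_done[i] = max(load_done[i], unit_free) + compute[i]
    return compute_done[-1]


if __name__ == "__main__":
    load = [2.0] * 4      # made-up fetch time per expert
    compute = [3.0] * 4   # made-up compute time per expert
    print(serial_time(load, compute), double_buffered_time(load, compute))
```

With these made-up numbers the serial schedule takes 20 time units and the double-buffered one takes 14, because only the first fetch is left exposed. When fetches are slower than compute, the same model shows the pipeline becoming fetch-bound instead.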