AI Curated Updates · Smart Score: 60

SGLang Optimizes Ling-2.6-1T Performance

Source: Twitter following list

Author: Ant Ling (@AntLingAGI)

Published: 2026-06-18

Indexed: 2026-06-18

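Why AI Recommends This

What sets it apart: unlike conventional MoE inference, this optimization uses a Pallas kernel to deeply overlap compute with communication. The technical approach is worth a close look.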

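Key Takeaways

The SGLang team and Ant Ling collaborated to optimize inference for Ling-2.6-1T (a 1T-parameter hybrid MoE model) on TPU v7x. Using techniques such as the Fused MoE V2 kernel, they cut MoE pre-fill latency by 53% and reached up to 1.77x higher decode throughput on a 16-chip TPU v7x slice than a similar H200 cluster.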

Full Text

It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳 The resulting performance gains speak for themselves:

- 53% reduction in MoE pre-fill latency
- Up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster

A significant milestone in efficient MoE scaling and hardware utilization!

> **Quoted post, LMSYS Org (@lmsysorg):**
>
> 🚀 Our new blog: Optimizing Ling-2.6-1T on TPU with SGLang-JAX: Hiding MoE Data Movement Behind Compute with One Pallas Kernel
>
> Ling-2.6-1T, a 1T hybrid MoE model, now serves on TPU v7x with SGLang-JAX. The SGLang-JAX team worked together with @inclusionAI on two fronts: upgrading the fused MoE kernel for deeper compute/comms overlap, and bringing up the full hybrid backbone.
>
> 1️⃣ Fused MoE V2: keeps tokens + accumulators VMEM-resident and double-buffers expert weights, hiding routing & prefetch behind compute → MoE prefill −53%
>
> 2️⃣ Hybrid memory pools: per-token MLA KV for 10 full-attn layers + per-request recurrent state for 70 GLA layers
>
> 3️⃣ GLA linear attention via chunk-wise parallel prefill
>
> 4️⃣ Single-controller DP keeps grouped RMSNorm chip-local, no per-layer cross-chip reduce
>
> https://x.com/lmsysorg/status/2067293183219003663

Ant Ling (@AntLingAGI): Through an invaluable exchange of technical insights, we were able to engineer a custom Fused MoE V2 Pallas kernel for TPU v7x that effectively overlaps token routing and HBM weight prefetching with the compute window. This marks a significant milestone in efficient MoE scaling and hardware utilization. Read our full technical deep-dive into the architecture and optimizations below. 👇 https://t.co/PdOHpY4PeK
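The double-buffering idea in the Fused MoE V2 item can be pictured with a toy timing model. This is only a rough sketch under stated assumptions, not the actual Pallas kernel: the function names `serial_time` and `double_buffered_time` and all per-expert costs are hypothetical, and the model assumes exactly two weight buffers, so the weights for expert i+1 can be fetched while expert i computes.

```python
def serial_time(load, compute):
    """No overlap: each expert's weights are loaded, then used."""
    return sum(load) + sum(compute)


def double_buffered_time(load, compute):
    """Two buffers: prefetch expert i+1 while expert i is computing."""
    if not load:
        return 0.0
    total = load[0]  # the very first load has no compute to hide behind
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0.0
        # A step ends only when both the compute and the prefetch are done.
        total += max(compute[i], next_load)
    return total


# Hypothetical per-expert costs in arbitrary time units (not measured values).
load = [2.0, 2.0, 2.0, 2.0]
compute = [3.0, 3.0, 3.0, 3.0]

print(serial_time(load, compute))           # 20.0
print(double_buffered_time(load, compute))  # 14.0
```

In this toy model the prefetch cost disappears whenever compute per expert is at least as long as the next load. When loads dominate, the step time is bounded by the load instead, so overlap helps less.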

#TechBreakthrough #Models #DeveloperTools
