AI Featured Updates · Smart score: 60

SGLang Optimizes Ling-2.6-1T Performance

Source: Twitter following list
Author: Ant Ling (@AntLingAGI)
Published: 2026-06-18
Collected: 2026-06-18
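AI Recommendation Rationale
What sets it apart: compared with conventional MoE inference, this optimization uses a Pallas kernel to deeply overlap computation with communication. The technical approach is worth a close look.
Key Takeaways
The SGLang team and Ant Ling collaborated to optimize the inference performance of Ling-2.6-1T (a 1T-parameter hybrid MoE model) on TPU v7x. Using the Fused MoE V2 kernel and related techniques, they cut MoE pre-fill latency by 53% and reached up to 1.77x higher decode throughput on a 16-chip TPU v7x slice than on a similar H200 cluster.
Full Text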
It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳 The resulting performance gains speak for themselves:

- 53% reduction in MoE pre-fill latency
- Up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster

A significant milestone in efficient MoE scaling and hardware utilization!

> **Quoted post, LMSYS Org (@lmsysorg):**
> 🚀 Our new blog: Optimizing Ling-2.6-1T on TPU with SGLang-JAX: Hiding MoE Data Movement Behind Compute with One Pallas Kernel
> Ling-2.6-1T, a 1T hybrid MoE model, now serves on TPU v7x with SGLang-JAX. The SGLang-JAX team worked together with @inclusionAI on two fronts: upgrading the fused MoE kernel for deeper compute/comms overlap, and bringing up the full hybrid backbone.
> 1️⃣ Fused MoE V2: keeps tokens + accumulators VMEM-resident and double-buffers expert weights, hiding routing & prefetch behind compute → MoE prefill −53%
> 2️⃣ Hybrid memory pools: per-token MLA KV for 10 full-attn layers + per-request recurrent state for 70 GLA layers
> 3️⃣ GLA linear attention via chunk-wise parallel prefill
> 4️⃣ Single-controller DP keeps grouped RMSNorm chip-local, no per-layer cross-chip reduce
> https://x.com/lmsysorg/status/2067293183219003663

Ant Ling (@AntLingAGI): Through an invaluable exchange of technical insights, we were able to engineer a custom Fused MoE V2 Pallas kernel for TPU v7x that effectively overlaps token routing and HBM weight prefetching with the compute window. This marks a significant milestone in efficient MoE scaling and hardware utilization. Read our full technical deep-dive into the architecture and optimizations below. 👇 https://t.co/PdOHpY4PeK
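
The quoted post mentions "GLA linear attention via chunk-wise parallel prefill" without showing the math. As a rough illustration of the general idea only, the sketch below uses a simplified linear attention with one scalar decay gate per token and checks that processing the prompt in chunks gives the same outputs as the token-by-token recurrence. Everything here is an assumption made for illustration: the function names (`recurrent_prefill`, `chunked_prefill`, `demo_inputs`), the scalar gate, and the pure-Python lists are hypothetical and are not taken from SGLang-JAX or the Ling-2.6-1T implementation, which the post says runs as Pallas kernels on TPU.

```python
import random

# Illustrative sketch only: scalar-gated linear attention in pure Python.
# Not the SGLang-JAX kernel; all names and shapes are assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def outer(k, v):
    # k: [dk], v: [dv] -> [dk][dv]
    return [[ki * vj for vj in v] for ki in k]

def vec_mat(q, S):
    # q: [dk], S: [dk][dv] -> [dv]
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]

def recurrent_prefill(q, k, v, g):
    """Reference: S_t = g_t * S_{t-1} + k_t^T v_t, o_t = q_t S_t, one token at a time."""
    dk, dv = len(k[0]), len(v[0])
    S = [[0.0] * dv for _ in range(dk)]
    out = []
    for qt, kt, vt, gt in zip(q, k, v, g):
        kv = outer(kt, vt)
        S = [[gt * S[r][c] + kv[r][c] for c in range(dv)] for r in range(dk)]
        out.append(vec_mat(qt, S))
    return out, S

def chunked_prefill(q, k, v, g, chunk):
    """Same result, but the recurrent state is only carried across chunk boundaries.

    Gates must be nonzero because decay ratios b[i] / b[j] are formed by division.
    """
    dk, dv = len(k[0]), len(v[0])
    S = [[0.0] * dv for _ in range(dk)]
    out = []
    for start in range(0, len(q), chunk):
        end = min(start + chunk, len(q))
        # b[i] = product of gates from the chunk start through token i (inclusive).
        b, acc = [], 1.0
        for t in range(start, end):
            acc *= g[t]
            b.append(acc)
        for i in range(end - start):
            qi = q[start + i]
            # Inter-chunk term: carried state, decayed to position i.
            o = [b[i] * x for x in vec_mat(qi, S)]
            # Intra-chunk term: causal attention over tokens j <= i in this chunk.
            for j in range(i + 1):
                w = (b[i] / b[j]) * dot(qi, k[start + j])
                o = [oc + w * vc for oc, vc in zip(o, v[start + j])]
            out.append(o)
        # Advance the carried state to the end of the chunk.
        new_S = [[b[-1] * S[r][c] for c in range(dv)] for r in range(dk)]
        for j in range(end - start):
            w = b[-1] / b[j]
            kv = outer(k[start + j], v[start + j])
            new_S = [[new_S[r][c] + w * kv[r][c] for c in range(dv)] for r in range(dk)]
        S = new_S
    return out, S

def demo_inputs(n, dk, dv, seed=0):
    # Small deterministic random inputs; gates are kept in [0.5, 1.0) so they are nonzero.
    rng = random.Random(seed)
    q = [[rng.uniform(-1, 1) for _ in range(dk)] for _ in range(n)]
    k = [[rng.uniform(-1, 1) for _ in range(dk)] for _ in range(n)]
    v = [[rng.uniform(-1, 1) for _ in range(dv)] for _ in range(n)]
    g = [rng.uniform(0.5, 1.0) for _ in range(n)]
    return q, k, v, g
```

Calling `chunked_prefill(q, k, v, g, chunk=4)` on inputs from `demo_inputs` should match `recurrent_prefill(q, k, v, g)` up to floating-point rounding. In a real kernel the per-chunk loops would typically be expressed as dense matrix multiplications, which is what makes the chunked form attractive for prefill; the nested Python loops here only make the arithmetic explicit.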
#TechnicalBreakthrough #Models #DeveloperTools