AI Curated Feed, Smart Score: 60

GLM-5.2 4-bit Quantization Speedup

Source: Twitter following list

Author: Han Xiao (@hxiao)

Published: 2026-06-29

Collected: 2026-06-29

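AI Recommendation Rationale

This quantization scheme keeps the MTP speculative-decoding head and delivers a significant speedup at low batch sizes, making it a useful reference implementation.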

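Key Takeaways

Canada Quant Labs quantized GLM-5.2 (744B MoE) to 4-bit while keeping the MTP draft head in BF16. According to the post, quality matches the FP8 release, the model runs on 4×H200 instead of 8, and at batch-1 it is 69–79% faster than AWQ / NVFP4.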
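
The 4×H200 versus 8×H200 claim can be sanity-checked with rough weight-memory arithmetic. The sketch below is a back-of-the-envelope estimate, not the authors' sizing method: it counts weights only and assumes the 141 GB HBM capacity of an NVIDIA H200. The helper names `weight_gb` and `fits` are hypothetical.

```python
# Rough weight-only memory estimate. This ignores KV cache, activations,
# quantization scales/zero-points, and the BF16 MTP head, so real
# headroom is smaller than these numbers suggest.
PARAMS = 744e9   # total parameter count stated in the post (744B MoE)
H200_GB = 141    # assumed HBM capacity of one NVIDIA H200, in GB


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight size in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def fits(params: float, bits_per_weight: float, n_gpus: int,
         gpu_gb: float = H200_GB) -> bool:
    """True if the weights alone fit in the combined memory of n_gpus."""
    return weight_gb(params, bits_per_weight) <= n_gpus * gpu_gb


if __name__ == "__main__":
    for label, bits in (("FP8", 8), ("4-bit", 4)):
        size = weight_gb(PARAMS, bits)
        print(f"{label}: ~{size:.0f} GB of weights, "
              f"fits on 4 GPUs: {fits(PARAMS, bits, 4)}, "
              f"fits on 8 GPUs: {fits(PARAMS, bits, 8)}")
```

Under these assumptions, the FP8 weights (about 744 GB) exceed the 564 GB available on four H200s, while the 4-bit weights (about 372 GB) fit with room left for the KV cache.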

Full Text

Canadian bro so low key, this GLM-5.2-MTP tweet got zero likes?? Claimed 40% higher tok/s.

> **Quoted post, Canada Quant Labs (@canadaquant):**
> We quantized GLM-5.2 (744B MoE) to 4-bit — and kept its MTP draft head in BF16.
> → Matches the FP8 release on quality
> → Runs on 4×H200 instead of 8
> → Fastest 4-bit GLM-5.2 at int conc: +69–79% vs AWQ / NVFP4 at batch-1, from MTP speculative decoding
> 👇
> https://t.co/QunrvTrmfb
> https://x.com/canadaquant/status/2071435463928525108
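
The batch-1 speedup is attributed to MTP speculative decoding, in which a cheap draft head proposes several tokens and the full model verifies them. The toy sketch below illustrates the generic greedy draft-then-verify loop only. It is not Canada Quant Labs' implementation, and `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names standing in for real models.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextFn, draft: NextFn,
                     ctx: List[Token], k: int) -> List[Token]:
    """One draft-then-verify step with greedy acceptance.

    Returns the tokens to append to ctx. The result matches what the
    target alone would produce; the draft only changes how many tokens
    each step yields.
    """
    # 1. The draft proposes k tokens autoregressively (cheap).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))

    # 2. The target checks each position. A real engine scores all k
    #    positions in one forward pass; this sketch loops for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        want = target(ctx + accepted)
        if want != tok:
            accepted.append(want)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. All k proposals matched, so the verify pass yields a bonus token.
    accepted.append(target(ctx + accepted))
    return accepted


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the full model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the draft head: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

When the draft agrees with the target, one verification step emits up to `k + 1` tokens instead of one. At batch-1, decoding is typically limited by reading the weights once per step, which is why more tokens per step can translate into higher tok/s.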

#TechUpdates #Models #DeveloperTools
