AI featured updates, smart score: 60

GLM-5.2 4-bit quantization speedup

Source: Twitter following list
Author: Han Xiao (@hxiao)
Published: 2026-06-29
Collected: 2026-06-29
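Why AI recommends this
This quantization scheme keeps the MTP speculative-decoding head and delivers a notable speedup at low batch sizes, making it a useful reference implementation.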
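Key takeaways
Canada Quant Labs quantized GLM-5.2 (744B MoE) to 4-bit while keeping the MTP draft head in BF16. They report that quality matches the FP8 release, that the model runs on 4×H200 instead of 8, and that at batch-1 it is 69-79% faster than AWQ / NVFP4.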
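The 4×H200 versus 8 claim can be sanity-checked with rough weight-memory arithmetic. The sketch below is not from the original post. It assumes 141 GB of memory per H200, counts weight bytes only (ignoring quantization scales, the BF16 MTP head, KV cache, and activations), reserves a hypothetical 20% headroom, and rounds the GPU count up to a power of two, as tensor-parallel deployments commonly do. The function names are illustrative.

```python
import math

H200_GB = 141  # memory per H200 in GB (public spec; assumed here)


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def min_gpus(params_billion: float, bits_per_weight: float,
             gpu_gb: float = H200_GB, usable_fraction: float = 0.8) -> int:
    """Smallest power-of-two GPU count whose usable memory holds the weights.

    usable_fraction is a hypothetical headroom factor for KV cache and
    activations; it is an assumption, not a measured value.
    """
    needed = math.ceil(weight_gb(params_billion, bits_per_weight)
                       / (gpu_gb * usable_fraction))
    # Round up to the next power of two (assumed tensor-parallel degree).
    return 1 << (needed - 1).bit_length()


if __name__ == "__main__":
    for label, bits in (("4-bit", 4), ("FP8", 8)):
        print(label, weight_gb(744, bits), "GB ->", min_gpus(744, bits), "x H200")
```

Under these assumptions the 4-bit weights come to about 372 GB and the FP8 weights to about 744 GB, which is consistent with the reported drop from 8 GPUs to 4.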
Full text
Canadian bro so low key, this GLM-5.2-MTP tweet got zero likes?? Claimed 40% higher tok/s.

> **Quoted post, Canada Quant Labs (@canadaquant):**
> We quantized GLM-5.2 (744B MoE) to 4-bit, and kept its MTP draft head in BF16.
> → Matches the FP8 release on quality
> → Runs on 4×H200 instead of 8
> → Fastest 4-bit GLM-5.2 at int conc: +69–79% vs AWQ / NVFP4 at batch-1, from MTP speculative decoding
> 👇
> https://t.co/QunrvTrmfb
> https://x.com/canadaquant/status/2071435463928525108
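To see why keeping a draft head helps most at batch-1, here is a minimal sketch of the standard speculative-decoding estimate. It is not Canada Quant Labs' method or benchmark. It assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per verification step, and that each draft token costs `draft_cost` relative to one target-model forward pass. The post gives none of these values, so every number passed in is hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes i.i.d. acceptance with probability alpha for each of the k
    drafted tokens; the step always yields at least one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost is the cost of one draft token relative to one target
    forward pass (small for a lightweight MTP head; an assumption here).
    """
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the shape of the curve.
    for alpha in (0.6, 0.7, 0.8, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=1, draft_cost=0.05), 3))
```

The estimate depends entirely on the acceptance rate, which is one plausible reason to keep the draft head in BF16 rather than quantizing it: a degraded draft head would lower `alpha` and shrink the gain.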
#TechUpdate #Models #DeveloperTools