AI Curated Updates
Smart score: 60
GLM-5.2 4-bit Quantization Speedup
Why the AI recommends this
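This quantization scheme keeps the MTP speculative-decoding head, which gives a significant speedup at low batch sizes. It is worth consulting as a reference implementation.
Key takeaways
Canada Quant Labs quantized GLM-5.2 (744B MoE) to 4-bit while keeping the MTP draft head in BF16. They report quality matching the FP8 release, a footprint of 4×H200 instead of 8, and 69–79% higher speed than AWQ / NVFP4 at batch-1.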
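
As a rough check on the GPU-count claim, the sketch below estimates weight memory only. It assumes every one of the 744B parameters is stored at a uniform bit width and that each H200 has 141 GB of HBM. It ignores the BF16 MTP head, quantization scales, the KV cache, and activations, so the helper names and results are illustrative, not figures from the post.

```python
import math

PARAMS = 744e9       # total parameter count quoted in the post
H200_HBM_GB = 141    # HBM capacity of one H200, in GB

def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes), assuming a uniform bit width."""
    return params * bits_per_weight / 8 / 1e9

def min_gpus_for_weights(gb: float, per_gpu_gb: float = H200_HBM_GB) -> int:
    """Smallest GPU count whose combined HBM holds the weights alone."""
    return math.ceil(gb / per_gpu_gb)

fp8_gb = weight_gb(PARAMS, 8)
int4_gb = weight_gb(PARAMS, 4)

print(f"FP8 weights:   {fp8_gb:.0f} GB, at least {min_gpus_for_weights(fp8_gb)} GPUs")
print(f"4-bit weights: {int4_gb:.0f} GB, at least {min_gpus_for_weights(int4_gb)} GPUs")
```

Weights alone come to about 744 GB at FP8 and 372 GB at 4-bit. Once KV cache headroom and power-of-two tensor-parallel sizes are taken into account, that is plausibly consistent with the 8 versus 4 GPU figures in the post, though the post itself does not give this breakdown.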
Full text
Canadian bro so low key, this GLM-5.2-MTP tweet got zero likes?? Claimed 40% higher tok/s.
> **Quoted post by Canada Quant Labs (@canadaquant):**
> We quantized GLM-5.2 (744B MoE) to 4-bit — and kept its MTP draft head in BF16.
> → Matches the FP8 release on quality
> → Runs on 4×H200 instead of 8
> → Fastest 4-bit GLM-5.2 at int conc: +69–79% vs AWQ / NVFP4 at batch-1, from MTP speculative decoding
> 👇
> https://t.co/QunrvTrmfb
> https://x.com/canadaquant/status/2071435463928525108
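
The batch-1 speedup in the quoted post is attributed to MTP speculative decoding. A minimal toy sketch of the general draft-and-verify idea (greedy variant) follows. Everything here is hypothetical: `target_next` and `draft_next` are made-up stand-ins for the large model and its draft head, not GLM-5.2 code. In a real system the verification of all drafted positions is a single batched forward pass; the toy counts it as one pass while calling the function per position.

```python
from typing import List, Tuple

def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large model': deterministic next token from context."""
    return (sum(ctx) * 31 + len(ctx)) % 7

def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap draft: agrees with the target except when
    the context length is a multiple of 4."""
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 7

def greedy_decode(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """Plain decoding: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):], n

def speculative_decode(prompt: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, verify them in one (counted) target pass, and keep
    the longest matching prefix plus one token chosen by the target."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        passes += 1  # one verification pass covers all k + 1 positions
        accepted: List[int] = []
        for guess in draft:
            t = target_next(out + accepted)
            accepted.append(t)       # the target's token is always the one kept
            if t != guess:
                break                # first mismatch: discard the rest of the draft
        else:
            accepted.append(target_next(out + accepted))  # bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], passes

baseline, base_passes = greedy_decode([1], 16)
spec, spec_passes = speculative_decode([1], 16, k=3)
print(spec == baseline, base_passes, spec_passes)
```

The output sequence is identical to plain greedy decoding, but the target model runs fewer passes whenever the draft is mostly right. That is why the benefit shows up mainly at low batch sizes, where decoding is limited by per-pass latency and not by throughput.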