AI Curated Feed, smart score: 60

Gemma-4-26B: 16-Way Parallel Inference Demo

Source: Twitter following list

Author: Google Gemma (@googlegemma)

Published: 2026-06-23

Collected: 2026-06-23

Why the AI recommends this

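Shows the parallel inference performance of an MoE model on a unified-memory device: 18 tok/s per stream even without flashinfer. Worth watching for what it suggests about deploying this model on consumer-grade hardware.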

Key takeaways

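On a single DGX Spark (128 GB unified memory), Onur Solmaz ran NVIDIA's Gemma-4-26B-A4B-NVFP4 model with 16 parallel inference streams, at 18 output tokens/s per stream and about 300 tokens/s in aggregate. He reports that concurrency can go up to 32 streams, and that the setup does not yet use flashinfer.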
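The per-stream and aggregate figures can be cross-checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes every stream sustains the same rate, which the post does not state. The only inputs taken from the post are 16 streams, 18 tok/s per stream, and roughly 300 tok/s aggregate.

```python
# Sanity check of the throughput figures quoted in the post.
# Assumption (not stated in the post): all streams run at the same rate.

def aggregate_throughput(streams: int, per_stream_tok_s: float) -> float:
    """Aggregate tokens/s if every stream sustains per_stream_tok_s."""
    return streams * per_stream_tok_s


def implied_per_stream(streams: int, aggregate_tok_s: float) -> float:
    """Per-stream tokens/s implied by an aggregate figure."""
    return aggregate_tok_s / streams


STREAMS = 16                # parallel runs reported in the post
PER_STREAM_TOK_S = 18.0     # output tokens/s per stream, as reported
AGGREGATE_TOK_S = 300.0     # aggregate tokens/s, as reported (rounded)

estimate = aggregate_throughput(STREAMS, PER_STREAM_TOK_S)
implied = implied_per_stream(STREAMS, AGGREGATE_TOK_S)

print(f"{STREAMS} streams x {PER_STREAM_TOK_S} tok/s = {estimate} tok/s")
print(f"{AGGREGATE_TOK_S} tok/s / {STREAMS} streams = {implied} tok/s per stream")
```

The product comes to 288 tok/s, so the quoted 300 tok/s reads as a rounded figure, consistent with a per-stream rate slightly above 18 tok/s. The post gives no per-stream rate for 32 streams, so no estimate is made for that case.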

Full text

Model link: https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4

Original post by @onusoz: https://x.com/onusoz/status/2067489871376364023

> **Quoted post from Onur Solmaz (@onusoz):**
>
> 16x parallel Gemma-4-26B-A4B-NVFP4 runs 🤯🤯🤯
>
> 18 output tokens/s, aggregate 300 tok/s 🫪
>
> 1 DGX Spark with 128 GB unified memory
>
> Concurrency so high I had to demo it programmatically
>
> It can go up to 32 even! 🤯 But then my screen would not have been readable for you
>
> And this is not even using flashinfer yet! Please reply if you know whether support is on the way
>
> Note that this is not dumb e4b or e2b that you can run on the average laptop. This is the big Gemma MoE
>
> Model link: https://t.co/JWh2R0alaQ

#AI #Models #Technology
