AI Featured Updates · Smart score: 60

Gemma-4-26B: 16-Way Parallel Inference Demo

Source: Twitter following list
Author: Google Gemma (@googlegemma)
Published: 2026-06-23
Indexed: 2026-06-23
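Why the AI recommends this
Demonstrates the parallel inference performance of an MoE model on a unified-memory device: 18 tok/s per stream even without flashinfer. Worth watching for this model's deployment potential on consumer-grade hardware.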
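Key takeaways
Onur Solmaz ran NVIDIA's Gemma-4-26B-A4B-NVFP4 model on a single DGX Spark (128 GB unified memory), reaching 16 parallel inference streams at 18 tokens/s per stream and about 300 tokens/s in aggregate. Concurrency can go as high as 32 streams, and the setup does not yet use flashinfer.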
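As a quick sanity check on the reported figures, the arithmetic relating per-stream and aggregate throughput can be sketched as below. The function names are illustrative (not from the original post), and the calculation assumes every stream decodes at the same rate, which the post does not state explicitly.

```python
def aggregate_throughput(streams: int, per_stream_tok_s: float) -> float:
    """Aggregate tokens/s, assuming all streams decode at the same rate."""
    return streams * per_stream_tok_s


def implied_per_stream(aggregate_tok_s: float, streams: int) -> float:
    """Per-stream tokens/s implied by an aggregate figure."""
    return aggregate_tok_s / streams


if __name__ == "__main__":
    # 16 streams at the reported 18 tok/s each
    print(aggregate_throughput(16, 18))   # 288
    # Per-stream rate implied by the reported aggregate of 300 tok/s
    print(implied_per_stream(300, 16))    # 18.75
```

Under that assumption, 16 streams at 18 tok/s give 288 tok/s, and an aggregate of 300 tok/s implies 18.75 tok/s per stream, so the two reported numbers agree once rounding is taken into account. The post gives no per-stream rate for the 32-stream case, so no aggregate is extrapolated for it here.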
Full text
Model link: https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4
Original post by @onusoz: https://x.com/onusoz/status/2067489871376364023

> **Quoted original post, Onur Solmaz (@onusoz):**
> 16x parallel Gemma-4-26B-A4B-NVFP4 runs 🤯🤯🤯
> 18 output tokens/s, aggregate 300 tok/s 🫪
> 1 DGX Spark with 128 GB unified memory
> Concurrency so high I had to demo it programmatically
> It can go up to 32 even! 🤯 But then my screen would not have been readable for you
> And this is not even using flashinfer yet! Please reply if you know whether support is on the way
> Note that this is not dumb e4b or e2b that you can run on the average laptop. This is the big Gemma MoE
> Model link: https://t.co/JWh2R0alaQ
#AI #Models #Technology