AI Featured Updates
Smart score: 60
Gemma-4-26B: 16-way parallel inference demo
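AI recommendation rationale
Shows the parallel inference performance of an MoE model on a unified-memory device: each stream still reaches about 18 tok/s without flashinfer. The model's deployment potential on consumer-grade hardware is worth watching.
Key takeaways
Onur Solmaz ran NVIDIA's Gemma-4-26B-A4B-NVFP4 model on a single DGX Spark (128 GB unified memory) with 16 parallel inference streams, reaching about 18 output tokens/s per stream and about 300 tokens/s in aggregate. He reports that concurrency can go up to 32 streams, and that the setup does not yet use flashinfer.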
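The per-stream and aggregate figures can be cross-checked with simple arithmetic. The sketch below uses only the three numbers quoted in the post (16 streams, 18 tok/s, 300 tok/s); the variable names are illustrative, and the small gap between the two results most likely reflects rounding in the reported figures.

```python
# Figures as reported in the post; both throughput numbers appear to be rounded.
streams = 16
per_stream_tok_s = 18
reported_aggregate_tok_s = 300

# Aggregate implied by the per-stream figure.
naive_aggregate = streams * per_stream_tok_s

# Per-stream rate implied by the reported aggregate.
implied_per_stream = reported_aggregate_tok_s / streams

print(f"{streams} x {per_stream_tok_s} tok/s = {naive_aggregate} tok/s")
print(f"{reported_aggregate_tok_s} tok/s / {streams} = {implied_per_stream} tok/s per stream")
```

So 16 streams at 18 tok/s give 288 tok/s, consistent with the rounded "aggregate 300 tok/s" in the post.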
Full text
Model link: https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4
Original post by @onusoz: https://x.com/onusoz/status/2067489871376364023
> **Quoted original post by Onur Solmaz (@onusoz):**
> 16x parallel Gemma-4-26B-A4B-NVFP4 runs 🤯🤯🤯
> 18 output tokens/s, aggregate 300 tok/s
> 1 DGX Spark with 128 GB unified memory
> Concurrency so high I had to demo it programmatically
> It can go up to 32 even! 🤯 But then my screen would not have been readable for you
> And this is not even using flashinfer yet! Please reply if you know whether support is on the way
> Note that this is not dumb e4b or e2b that you can run on the average laptop. This is the big Gemma MoE
> Model link: https://t.co/JWh2R0alaQ
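
The post does not say how the programmatic demo was scripted or which serving stack was used. As a rough illustration only, the sketch below shows one way to issue N requests concurrently and measure aggregate throughput with Python's `asyncio`. Every name here (`run_parallel`, `simulated_generate`, `demo`) is hypothetical, and `simulated_generate` is a stand-in that merely sleeps; in practice it would be replaced by a call to whatever inference server hosts the model.

```python
import asyncio
import time
from typing import Awaitable, Callable, List

# Hypothetical interface: takes a prompt, returns the number of output tokens produced.
GenerateFn = Callable[[str], Awaitable[int]]


async def simulated_generate(prompt: str, tokens: int = 36,
                             tok_per_s: float = 1800.0) -> int:
    """Stand-in for a real model call (assumption, not the author's code).

    Sleeps as if decoding `tokens` tokens. The rate is deliberately sped up
    so the example finishes quickly; it is not a measurement.
    """
    await asyncio.sleep(tokens / tok_per_s)
    return tokens


async def run_parallel(generate: GenerateFn, prompts: List[str]) -> dict:
    """Issue all prompts concurrently and report aggregate throughput."""
    start = time.perf_counter()
    counts = await asyncio.gather(*(generate(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return {
        "streams": len(prompts),
        "total_tokens": total,
        "elapsed_s": elapsed,
        "aggregate_tok_per_s": total / elapsed,
    }


def demo(n_streams: int = 16) -> dict:
    """Run n_streams simulated requests at once and return the summary."""
    prompts = [f"prompt {i}" for i in range(n_streams)]
    return asyncio.run(run_parallel(simulated_generate, prompts))


if __name__ == "__main__":
    print(demo(16))
```

Because the requests overlap, wall-clock time stays close to that of a single request, so aggregate throughput grows with the number of streams until the server saturates.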