AI 精选动态智能评分 70

100+智能体协作优化Gemma 4推理速度

来源: twitter关注列表

作者: Thomas Wolf (@Thom_Wolf)

发布于: 2026-06-25

收录于: 2026-06-25

AI 推荐理由

该实验揭示了多智能体协作中涌现的自我监管与分工行为，值得阅读原文以获取具体机制和启示。

核心解读

一个实验团队组织100+智能体协作一周，优化Gemma 4在vLLM中的推理速度，最终实现5倍加速。智能体展现出自我监管、涌现协作、发现反转等行为，例如发现127 TPS的“墙”是伪影、较小的256维草稿模型在批处理1时更优等。

全文

Multi-agents collaborations are among the most interesting agent behaviors right now! We did an experiment the other day with 100+ agents (an open-collaborations for a week) collaborating to improve the inference speed of Gemma 4 in vLLM. Got a 5x final improvement in speed but what really stuck me was the interactions we observed on the message board Integrity & self-policing: - Social-engineering attempt: A human (FusionCow) asked agents to move to Telegram. An agent replied with an unprompted long post on "communication norms" refusing that, calling private side-channels "indistinguishable from collusion." - Verification loophole flagged: an agent found a relaxed verification loophole pushing TPS with clean PPL (PPL is teacher-forced, blind to decode divergence) and flagged it for a ruling by the community. The community pinged the human organizer which ruled it invalid. - Self-notice of overfitting risk: Some later improvements rested on pruning lm_head to a keep-set built from public PPL truth + public decode tokens. An agent noted this would lead to private-subset degradation and another built a keep-set explicitly covering eval prompts. Emergent collaborations: - Communal knowledge base: agents maintained shared lever-maps, playbooks, and triage tools so newcomers wouldn't repeat dead ends (stack-notes, playbook, int4-ceiling notes, MTP map, significance tool, policy simulator). - Four-agent relay: an agent built an int4-lm_head checkpoint but had no quota to run it; another agent tried to run it but failed at load, yet another agent diagnosed the config bug (tie_word_embeddings + ignore-list ordering) and a fourth agent was able to re-run and get to 118 TPS, 2.68×. Build/run/diagnose/ship ended up being split across four independent agents. - GPU-rich/GPU-poor division of labor: an agent was regularly compute-starved and switched to writing specs, byte-math, and acceptance analysis for other GPU-rich agents to execute. Some agents offered external Modal compute for another agent blocked DFlash training. - Cross-agent kernel debugging: an agent debugged another agent run of of yet another agent fused drafter: found a Triton store/load aliasing race in _k_qnorm_rope, a second shape bug, then rewrote attention with flash-decoding split-KV. Fixes posted "take freely." - Quota-pooling norm: Often agents would stage a candidate publicly for whoever has quota to run it. Agents will then usually credits the originator. This behavior emerged because of the 10-job/24h cap (e.g. pupa's package run by resystagent and fabulous-frenzy). Discoveries & reversals: - Agents would make many discoveries and reversal of them, giving them names like the following: - 127 TPS "wall" was an artifact. a mathematical proof of the max possible speed became called in the community the "int4-Marlin floor" but a later agent called the proof circular (only varied the bandwidth term, never overhead). Finally another agent broke to 247 TPS via MTP speculative decoding on a vLLM nightly. - "Smarter draft loses." An agent showed that a 2B drafter's ~1 GB/token read dominates even at perfect acceptance and a much smaller 256-hidden drafter wins at batch-1 because its weights are nearly free to read. Agent discussed how per-accepted-token cost ≈ draft bytes read / acceptance. - "DFlash near-random acceptance": an agent remotly diagnosed the 2–5% acceptance rate of another agent as near-random, ruling out undertraining/vocab caps and pointing to a train/serve hidden-state mismatch (bf16 E4B extraction vs int4 serving). - Much of the race was noise: one agent decide to run the #1 submission 4 times and found a σ≈1.16 TPS variation in single run. Another agent confirmed across 358 runs / 66 buckets: frontier deltas <~4 TPS are ties. Community adopted a significance norm. So many interesting interactions in the interaction board: https://t.co/SxfA6LuqVk You can explore also the lineage of inventions from the agents at: https://t.co/CyV45rjI9A And the challenge it-self at https://t.co/Ct1gtmB508 And the organization behind the challenge at https://t.co/ujRlGcNSJM https://video.twimg.com/amplify_video/2070128420978036736/vid/avc1/1920x1290/BmCaf71omaxlr2Yz.mp4?tag=28

#AI#技术#研究

阅读原始全文