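AI Curated Feed
Smart score: 60
elie's in-depth analysis of the Fugu technical report
Why AI recommends this
This post exposes key shortcomings in Fugu's benchmarking and transparency; read the original for the technical details.
Key takeaways
elie analyzes the Fugu technical report and notes that the non-Ultra version is effectively a classifier/router: it scores 10 points below Opus on SWE Bench pro and gains only slightly on other benchmarks, with no cost figures disclosed. The Ultra version is a plan mode limited to 5 steps; it is opaque, and neither output tokens nor cost are reported.
Full text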
fantastic in depth:
https://x.com/eliebakouch/status/2068939729811468503?s=20
> **引用原帖 elie (@eliebakouch):**
> to be clear, this is a closed source orchestrator on top of closed source models. if before you didn't control the models, now you don't even control which ones are used or how much. this is not "AI sovereignty"
> i've also read the tech report to get an opinion on the technical stuff:
> fugu (not the ultra version) is basically a classifier that selects which model at each turn is most likely to answer correctly (in other words a router). this leads to -10 points on SWE Bench pro compared to opus, gets some gains on other benchmarks but very slight. argument could be that it reduces cost, but no information about this so it's likely the opposite. they also have an autoresearch benchmark where they compare to frontier models "Model A, B and C" which is really crazy to not be transparent about what models you compare against. let's also say that this probably doesn't support adding new llm out of the box since you need to retrain the classifier
> about fugu ultra, this is basically an advanced plan mode and orchestrator, this is a model that for a query outputs a plan with multiple "workflows". my understanding of workflows is that they say: "spawn model A subagents to achieve this, then use model B to judge it, then summarize this with model C" which is just a test time compute scaling strategy. i think this is an okish way to do it, but it's limited by the fact that they need to predict everything before the agents start working, which is why they limit this to 5 steps. imo you need to predict what to spawn at t+1 with the information you get at t, not with the info you get at t=0. there are also other issues such as fable 5 score on terminal bench being wrong and them being super vague and unclear about which models are in the LLM pool (they only mention closed source api ones)
> the biggest and most obvious issue is that they are introducing a "test time scaling" method with "best of N" over models, and they literally NEVER REPORT the number of output tokens or cost to achieve a benchmark/task
> the good comparison here is not with opus, but with opus with ultracode/workflows enabled, not with kimi, but with kimi swarm etc. very very confusing release
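
To make the cost-reporting point concrete, here is a minimal sketch of the kind of accounting the thread says is missing: output tokens and dollar cost for a single-model run versus a "best of N over models" run with a judge. Every model name, token count, and price below is a hypothetical placeholder, not a figure from the Fugu report or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One model call made while solving a single benchmark task."""
    model: str                            # hypothetical model label
    output_tokens: int                    # hypothetical token count
    usd_per_million_output_tokens: float  # hypothetical price

    @property
    def cost_usd(self) -> float:
        return self.output_tokens * self.usd_per_million_output_tokens / 1_000_000


def summarize(calls: list[Call]) -> dict[str, float]:
    """Totals that would need to sit next to a benchmark score."""
    return {
        "calls": len(calls),
        "output_tokens": sum(c.output_tokens for c in calls),
        "cost_usd": round(sum(c.cost_usd for c in calls), 6),
    }


# Baseline: one model answers the task once.
single = [Call("model-a", 4_000, 15.0)]

# Best of N over models: three candidates plus a judge pass.
best_of_n = [
    Call("model-a", 4_000, 15.0),
    Call("model-b", 6_000, 10.0),
    Call("model-c", 5_000, 2.0),
    Call("judge", 1_000, 15.0),
]

if __name__ == "__main__":
    print("single   :", summarize(single))
    print("best-of-n:", summarize(best_of_n))
```

With these made-up numbers the orchestrated run spends four times the output tokens of the baseline, which is why a score gain reported without such totals is hard to interpret.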