AI Curated Feed · Smart Score: 60

Claw-SWE-Bench Evaluation Report

Source: Twitter following list

Author: AK (@_akhaliq)

Published: 2026-06-15

Indexed: 2026-06-17

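Why AI Recommends This

Helps compare the cost and effectiveness of different code-generation harnesses; worth consulting when choosing one for a project.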

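Key Takeaways

TokenRhythm Technologies and partner institutions jointly released the Claw-SWE-Bench benchmark, which evaluates code-generation harnesses on one shared set of 350 GitHub issues (8 languages, 43 repositories). Reported results: changing only how the agent hands in its work raises the success rate from 19% to 73%; changing only the harness shifts success rates by up to 27 points; changing only the model changes cost by up to 170×, even when final results are just 8 points apart. An 80-task Lite version costs about 23% of the full run.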

Full Text

AK (@_akhaliq) reposted a post from OpenSquilla (@OpenSquilla):

Run the same coding tasks while varying the model and the harness (the layer wrapped around the model that actually drives it), and the spread is wild:

Change only how the agent hands in its work → the success rate jumps from 19% to 73%.

Change only the harness → success rates differ by up to 27 points.

Change only the model → the bill can differ by up to 170×, even when the final results are just 8 points apart.

You really should dig into Claw-SWE-Bench, just released on GitHub. It's the latest paper-and-benchmark jointly released by TokenRhythm Technologies, Infinigence AI, City University of Hong Kong, SEE Fund, Peking University, Shanghai Jiao Tong University, Beijing Jiaotong University, and Tsinghua University: a remarkably principled benchmark that actually reflects what a harness can do.

Picking the right harness matters a lot. But among OpenClaw, Hermes, ZeroClaw, GenericAgent, and NanoBot, which one is actually best at coding tasks? Gut feeling? Or a real test? And if you test, how? You test it your way, I test it mine, so how do the results even compare?

Claw-SWE-Bench's point is simple: every harness reports its score bundled with its own tasks, budget, prompts, and model, so you can never tell whether a high score comes from a strong model, a strong harness, or easy problems.

Claw-SWE-Bench ends this "everyone-tests-their-own-way" mess by building one shared exam that isolates the harness as the single variable being compared:

Same exam paper: 350 real GitHub issues across 8 languages and 43 repositories. Every harness solves the same set.

Same rules: identical problem statements and the same budget (max 1 hour per task, one attempt only, fixed concurrency), all scored by the same official SWE-bench grader.

The key move is to judge the code, not the talk: whether a harness outputs JSON, plain text, or nothing at all, none of it counts. The grade rests solely on which files it actually changed in the repo. That's what finally lets wildly different harnesses sit at the same table.

Anti-cheating: some test environments let the AI peek at "the answer from the future." The paper scrubbed all of these leaks.

It scores cost, not just correctness: every harness must also report how much money it burned, how long it took, and its cache hit rate, because two setups with near-identical accuracy can have bills that differ by 100×.

Adding a new harness? Just write a small adapter. Any harness that implements a handful of fixed interfaces plugs straight into the exam, with no changes to the task set or grader. So it's not a one-off test of these five; it's a standard that can keep growing.

It also ships an 80-task Lite version that costs only ~23% of the full run yet reproduces roughly the same rankings, which is handy for fast iteration.

Paper & code: https://t.co/FOPh6hba6z

![photo](https://pbs.twimg.com/media/HK0JqEQbYAAZs_G.jpg)

![photo](https://pbs.twimg.com/media/HK0JsoUaUAAzFrj.jpg)
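To make the "judge the code, not the talk" and "small adapter" ideas concrete, here is a minimal Python sketch. It is not Claw-SWE-Bench's actual interface: `HarnessAdapter`, `RunReport`, `evaluate`, and `ToyHarness` are hypothetical names, and the real benchmark hands the resulting changes to the official SWE-bench grader rather than only listing files. The sketch snapshots a repository before and after a harness runs, ignores whatever the harness prints, and reads the outcome from the files that changed, alongside a self-reported cost record.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Protocol, Set, Tuple


@dataclass
class RunReport:
    """Hypothetical per-task cost record; field names are assumptions."""
    cost_usd: float
    wall_seconds: float
    cache_hit_rate: float


class HarnessAdapter(Protocol):
    """Hypothetical adapter interface; the benchmark's real one may differ."""

    def run(self, problem_statement: str, repo_dir: Path) -> RunReport: ...


def snapshot(repo_dir: Path) -> Dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        p.relative_to(repo_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(repo_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Return paths that were added, removed, or modified between snapshots."""
    return {
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    }


def evaluate(
    adapter: HarnessAdapter, problem_statement: str, repo_dir: Path
) -> Tuple[Set[str], RunReport]:
    """Run one harness on one task; only on-disk changes count as its answer."""
    before = snapshot(repo_dir)
    report = adapter.run(problem_statement, repo_dir)
    after = snapshot(repo_dir)
    return changed_files(before, after), report


class ToyHarness:
    """Stand-in harness that edits one file and prints a status message."""

    def run(self, problem_statement: str, repo_dir: Path) -> RunReport:
        (repo_dir / "calc.py").write_text("def add(a, b):\n    return a + b\n")
        print('{"status": "done"}')  # the "talk": evaluate() never reads this
        return RunReport(cost_usd=0.02, wall_seconds=1.5, cache_hit_rate=0.0)


def demo() -> Tuple[Set[str], RunReport]:
    """Build a throwaway two-file repo and evaluate the toy harness on it."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "calc.py").write_text("def add(a, b):\n    return a - b\n")
        (repo / "README.md").write_text("toy repo\n")
        return evaluate(ToyHarness(), "add() subtracts instead of adding", repo)
```

Calling `demo()` returns the set of changed paths and the cost record. Because `evaluate` depends only on the `run` method, a different harness could be swapped in without touching the task or the comparison logic, which is the property the post attributes to the benchmark's adapter design.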

#Benchmark #Research
