AI Curated Updates · Smart score: 65

LLM-as-a-Judge Reliability Audit

Source: Twitter following list

Author: elvis (@omarsar0)

Published: 2026-06-22

Collected: 2026-06-22

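Why the AI recommends this

Reveals the "consistency paradox", in which high repeatability and high bias coexist, and recommends using Cohen's kappa when building evaluation systems, because raw agreement is inflated by chance.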

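Key takeaways

Researchers audited the reliability of 21 LLM-as-a-Judge models from 9 providers across roughly 541,000 judgments on MT-Bench, JudgeBench, and RewardBench. Replacing exact-match agreement with Cohen's kappa lowers agreement scores on MT-Bench by 33 to 41 points, and judge rankings shift by up to 14 positions across benchmarks. The study also identifies a consistency paradox: two production-deployed judges with test-retest reliability above 0.95 still carry severe position bias above 0.10.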
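The gap between exact-match agreement and Cohen's kappa can be illustrated with a small sketch. This is not the paper's code: the function names and the toy labels are assumptions made for illustration. It uses the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies.

```python
from collections import Counter


def exact_match(judge, human):
    """Fraction of items on which the two label sequences agree (p_o)."""
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = exact_match(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently with its own label frequencies.
    labels = set(judge_counts) | set(human_counts)
    p_e = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: humans prefer response "A" on 8 of 10 items,
# while the judge answers "A" every time regardless of content.
human_labels = ["A"] * 8 + ["B"] * 2
judge_labels = ["A"] * 10

raw_agreement = exact_match(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
print(raw_agreement, kappa)
```

In this toy case the judge agrees with the human labels on 80% of items, yet its kappa is 0, because a judge that always answers "A" would reach that agreement by chance alone on a label distribution this skewed.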

Full text

elvis (@omarsar0) reposted a post from DAIR.AI (@dair_ai): The largest LLM-as-a-Judge reliability audit yet. Researchers ran 21 judges from nine providers over roughly 541,000 judgments on MT-Bench, JudgeBench, and RewardBench. Findings: Validating a judge with exact-match agreement overstates its skill, because exact match does not correct for chance. Switching to Cohen's kappa deflates agreement by 33 to 41 points on MT-Bench, and judge rankings move by up to 14 positions across benchmarks. There is also a consistency paradox. Two production-deployed judges score above 0.95 test-retest reliability while carrying severe position bias above 0.10, so a judge can agree with itself every time and still be wrong in the same direction every time. Paper: https://t.co/Jh8U1R2svQ Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c ![photo](https://pbs.twimg.com/media/HLbKeGKbkAAGZUh.jpg)

#Research #Benchmarks #LLM
