AI Featured Feed · Smart Score: 65

LLM-as-a-Judge Reliability Audit

Source: Twitter following list
Author: elvis (@omarsar0)
Published: 2026-06-22
Collected: 2026-06-22
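Why AI Recommends This
Reveals a "consistency paradox" in which high repeatability and high bias can coexist, and recommends using Cohen's kappa when building evaluation systems, to correct for agreement that is overstated because of chance.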
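Key Takeaways
Researchers audited the reliability of 21 LLM-as-a-Judge models from 9 providers over roughly 541,000 judgments on MT-Bench, JudgeBench, and RewardBench. Replacing exact-match agreement with Cohen's kappa lowers agreement scores on MT-Bench by 33 to 41 points, and judge rankings shift by up to 14 positions across benchmarks. The study also identifies a consistency paradox: two production-deployed judges score above 0.95 on test-retest reliability while still carrying severe position bias above 0.10.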
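To make the chance-correction point concrete, here is a minimal sketch of exact-match agreement next to Cohen's kappa, computed as (p_o - p_e) / (1 - p_e). The labels are invented for illustration and are not data from the paper; the function names are hypothetical.

```python
from collections import Counter


def exact_match(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = exact_match(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same label
    # if each drew independently from its own label frequencies.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 0.0  # kappa is undefined here; this sketch returns 0.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: humans prefer response "A" on 8 of 10 items,
# and a lazy judge answers "A" every time.
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

agreement = exact_match(human, judge)
kappa = cohens_kappa(human, judge)
print(f"exact match: {agreement:.2f}, kappa: {kappa:.2f}")
```

In this toy case the judge that always answers "A" reaches 0.80 exact-match agreement, yet its kappa is 0, because all of that agreement is what the label frequencies alone would produce.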
Full Text
elvis (@omarsar0) retweeted a post by DAIR.AI (@dair_ai): The largest LLM-as-a-Judge reliability audit yet. Researchers ran 21 judges from nine providers over roughly 541,000 judgments on MT-Bench, JudgeBench, and RewardBench. Findings: Validating a judge with exact-match agreement overstates its skill, because exact match does not correct for chance. Switching to Cohen's kappa deflates agreement by 33 to 41 points on MT-Bench, and judge rankings move by up to 14 positions across benchmarks. There is also a consistency paradox. Two production-deployed judges score above 0.95 test-retest reliability while carrying severe position bias above 0.10, so a judge can agree with itself every time and still be wrong in the same direction every time. Paper: https://t.co/Jh8U1R2svQ Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c ![photo](https://pbs.twimg.com/media/HLbKeGKbkAAGZUh.jpg)
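The consistency paradox can be illustrated with a toy judge that is perfectly repeatable and still leans toward whichever response is shown first. This is a sketch under stated assumptions: the post does not give the paper's metric definitions, so test-retest reliability is taken here as the share of identical verdicts across two runs, and position bias as the absolute gap between the slot-1 pick rate and 0.5 when every pair is shown in both orders. The judge, the quality scores, and all names are hypothetical.

```python
def retest_agreement(run1, run2):
    """Share of items where two repeated runs give the same verdict."""
    return sum(a == b for a, b in zip(run1, run2)) / len(run1)


def position_bias(verdicts_original, verdicts_swapped):
    """Assumed definition: |slot-1 pick rate - 0.5| over both orderings.

    Each verdict is the slot the judge chose (1 or 2). With every pair shown
    in both orders, an order-blind judge picks slot 1 half the time.
    """
    picks = list(verdicts_original) + list(verdicts_swapped)
    first_rate = sum(v == 1 for v in picks) / len(picks)
    return abs(first_rate - 0.5)


def toy_judge(score_first, score_second):
    """Hypothetical deterministic judge: picks slot 2 only when it is
    better by a clear margin, otherwise defaults to slot 1."""
    return 2 if score_second - score_first > 1 else 1


# Hypothetical quality scores for five response pairs (x, y).
pairs = [(5, 3), (4, 4), (3, 4), (2, 5), (4, 3)]

run1 = [toy_judge(x, y) for x, y in pairs]
run2 = [toy_judge(x, y) for x, y in pairs]      # repeated run, same order
swapped = [toy_judge(y, x) for x, y in pairs]   # same pairs, order flipped

reliability = retest_agreement(run1, run2)
bias = position_bias(run1, swapped)
print(f"test-retest: {reliability:.2f}, position bias: {bias:.2f}")
```

Because the toy judge is deterministic, its test-retest score is 1.0, while it picks slot 1 in 8 of 10 presentations, giving a position bias of 0.30 under the assumed definition. High self-agreement therefore says nothing about whether the verdicts are order-independent.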
#Research #Benchmarks #LLM