AI Featured Updates
Smart Score: 65
LLM-as-a-Judge Reliability Audit
Why AI Recommends This
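Reveals a "consistency paradox" in which high repeatability and high bias can coexist, and recommends using Cohen's kappa when building evaluation systems to correct the overestimate of agreement caused by chance.
Key Takeaways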
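Researchers audited the reliability of 21 LLM-as-a-Judge models from 9 providers across roughly 541,000 judgments on MT-Bench, JudgeBench, and RewardBench. The study found that replacing exact-match agreement with Cohen's kappa lowers agreement scores on MT-Bench by 33 to 41 points, and that model rankings shift by up to 14 positions across benchmarks. It also identifies a consistency paradox: a model with test-retest reliability above 0.95 can still carry severe position bias above 0.10.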
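To make the gap between exact match and Cohen's kappa concrete, here is a minimal sketch using the standard two-rater kappa formula, kappa = (p_o - p_e) / (1 - p_e). The function names and the toy labels are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def exact_match(a, b):
    """Fraction of items on which two label lists agree (observed agreement p_o)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement p_e."""
    n = len(a)
    p_o = exact_match(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently according to its own marginals.
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return math.nan  # kappa is undefined when both raters use a single shared label
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): the human prefers "A" on 8 of 10 items,
# and the judge answers "A" on every item regardless of content.
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

print(exact_match(human, judge))   # 0.8
print(cohens_kappa(human, judge))  # 0.0
```

In this toy case a judge that ignores the input entirely still reaches 80% exact match, while kappa reports no agreement beyond chance.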
Full Text
elvis (@omarsar0) reposted a post from DAIR.AI (@dair_ai):
The largest LLM-as-a-Judge reliability audit yet.
Researchers ran 21 judges from nine providers over roughly 541,000 judgments on MT-Bench, JudgeBench, and RewardBench.
Findings:
Validating a judge with exact-match agreement overstates its skill, because exact match does not correct for chance.
Switching to Cohen's kappa deflates agreement by 33 to 41 points on MT-Bench, and judge rankings move by up to 14 positions across benchmarks.
There is also a consistency paradox. Two production-deployed judges score above 0.95 test-retest reliability while carrying severe position bias above 0.10, so a judge can agree with itself every time and still be wrong in the same direction every time.
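The consistency paradox can be illustrated with a small sketch. The post does not define how the paper computes test-retest reliability or position bias, so both metrics below are simplified stand-ins chosen for illustration: `retest_agreement` is the share of identical verdicts across two repeated runs, and `first_slot_excess` is how far the first-slot pick rate deviates from 0.5 when every pair is judged in both orders. All names and data are hypothetical.

```python
def retest_agreement(run_1, run_2):
    """Share of items where two repeated runs return the same verdict."""
    if len(run_1) != len(run_2) or not run_1:
        raise ValueError("runs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(run_1, run_2)) / len(run_1)


def first_slot_excess(original_order, swapped_order):
    """Assumed position-bias proxy: |P(pick slot 1) - 0.5| over both orderings.

    Each verdict is the slot the judge picked (1 or 2). A judge that follows
    content rather than position picks slot 1 in exactly half of the
    presentations once every pair has been shown in both orders.
    """
    picks = list(original_order) + list(swapped_order)
    if not picks:
        raise ValueError("need at least one verdict")
    return abs(picks.count(1) / len(picks) - 0.5)


# Hypothetical judge that always prefers whichever answer is shown first.
run_1 = [1, 1, 1, 1]    # first pass, original order
run_2 = [1, 1, 1, 1]    # repeated pass, original order
swapped = [1, 1, 1, 1]  # same pairs with the answer order reversed

print(retest_agreement(run_1, run_2))     # 1.0
print(first_slot_excess(run_1, swapped))  # 0.5
```

The judge here is perfectly repeatable and maximally position-biased at once, which is why a high self-consistency score alone says little about correctness.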
Paper: https://t.co/Jh8U1R2svQ
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
