AI Featured Updates · Smart Score: 60

Artificial Analysis Releases the AA-Briefcase Benchmark

Source: Twitter following list
Author: Artificial Analysis (@ArtificialAnlys)
Published: 2026-06-22
Collected: 2026-06-22
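Why the AI recommends this
This benchmark provides concrete cost and performance data; the original post is worth reading for its detailed model rankings and cost-efficiency analysis.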
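Key takeaways
Artificial Analysis has released AA-Briefcase, an agentic knowledge work benchmark that tests models' ability to build deliverables such as financial models and board presentations in long-horizon tasks. The highest-performing model, Claude Fable 5, costs over $20 per task, which makes cost efficiency a key factor in model selection. Open-weights models make up the majority of the cost-performance Pareto frontier: GLM 5.2 (max), at $2.40 per task, scores within 90 Elo points of Claude Opus 4.8 (max) while costing 65% less; DeepSeek V4 Pro (max), at $0.08 per task, scores about 60 Elo points above Gemini 3.5 Flash while costing over 98% less.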
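The "costs X% less" figures imply approximate per-task costs for the comparison models, which the post does not state directly. The sketch below is a back-of-the-envelope calculation and not part of the original post: the helper name `implied_baseline_cost` is hypothetical, and because the quoted percentages are rounded, the results are only rough estimates.

```python
def implied_baseline_cost(cost: float, pct_less: float) -> float:
    """Return the baseline cost implied by `cost` being `pct_less` percent cheaper."""
    if not 0 <= pct_less < 100:
        raise ValueError("pct_less must be in [0, 100)")
    return cost / (1 - pct_less / 100)


# Figures quoted in the post; the percentages are rounded, so treat results as rough.
opus_estimate = implied_baseline_cost(2.40, 65)  # Claude Opus 4.8 (max), estimate
flash_floor = implied_baseline_cost(0.08, 98)    # Gemini 3.5 Flash, lower bound ("over 98%")

print(f"Implied Claude Opus 4.8 (max) cost: about ${opus_estimate:.2f} per task")
print(f"Implied Gemini 3.5 Flash cost: more than ${flash_floor:.2f} per task")
```

Under these assumptions the comparison models would cost roughly $6.86 and more than $4.00 per task respectively; the published results remain the authoritative source for actual costs.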
Full text
Open-weights models make up the majority of the cost-performance Pareto frontier on AA-Briefcase, our new agentic knowledge work benchmark

Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long-horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi-week projects.

The cost to run a single AA-Briefcase task varies by over 700x in the initial set of models we tested. With the highest-performing model, Claude Fable 5, costing over $20 per task, cost efficiency is a key element in model selection for knowledge work.

While the two highest-performing models on the cost-performance Pareto frontier are proprietary models from @AnthropicAI, most of the remaining frontier is made up of open-weights models.

Notable cost-efficiency trade-offs:
➤ At $2.40 per task, GLM 5.2 (max) from @Zai_org scores within 90 Elo points of Claude Opus 4.8 (max) while costing 65% less
➤ At $0.08 per task, DeepSeek V4 Pro (max) from @deepseek_ai scores ~60 Elo points above Gemini 3.5 Flash while costing over 98% less

![photo](https://pbs.twimg.com/media/HLcWZNQaoAA1g3d.jpg)

Artificial Analysis (@ArtificialAnlys):
For full AA-Briefcase analysis, read our launch article here: https://t.co/QE7luoJFev
For full AA-Briefcase results, see here: https://t.co/RgkI2BmI6R
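For readers unfamiliar with the term, a model sits on the cost-performance Pareto frontier when no other model is both cheaper (or equally priced) and higher scoring. The following is a minimal sketch of that idea, not Artificial Analysis's methodology: the `Result` type, the `pareto_frontier` function, and all sample figures are hypothetical placeholders rather than AA-Briefcase data.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost: float  # USD per task (lower is better)
    elo: float   # benchmark score (higher is better)


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no cheaper-or-equal-cost result outscores or matches."""
    frontier: list[Result] = []
    best_elo = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, visit the higher Elo first.
    for r in sorted(results, key=lambda r: (r.cost, -r.elo)):
        # Keep a result only if it beats every cheaper result seen so far.
        if r.elo > best_elo:
            frontier.append(r)
            best_elo = r.elo
    return frontier


# Hypothetical placeholder data, NOT actual AA-Briefcase results.
sample = [
    Result("model_a", 0.10, 1000),
    Result("model_b", 0.50, 950),   # dominated by model_a (cheaper and higher Elo)
    Result("model_c", 2.00, 1200),
    Result("model_d", 5.00, 1150),  # dominated by model_c
    Result("model_e", 20.00, 1400),
]

print([r.name for r in pareto_frontier(sample)])
```

With this placeholder data, `model_b` and `model_d` fall off the frontier because a cheaper model already scores higher. The same logic explains why an expensive top scorer and a very cheap mid scorer can both be on the frontier.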
#Benchmark #Analysis #TechUpdate