AI Featured Updates
Smart score: 60
Artificial Analysis releases the AA-Briefcase benchmark
Why AI recommends this
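This benchmark provides concrete cost and performance data; the original post is worth reading for its detailed model rankings and cost-efficiency analysis.
Key takeaways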
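Artificial Analysis has released AA-Briefcase, an agentic knowledge work benchmark that tests models' ability to build deliverables such as financial models and board presentations on long-horizon tasks. The highest-performing model, Claude Fable 5, costs over $20 per task, so cost efficiency is a key factor in model selection. Open-weights models make up the majority of the cost-performance Pareto frontier: GLM 5.2 (max) scores within 90 Elo points of Claude Opus 4.8 (max) at $2.40 per task (65% lower cost), and DeepSeek V4 Pro (max) scores about 60 Elo points above Gemini 3.5 Flash at $0.08 per task (over 98% lower cost).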
Full text
Open weights models make up the majority of the cost-performance Pareto frontier on AA-Briefcase, our new agentic knowledge work benchmark
Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long-horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi-week projects.
The cost to run a single AA-Briefcase task varies by over 700x across the initial set of models we tested. With the highest performing model, Claude Fable 5, costing over $20 per task, cost efficiency is a key element in model selection for knowledge work.
While the two highest performing models on the cost-performance Pareto frontier are proprietary models from @AnthropicAI, most of the remaining frontier is made up of open weights models.
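For readers unfamiliar with the term, a model sits on the cost-performance Pareto frontier when no other model is both cheaper (or equally priced) and higher scoring. The following is a minimal sketch of that selection rule, not Artificial Analysis's own method. The `Result` type, the `pareto_frontier` helper, and the four placeholder entries are hypothetical and are not AA-Briefcase results.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    cost_usd: float  # cost per task, in USD
    elo: float       # benchmark score (higher is better)


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep only results that no cheaper-or-equal-cost result matches or beats on Elo."""
    frontier = []
    best_elo = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, visit the higher Elo first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.elo)):
        if r.elo > best_elo:  # strictly better than everything cheaper
            frontier.append(r)
            best_elo = r.elo
    return frontier


# Placeholder numbers for illustration only (NOT real benchmark data).
demo = [
    Result("model-a", 20.00, 1300),
    Result("model-b", 5.00, 1200),
    Result("model-c", 6.00, 1150),  # dominated: model-b is cheaper and scores higher
    Result("model-d", 0.10, 1000),
]

frontier_names = [r.name for r in pareto_frontier(demo)]
print(frontier_names)
```

In this toy data, `model-c` drops off the frontier because `model-b` costs less and scores higher; the other three each offer a trade-off that no alternative beats on both axes.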
Notable cost efficiency trade-offs:
➤ At $2.40 per task, GLM 5.2 (max) from @Zai_org scores within 90 Elo points of Claude Opus 4.8 (max) while costing 65% less
➤ At $0.08 per task, DeepSeek V4 Pro (max) from @deepseek_ai scores ~60 Elo points above Gemini 3.5 Flash while costing over 98% less
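The percentages in the two comparisons above can be inverted to back out a rough per-task cost for the comparison models. This is a sketch of that arithmetic, assuming "X% less" means `cost = baseline * (1 - X/100)`; the helper name `implied_baseline_cost` is hypothetical, and because the post's percentages are rounded (and the second is stated as "over 98%"), the results are estimates rather than figures reported by Artificial Analysis.

```python
def implied_baseline_cost(cost_usd: float, fraction_cheaper: float) -> float:
    """Baseline cost implied by a model that costs `fraction_cheaper` less than it.

    Assumes cost_usd = baseline * (1 - fraction_cheaper).
    """
    if not 0.0 <= fraction_cheaper < 1.0:
        raise ValueError("fraction_cheaper must be in [0, 1)")
    return cost_usd / (1.0 - fraction_cheaper)


# GLM 5.2 (max): $2.40 per task, stated as 65% less than Claude Opus 4.8 (max).
opus_estimate = implied_baseline_cost(2.40, 0.65)

# DeepSeek V4 Pro (max): $0.08 per task, stated as over 98% less than
# Gemini 3.5 Flash, so this value is a lower bound on the implied cost.
gemini_lower_bound = implied_baseline_cost(0.08, 0.98)

print(f"Implied Claude Opus 4.8 (max) cost: ~${opus_estimate:.2f} per task")
print(f"Implied Gemini 3.5 Flash cost: at least ~${gemini_lower_bound:.2f} per task")
```

Under these assumptions, the implied figures come out to roughly $6.86 and at least roughly $4.00 per task respectively; the full results page linked in the post is the authoritative source for the actual costs.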

Artificial Analysis (@ArtificialAnlys): For full AA-Briefcase analysis, read our launch article here: https://t.co/QE7luoJFev
For full AA-Briefcase results, see here: https://t.co/RgkI2BmI6R