AI Featured Updates · Smart score: 60

AA-Briefcase benchmark reveals how long models take per task

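Source: Twitter following list
Author: Artificial Analysis (@ArtificialAnlys)
Published: 2026-06-24
Collected: 2026-06-24
Why the AI recommends this
Gives concrete time-per-task figures for models on long-horizon knowledge tasks. The efficiency of GPT-5.5 and the open-weights performance of GLM-5.2 are worth noting.
Key takeaways
Artificial Analysis released AA-Briefcase, a benchmark that measures, among other metrics, the average time models take on long-horizon knowledge tasks. Claude Opus 4.8 averages about 23 minutes per task, GPT-5.5 (xhigh) about 11 minutes, and GLM-5.2 about 16.3 minutes. Claude Fable 5, if it were still available, would likely take about 28.5 minutes. Across the benchmark, tool calls and execution account for only about 12% of total time.
Full text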
Agentic knowledge work can take frontier models over 20 minutes per task, as measured in AA-Briefcase, our new benchmark.

Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long-horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi-week projects.

One of the key metrics we measure in AA-Briefcase is average time per task. This is calculated using evaluation token usage, representative model output speeds, and tool execution time recorded during evaluation.

Key time-per-task takeaways from AA-Briefcase:

➤ Claude Opus 4.8 is the highest-scoring available model, but it is also one of the slowest, taking ~23 minutes per task on average.

➤ Several GPT-5.5 reasoning variants lie along the Pareto frontier of AA-Briefcase Elo vs. Time per Task, including medium, high, and xhigh. GPT-5.5 (xhigh) in particular stands out as one of the most efficient top-performing models: it takes about 11 minutes per task, around half the time of Opus 4.8, while ranking in the top 5 on overall AA-Briefcase Elo.

➤ GLM-5.2 also sits on the Pareto frontier, scoring 1261, ahead of GPT-5.5 (xhigh, 1159) but also taking more time per task (16.3 minutes). It is also the top-performing open-weights model on AA-Briefcase, with MiniMax-M3 the next best at 1113.

➤ If Claude Fable 5 were still available, it would likely take around 28.5 minutes per task: while it was live, we measured ~91 output tokens per second, ~3.1 minutes of tool execution time per task, and ~139,000 output tokens per task.

➤ Time spent on tool calls and execution accounts for only ~12% of the total time, with the remainder explained by output verbosity, turn usage, and inference speed.

![photo](https://pbs.twimg.com/media/HLnPZjoXIAA1Rv4.jpg)

Artificial Analysis (@ArtificialAnlys): For more details: https://t.co/QE7luoJ7oX
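The Claude Fable 5 estimate can be roughly reproduced from the three figures quoted in the post. The sketch below is not Artificial Analysis's published formula: it assumes time per task is simply generation time (output tokens divided by output speed) plus tool execution time, and it ignores input processing and request latency. The function name `estimate_minutes_per_task` is hypothetical.

```python
def estimate_minutes_per_task(output_tokens: float,
                              tokens_per_second: float,
                              tool_minutes: float) -> float:
    """Assumed model: generation time plus tool execution time, in minutes."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    generation_minutes = output_tokens / tokens_per_second / 60
    return generation_minutes + tool_minutes


# Rounded figures quoted in the post for Claude Fable 5.
fable_minutes = estimate_minutes_per_task(
    output_tokens=139_000,   # ~139,000 output tokens per task
    tokens_per_second=91,    # ~91 output tokens per second
    tool_minutes=3.1,        # ~3.1 minutes of tool execution per task
)
tool_share = 3.1 / fable_minutes

print(f"Estimated time per task: {fable_minutes:.1f} min")
print(f"Tool execution share:    {tool_share:.1%}")
```

With these rounded inputs the sum comes to about 28.6 minutes, in line with the "around 28.5 minutes" quoted. Tool execution makes up roughly 11% of that estimate, close to the ~12% share the post reports across the benchmark.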
#benchmark #models #analysis