返回精选
AI 精选动态 智能评分 60

AA-Briefcase: 长期知识工作模型基准发布

来源: twitter关注列表
作者: Artificial Analysis (@ArtificialAnlys)
发布于: 2026-06-18
收录于: 2026-06-18
AI 推荐理由
该基准聚焦多周、高复杂性任务,与常见单轮测试差异大,揭示了模型在真实混乱环境中的表现差距和成本权衡。
核心解读
Artificial Analysis 发布 AA-Briefcase 基准测试,评估模型在长期知识工作项目中的表现。Claude Fable 5 以 1587 Elo 领先,成本约 $31/任务;紧随其后的是 Claude Opus 4.8(1356 Elo, $10.40)和 GLM-5.2(1266 Elo, $2.40)。仅 3% 任务中顶级模型满足所有标准,表明现实复杂度仍是挑战。
全文
Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with many linked tasks and thousands of input source files. We evaluated Claude Fable 5 from @AnthropicAI before it became unavailable, and it currently leads with an Elo score of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and the recently-released GLM 5.2 (max, 1266) from @Zai_org. Claude Fable 5 cost $31 on average to run each AA-Briefcase task, followed by Claude Opus 4.8 at $10.40, GPT-5.5 (xhigh) at $3.68 and GLM-5.2 (max) at $2.40. AA-Briefcase comprises four private scenarios, each representing a multi-week knowledge work project set in a realistic organizational context. A public fifth scenario has been released via @huggingface as a representation of scenario structure, submission, and grading (AA-Briefcase Lite). This does not count toward official AA-Briefcase results, and is demonstrative only. Key elements of AA-Briefcase: ➤ Realistic long-horizon projects: AA-Briefcase moves beyond single, disconnected prompts by evaluating models across a coherent long-horizon project. Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups ➤ Large volumes of fragmented context: AA-Briefcase requires models to reason across thousands of inputs, including company documents, meeting transcripts, large-scale data exports, 25,000+ Slack messages and 3,500+ emails. These sources are fragmented, messy, and often contain realistic contradiction, testing whether models can navigate the ambiguity of real-world knowledge work ➤ Composite rubric and pairwise grading: AA-Briefcase combines binary rubric checks for ground-truth correctness with pairwise grading on analytical quality and presentation quality. Unlike many evaluations that focus on a single metric, AA-Briefcase tests agentic capabilities more comprehensively, exposing cases where models produce outputs that look polished but are incorrect or lack analytical rigor ➤ Built by industry experts: AA-Briefcase scenarios mirror real-world knowledge work, with tasks developed over months by experts across data science, product management and corporate strategy from companies including Google, McKinsey & Company and BCG. Task challenges are drawn from professional experience, making AA-Briefcase more reflective of the ambiguity, messy context and competing priorities that define real-world knowledge work Key results: ➤ Claude Fable 5 leads AA-Briefcase at 1587 Elo: This is followed by Claude Opus 4.8 (1356) with the next-best non-Anthropic model, GLM-5.2 (max), ~90 points back at 1266. Note that Claude Fable 5 did not use the Opus 4.8 fallback for any task in AA-Briefcase ➤ Cost per task varies by ~800x across models tested: Claude Fable 5 leads the benchmark but costs more than $31 per task on average, compared to ~$0.04 for DeepSeek V4 Flash (max). The strongest price/performance options are open weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost ➤ Real-world complexity remains difficult for models: The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of AA-Briefcase tasks. On 31 of 91 tasks, no model scores above 50% on the rubric criteria ➤ Task difficulty scales with the number of required input files: For each rubric check, we identify the set of source files needed to pass. Across all models, pass rates fall as this file count increases, though top-tier models degrade less than weaker models More details below in thread ⬇️ ![photo](https://pbs.twimg.com/media/HLIVpu5bwAA364R.jpg) Artificial Analysis (@ArtificialAnlys): Sample AA-Briefcase outputs from a single commercial due diligence task show how far model quality varies. Claude Fable 5 (currently unavailable) produces a clean and structured market map with footnoted figures and flagged uncertainties; GPT-5.5 (xhigh) creates a relatively detailed document albeit some layout issues; Gemini 3.1 Pro generates a very sparse diagram without supplemental analysis Artificial Analysis (@ArtificialAnlys): See our launch article for further details and analysis: https://t.co/jiQiwT7ZLY Full results on our website: https://t.co/RgkI2BmI6R
#基准测试#模型#行业动态