AI Featured Updates · Smart score: 68

GLM-5.2: Open-Weights Model Ranks Third on Agentic Evaluation

Source: Twitter following list
Author: Artificial Analysis (@ArtificialAnlys)
Published: 2026-06-22
Collected: 2026-06-22
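Why the AI recommends this
GLM-5.2 outperforms most proprietary models on a benchmark of real-world work tasks while remaining inexpensive. The concrete task examples in the original post are worth reviewing to judge how useful it is in practice.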
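Key points
According to Artificial Analysis, GLM-5.2 scores 1524 Elo on GDPval-AA, a benchmark of real-world agentic work. That places it third overall, behind only Claude Fable 5 (1783) and Claude Opus 4.8 (1615) and roughly level with GPT-5.5 (xhigh, 1509). It is the highest-scoring open-weights model, 116 points ahead of the next open model, MiniMax-M3 (1408), and it also outscores proprietary models such as Gemini 3.5 Flash (1357). The model is priced at $1.40/$4.40 per 1M input/output tokens.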
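The score gaps quoted above can be checked with a few lines of arithmetic. The sketch below recomputes the gaps from the reported ratings and, as an illustration only, converts each gap into an expected head-to-head score with the textbook Elo formula. The post does not say which Elo variant or scale GDPval-AA uses, so the base-10 logistic curve with a 400-point scale is an assumption, and the helper names `expected_score` and `gaps_vs` are hypothetical.

```python
# Elo ratings on GDPval-AA as reported in the post.
RATINGS = {
    "GLM-5.2": 1524,
    "Claude Fable 5": 1783,
    "Claude Opus 4.8": 1615,
    "GPT-5.5 (xhigh)": 1509,
    "MiniMax-M3": 1408,
}


def expected_score(rating_a, rating_b, scale=400.0):
    """Textbook Elo expected score of A against B.

    ASSUMPTION: base-10 logistic curve with a 400-point scale; the post
    does not state how GDPval-AA ratings map to win probabilities.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def gaps_vs(model, ratings=RATINGS):
    """Rating gap between `model` and every other model (positive = ahead)."""
    base = ratings[model]
    return {name: base - r for name, r in ratings.items() if name != model}


if __name__ == "__main__":
    for name, gap in gaps_vs("GLM-5.2").items():
        p = expected_score(RATINGS["GLM-5.2"], RATINGS[name])
        print(f"GLM-5.2 vs {name}: gap {gap:+d}, expected score {p:.2f}")
```

Under that assumed formula, the 15-point gap to GPT-5.5 (xhigh) corresponds to an expected score of about 0.52, which is consistent with describing the two models as level, while the 116-point gap to MiniMax-M3 corresponds to roughly 0.66.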
Full text
GLM-5.2 leads open weights models and sits at #3 overall on GDPval-AA, a real-world agentic work benchmark

GLM-5.2 from @Zai_org scores 1524 Elo on GDPval-AA, which measures performance on real-world, economically valuable knowledge work through long-horizon, multi-turn tasks.

Key takeaways:
➤ #3 overall, behind only Claude Fable 5 (1783) and Claude Opus 4.8 (1615), and level with GPT-5.5 (xhigh, 1509)
➤ The leading open weights model by a wide margin: the next open model, MiniMax-M3, scores 1408
➤ Ahead of many proprietary models, including Google's Gemini 3.5 Flash (1357), Qwen 3.7 Max (1289), Muse Spark (1158)
➤ The tasks are agentic. GLM-5.2 averaged ~31 turns per task across 1,999 matches
➤ Consistent with the rest of its launch, GLM-5.2 also leads open weights on the Artificial Analysis Intelligence Index, ranks #3 on the Agentic Index, and #3 on AA-Briefcase

![photo](https://pbs.twimg.com/media/HLb9CLEbsAA5rWl.jpg)

Artificial Analysis (@ArtificialAnlys): The pattern holds on AA-Briefcase, our latest agentic knowledge work eval: GLM-5.2 is again the top open weights model, ahead of GPT-5.5 (xhigh) and behind only Claude Fable 5. For an open weights model priced at $1.40/$4.40 per 1M input/output tokens to rank alongside the proprietary frontier on agentic work is a real step for open models. https://t.co/Y55fgUEoaJ

Artificial Analysis (@ArtificialAnlys): GDPval-AA spans real professional and creative work. We gave GLM-5.2 and three proprietary frontier models, Claude Fable 5, GPT-5.5, and Gemini 3.5 Flash, the same briefs, and rendered each deliverable exactly as produced:
➤ A daily task list for a retail supervisor
➤ An IEC emergency-stop circuit schematic
➤ A moodboard for an orchestral ballad music video
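
To make the quoted pricing concrete, the sketch below turns the $1.40/$4.40 per 1M input/output token rates into a per-task cost. The post reports about 31 turns per task but gives no token counts, so the token figures in the usage example are purely hypothetical, as is the helper name `task_cost_usd`.

```python
# Prices quoted in the post, in USD per 1M tokens.
INPUT_PRICE_PER_M = 1.40
OUTPUT_PRICE_PER_M = 4.40


def task_cost_usd(input_tokens, output_tokens):
    """Cost in USD of one task, given its total input and output tokens."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


if __name__ == "__main__":
    # HYPOTHETICAL workload: the post does not report tokens per task.
    example_input_tokens = 500_000
    example_output_tokens = 50_000
    cost = task_cost_usd(example_input_tokens, example_output_tokens)
    print(f"Estimated cost for the hypothetical task: ${cost:.2f}")
```

Because input tokens usually dominate long multi-turn agentic runs, the input rate tends to matter more than the output rate for this kind of estimate; real costs depend on actual token usage, which would need to be measured.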
#models #benchmarks #open-source