AI Curated Feed
Smart score: 65
GLM-5.2 matches Claude Opus 4.8 on the CritPt benchmark
Why the AI recommends this
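GLM-5.2 catches up with leading proprietary models on a hard physics benchmark, a 4.5× jump over its predecessor and an important milestone for the scientific reasoning of open weights models. The original post and any follow-up analysis are worth reading.
Key analysis
Z ai's GLM-5.2 (max reasoning effort) scores 20.9% on the CritPt benchmark, tying Claude Opus 4.8. That is far ahead of other open weights models (DeepSeek V4 Pro scores 12.9%) and also ahead of proprietary models such as GPT-5.5 and Gemini 3.1 Pro. Compared with the 4.6% that GLM-5.1 scored ten weeks ago, it is a 4.5× jump. CritPt was developed jointly by Argonne and UIUC; its answers are kept private, and models are evaluated independently by Artificial Analysis.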
Full text
A standout number in Z ai’s GLM-5.2 launch is CritPt, a benchmark of unpublished research-level physics problems, where it ties with Claude Opus 4.8 and is well above other open weights models.
Key takeaways:
➤ @Zai_org’s GLM-5.2 (max reasoning effort) leads open weights by a wide margin: the next open model, DeepSeek V4 Pro, scores 12.9%
➤ GLM-5.2 matches Claude Opus 4.8 (20.9%) and beats several proprietary models, including GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7
➤ Only proprietary models score higher, with GPT-5.5 Pro topping the benchmark at 30.6%
➤ A 4.5× generational jump: GLM-5.1 scored just 4.6% on CritPt ten weeks ago
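
The arithmetic behind the takeaways above can be checked with a minimal sketch. It uses only the scores quoted in the post; the dictionary layout and variable names are illustrative assumptions, not part of any Artificial Analysis tooling.

```python
# CritPt scores in percent, exactly as quoted in the post.
# The dictionary and the names below are illustrative assumptions.
scores = {
    "GPT-5.5 Pro": 30.6,
    "GLM-5.2": 20.9,
    "Claude Opus 4.8": 20.9,
    "DeepSeek V4 Pro": 12.9,
    "GLM-5.1": 4.6,
}

# Generational jump: GLM-5.2 score divided by GLM-5.1 score.
generational_jump = scores["GLM-5.2"] / scores["GLM-5.1"]

# Lead over the next open weights model, in percentage points.
open_weights_lead = scores["GLM-5.2"] - scores["DeepSeek V4 Pro"]

# Remaining gap to the top-scoring model, in percentage points.
gap_to_top = scores["GPT-5.5 Pro"] - scores["GLM-5.2"]

print(f"Generational jump: {generational_jump:.1f}x")
print(f"Lead over next open model: {open_weights_lead:.1f} points")
print(f"Gap to top model: {gap_to_top:.1f} points")
```

Note that the 4.5× figure is a ratio of scores, while the lead and the gap are differences in percentage points, so the two kinds of numbers are not directly comparable.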

Artificial Analysis (@ArtificialAnlys): Context on the result: CritPt is hard. It focuses on frontier physics problems developed by Argonne and UIUC through contributions from 60+ researchers globally, with the answer key and grading kept private. Models are independently benchmarked by Artificial Analysis.
Even the highest-scoring model, GPT-5.5 Pro, solves under a third of the problems.
For an open weights model to approach leading proprietary models is a real marker of progress for open models on scientific reasoning.
https://t.co/Y55fgUEoaJ