AI Picks feed · Smart score: 62

GLM-5.2 Mobile App Development Evaluation and Prompt Methodology

Source: Twitter following list

Author: Z.ai (@Zai_org)

Published: 2026-06-22

Collected: 2026-06-22

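Why the AI recommends this

The post lays out a concrete prompt architecture that breaks app development into five polish rounds and pairs it with numeric constraints (for example, spring response 0.3-0.4), which makes it a practical engineering reference.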

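Key takeaways

Zixuan Li shared a prompt architecture for app development made up of a PRD-style task description and a five-round polish process, intended to raise output quality through quantified constraints and asset audits. In an internal benchmark of 35 mobile development tasks, each run twice for 70 trials in total, GLM-5.2 completed 48/70. That is more than double GLM-5.1's 21/70 and below Claude Fable 5's 56/70.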
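The arithmetic behind those figures can be checked with a short sketch. The scores and the 35 x 2 trial count come from the post; the variable and function names are illustrative assumptions only.

```python
from fractions import Fraction

# 35 tasks, each run twice, as stated in the quoted post.
TRIALS = 35 * 2

# Completed trials per model, as reported in the post.
completed = {"GLM-5.1": 21, "GLM-5.2": 48, "Claude Fable 5": 56}


def completion_rate(passed: int, trials: int = TRIALS) -> Fraction:
    """Exact completion rate as a fraction of all trials."""
    return Fraction(passed, trials)


rates = {name: completion_rate(n) for name, n in completed.items()}

# Ratio of GLM-5.2 to GLM-5.1 completions: 48/21 = 16/7.
improvement = rates["GLM-5.2"] / rates["GLM-5.1"]

for name, rate in rates.items():
    print(f"{name}: {float(rate):.1%}")   # 30.0%, 68.6%, 80.0%
print(f"GLM-5.2 vs GLM-5.1: {float(improvement):.2f}x")  # 2.29x
```

The ratio of about 2.29 is consistent with the "more than twofold" wording, and GLM-5.2 remains 8 trials (roughly 11 percentage points) behind Claude Fable 5.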

Full text

Z.ai (@Zai_org) reposted a post by Zixuan Li (@ZixuanLi_):

Here is the prompt method behind this AR try-on app. The trick is not a magic prompt. It is the architecture of the prompt, and it works across GLM-5.2 and other frontier models. Full prompt: https://t.co/4EJxWlNw0G

The prompt has two parts. First, a task description. You write this fresh for each app to define the business logic. Second, a five-round polish process: Round 1 through Round 5. The structure is fixed and reusable across any app, but the specific content of each Round is tailored to the app at hand.

The flow is simple. The task description builds the functional skeleton, then the five Rounds run in sequence to refine it into something that looks like a finished product.

Why split it this way? Single-pass generation always prioritizes "it runs" over "it looks good." So it's better not to chase one perfect prompt. Divide the work. The business description makes it function. The five Rounds make it look like a real product. The polish is a reusable pipeline, not something you reinvent every time, even though you fill in app-specific details each time.

How to write the task description:

- Treat it as a real PRD and engineering spec, not a user wishlist. Include the tech stack, information architecture, module specs, API integration, data model, and acceptance criteria.
- Declare autonomy at the top. State that the model should not ask questions, not stop early, and verify its own work. Otherwise it will pause to ask and break the long task.
- Write the fallback paths explicitly. Cover unsupported devices, older OS versions, and offline states. If you skip this, the model improvises at the edges and crashes.
- Number your acceptance criteria. Each should be independently verifiable, for example "tap a product and the look changes within 0.5 seconds."

The principles behind the five Rounds:

- Quantify "good" into numbers. Models execute poorly on adjectives and precisely on constraints. Use spring response 0.3 to 0.4, button scale 0.95 to 1.0, at most 5 font sizes, and sound effects under 200ms. These principles stay constant, even as the exact targets shift per app, which is why the structure can stay fixed.
- List what is forbidden. Models cut corners in predictable ways, such as gray placeholders, solid color blocks, and spinners. Name them directly with "DO NOT" and provide an acceptable fallback.
- Inventory before fixing. Each Round follows the same loop: audit every asset, verify it is not a placeholder, replace, amplify, and re-screenshot to confirm.
- Strip the "tutorial" feel. AI output gives itself away with faker text, .test links, and emoji-only empty states. The final Round removes these.

> **Quoted original post by Zixuan Li (@ZixuanLi_):**
> GLM-5.2 delivers a substantial leap in app development capabilities, which also represent demanding long-horizon tasks.
>
> Results:
> - GLM-5.1: 21/70
> - GLM-5.2: 48/70
> - Claude Fable 5: 56/70
>
> That's more than a twofold improvement from GLM-5.1 to GLM-5.2.
>
> These come from an internal benchmark of 35 challenging mobile development tasks, each run twice for a total of 70 trials. We measured task completion, defined as core features working without major issues.
>
> https://x.com/ZixuanLi_/status/2067803136283005393
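The two-part structure described in the post (a fresh task description followed by five fixed Rounds) could be assembled along the following lines. This is a minimal sketch and not the author's actual prompt, which is behind the link above. The `PolishRound` class, the `build_prompt` function, the round focus names, the placeholder task text, and the fallback wording are all hypothetical. Only the numeric targets, the forbidden items, the audit loop, and the sample acceptance criterion are taken from the post.

```python
from dataclasses import dataclass, field

# Wording adapted from the post's advice to declare autonomy at the top.
AUTONOMY = "Do not ask questions, do not stop early, and verify your own work."

# The loop the post says every Round follows.
AUDIT_LOOP = ("Audit every asset, verify it is not a placeholder, replace, "
              "amplify, and re-screenshot to confirm.")


@dataclass
class PolishRound:
    """One polish round; the structure is fixed, the content is app-specific."""
    focus: str                                             # hypothetical label
    constraints: list[str] = field(default_factory=list)   # numeric targets
    forbidden: list[str] = field(default_factory=list)     # shortcuts to ban
    fallback: str = ""                                      # acceptable alternative


def build_prompt(task_description: str,
                 acceptance_criteria: list[str],
                 rounds: list[PolishRound]) -> str:
    """Join the task description and the five Rounds into one prompt string."""
    if len(rounds) != 5:
        raise ValueError("the method described in the post uses five Rounds")
    parts = [AUTONOMY, "", task_description.strip(), "", "Acceptance criteria:"]
    # Numbered so that each criterion can be verified independently.
    parts += [f"{i}. {c}" for i, c in enumerate(acceptance_criteria, start=1)]
    for n, rnd in enumerate(rounds, start=1):
        parts += ["", f"Round {n}: {rnd.focus}", AUDIT_LOOP]
        parts += [f"- {c}" for c in rnd.constraints]
        for item in rnd.forbidden:
            line = f"- DO NOT use {item}."
            if rnd.fallback:
                line += f" Fallback: {rnd.fallback}"
            parts.append(line)
    return "\n".join(parts)


# Round names and the fallback text below are illustrative assumptions.
rounds = [
    PolishRound("Motion", constraints=["spring response 0.3 to 0.4",
                                       "button scale 0.95 to 1.0"]),
    PolishRound("Typography", constraints=["at most 5 font sizes"]),
    PolishRound("Sound", constraints=["sound effects under 200ms"]),
    PolishRound("Assets",
                forbidden=["gray placeholders", "solid color blocks", "spinners"],
                fallback="use a real bundled asset instead."),
    PolishRound("Strip the tutorial feel",
                forbidden=["faker text", ".test links", "emoji-only empty states"]),
]

prompt = build_prompt(
    "Build an AR try-on app. (Placeholder: tech stack, information "
    "architecture, module specs, API integration, data model, fallback paths.)",
    ["Tap a product and the look changes within 0.5 seconds."],
    rounds,
)
print(prompt)
```

Keeping the Rounds as data rather than free text reflects the post's point that the pipeline is reusable: only the `rounds` list and the task description change from app to app.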

#LLM #Benchmark #Analysis
