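AI Curated Updates
Smart score: 60
Alibaba Cloud open-sources Qwen-AgentWorld
AI recommendation rationale
Key difference: the model is trained natively on environment modeling rather than adapted after the fact, and both the full model and the benchmark are open-sourced, offering a new path for strengthening agents through world modeling.
Key takeaways
Alibaba Cloud has open-sourced Qwen-AgentWorld-35B-A3B (MoE, 35B/3B active, 256K context) and AgentWorldBench. Environment modeling is the model's native training objective, and it can simulate 7 kinds of agent environments. It outperforms Claude Opus 4.8 and GPT-5.4 on AgentWorldBench and improves results on 7 benchmarks (for example Terminal-Bench 2.0 +6.3, SWE-Bench +3.4, WideSearch +12.8, Claw-Eval +11.3), and its predictive ability transfers to tool-calling tasks without agent-specific fine-tuning.
Full text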
📣📣 Meet Qwen-AgentWorld — a native language world model that simulates 7 agent environments (MCP, Search, Terminal, SWE, Web, OS, Android) within a single model. Environment modeling is the training objective from day one, not a post-hoc adaptation.
🤔 LLMs are trained to be better agents — better at acting in environments. But nobody has trained them to model the environments themselves.
🗺️ Our roadmap: investigate how language world modeling can push the boundaries of general agent capabilities, along two routes:
1️⃣ Build a foundation model for environment simulation — outperforming Claude Opus 4.8 and GPT-5.4 on AgentWorldBench
2️⃣ Investigate how world modeling enhances agent training:
🔬 Controllable Sim RL (agentic RL with a language world model, LWM, as the environment) surpasses training in real environments
🧠 Learning to predict environments (LWM warm-up) makes agents stronger — remarkably, even without any agent-specific training, this predictive knowledge transfers to agentic tasks with zero fine-tuning
🔗 Model Studio: https://t.co/TY0rOHbxza
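
To make "agentic RL with LWM as environments" concrete, here is a minimal sketch of the general idea: a text predictor stands in for the real system and returns the observation an agent would see after each action. Everything in it is an assumption for illustration. `SimulatedEnv`, `Predictor`, and `toy_terminal_predictor` are hypothetical names, the rule-based predictor is only a placeholder for a language world model call, and none of this reflects the actual Qwen-AgentWorld interface or training setup.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signature: (history of (action, observation), new action) -> observation text.
Predictor = Callable[[List[Tuple[str, str]], str], str]


@dataclass
class SimulatedEnv:
    """Lets an agent step through a text predictor as if it were an environment."""
    predict: Predictor
    history: List[Tuple[str, str]] = field(default_factory=list)

    def step(self, action: str) -> str:
        # The predictor, not a real system, produces the observation.
        observation = self.predict(self.history, action)
        self.history.append((action, observation))
        return observation


def toy_terminal_predictor(history: List[Tuple[str, str]], action: str) -> str:
    """Placeholder for a language world model; a real one would be a model call."""
    if not action.strip():
        return ""
    if action.startswith("echo "):
        return action[len("echo "):]
    if action == "history":
        # Replay the actions taken so far, one per line.
        return "\n".join(past_action for past_action, _ in history)
    return f"sh: command not found: {action.split()[0]}"


env = SimulatedEnv(predict=toy_terminal_predictor)
print(env.step("echo hello"))  # hello
print(env.step("history"))     # echo hello
```

Because the environment is just a function of the history and the action, it can be reset, branched, or scripted freely, which is one plausible reading of "controllable" in the post.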

Alibaba Cloud (@alibaba_cloud): We open-source Qwen-AgentWorld-35B-A3B (MoE, 35B/3B active, 256K context) and AgentWorldBench.
Two routes, one roadmap:
🔬 Build the simulator — scalable, controllable, surpassing real environments
🧠 Internalize world modeling — predict before you act
Qwen-AgentWorld is our attempt to investigate how language world modeling can further expand the boundaries of general agent capabilities.
Go build on it 🏃🏃‍♂️
🔗 Model Studio: https://t.co/TY0rOHbxza
Alibaba Cloud (@alibaba_cloud): 🧠 Paradigm II — Agent Foundation Model: world modeling as agent capability.
Single-turn, non-agentic environment prediction → tested directly on multi-turn, tool-calling agent tasks. No agentic RL, no task-specific tuning.
Gains across 7 benchmarks, including 3 entirely out-of-domain:
- In-domain: Terminal-Bench 2.0 +6.3, SWE-Bench +3.4, WideSearch +12.8
- Out-of-domain: Claw-Eval +11.3, QwenClawBench +9.7, BFCL v4 +9.0
World modeling internalizes "predict before you act" as a transferable reasoning pattern.
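
As a quick check on the figures above, the listed gains can be averaged per group. This is only arithmetic on the six numbers quoted in the post. The post mentions 7 benchmarks but names six, so the seventh is left out here, and the unit of the gains (presumably benchmark points) is an assumption.

```python
from statistics import mean

# Gains exactly as quoted in the post; the seventh benchmark is not named there.
in_domain = {"Terminal-Bench 2.0": 6.3, "SWE-Bench": 3.4, "WideSearch": 12.8}
out_of_domain = {"Claw-Eval": 11.3, "QwenClawBench": 9.7, "BFCL v4": 9.0}


def mean_gain(gains: dict) -> float:
    """Average gain across a group of benchmarks, rounded to two decimals."""
    return round(mean(gains.values()), 2)


print(mean_gain(in_domain))      # 7.5
print(mean_gain(out_of_domain))  # 10.0
```

On the quoted numbers, the out-of-domain group averages slightly higher than the in-domain group, although with three benchmarks per group this is a rough comparison at best.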