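AI Curated Updates
Smart score: 65
View-planning research raises VLM success rate
Why the AI recommends this
The RL-Graph-SFT framework raises the VLM view-planning success rate from 2.5% to 47.8%.
Key takeaways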
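The research team introduces ViewSuite, which provides 6-DoF camera control and about 165K task instances, and evaluates three tasks: Path-to-View, View-to-Path, and Interactive View Planning. They find that models can roughly track how a camera action changes the view but cannot compose a plan toward a target view. Training Qwen2.5-VL-7B with RL alone yields only a 2.5% success rate; with View Graph Distillation (the RL-Graph-SFT framework), the success rate rises to 47.8%.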
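To make "6-DoF camera control" and "composing a plan" concrete, here is a minimal sketch of a camera pose updated by a sequence of moves. It is an illustration under stated assumptions, not ViewSuite's actual interface: the action names (`tx`, `ty`, `tz`, `rx`, `ry`, `rz`), the `CameraPose` class, and the `step` and `rollout` helpers are all hypothetical, and the axis conventions are chosen only for this example.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]
Mat3 = tuple[Vec3, Vec3, Vec3]

IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def mat_vec(m: Mat3, v: Vec3) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def mat_mul(a: Mat3, b: Mat3) -> Mat3:
    """Multiply two 3x3 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def axis_rotation(axis: str, degrees: float) -> Mat3:
    """Right-handed rotation about one coordinate axis."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    if axis == "x":
        return ((1.0, 0.0, 0.0), (0.0, c, -s), (0.0, s, c))
    if axis == "y":
        return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))
    if axis == "z":
        return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))
    raise ValueError(f"unknown axis: {axis!r}")


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical 6-DoF pose: 3 translational + 3 rotational degrees of freedom."""
    position: Vec3 = (0.0, 0.0, 0.0)
    rotation: Mat3 = IDENTITY  # camera-to-world rotation


def step(pose: CameraPose, action: str, amount: float) -> CameraPose:
    """Apply one move expressed in the camera's own frame (assumed action names)."""
    if len(action) != 2 or action[1] not in "xyz":
        raise ValueError(f"unknown action: {action!r}")
    kind, axis = action[0], action[1]
    if kind == "t":
        # Translate along a camera axis, converted to world coordinates.
        local = tuple(amount if a == axis else 0.0 for a in "xyz")
        delta = mat_vec(pose.rotation, local)
        moved = tuple(p + d for p, d in zip(pose.position, delta))
        return CameraPose(moved, pose.rotation)
    if kind == "r":
        # Rotate about a camera axis; the position is unchanged.
        turned = mat_mul(pose.rotation, axis_rotation(axis, amount))
        return CameraPose(pose.position, turned)
    raise ValueError(f"unknown action: {action!r}")


def rollout(pose: CameraPose, plan: list[tuple[str, float]]) -> CameraPose:
    """Compose a multi-step plan by applying each move in order."""
    for action, amount in plan:
        pose = step(pose, action, amount)
    return pose


turn_then_move = rollout(CameraPose(), [("ry", 90), ("tz", 1)])
move_then_turn = rollout(CameraPose(), [("tz", 1), ("ry", 90)])
```

In this sketch the same two moves end at different positions depending on their order, roughly (1, 0, 0) versus (0, 0, 1). That order dependence is one reason predicting the effect of a single move is easier than composing many moves toward a target view.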
Full text
Fei-Fei Li (@drfeifei) reposted a post by Manling Li (@ManlingLi_):
Planning with the views:
Can VLMs predict how each camera move changes the view, and plan many such moves ahead?
We introduce ViewSuite with 6 DoF camera control and ~165K task instances, testing:
Path-to-View
View-to-Path
Interactive View Planning
A sharp Planning Gap emerges:
+ can roughly "track" how camera action changes views
- cannot "compose" a plan towards a target view at all
We then try to teach VLMs with Reinforcement Learning.
- RL cannot teach VLMs such planning ability, only 2.5% success rate with Qwen2.5-VL-7B.
+ With View Graph Distillation (our RL-Graph-SFT framework), 2.5% → 47.8%
Below, we answer these questions:
Q1. What are the failure modes?
Q2. How can we make RL work?
Q3. What has the model learned? Can we open up the model to see before/after? Can such spatial priors transfer to other view related tasks?
Led by @James_KKW, great to work with @LINJIEFUN @zhengyuan_yang @shiqi_chen17 @wzenus @drfeifei @jiajunwu_cs Leonidas Guibas, Lijuan Wang.
A joint effort with @StanfordAILab @StanfordSVL @MSFTResearch.
https://video.twimg.com/amplify_video/2067696956114407424/vid/avc1/1920x1080/qdXJ7UpSfpqPlUDJ.mp4?tag=28