AI Featured Updates | Smart score: 60

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

Source: Twitter following list
Author: Rohan Paul (@rohanpaul_ai)
Published: 2026-06-27
Collected: 2026-06-27
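Why the AI recommends this
Worth opening the original post for the details of RiVER's rank-weighting mechanism, which is a useful reference for RL training when no reference answer exists.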
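Key takeaways
The paper proposes RiVER, a method that improves an LLM's coding behavior by running reinforcement learning on coding problems with no known correct answer. Trained on 12 AtCoder Heuristic Contest tasks, it improved both score-based contest performance and pass-or-fail coding benchmarks.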
Full text
LLMs can learn better coding behavior from problems with no known answers.

Many real problems do not have a gold solution waiting in a database, especially in optimization, where the best answer may be unknown, expensive, or impossible to certify. Normal reinforcement learning works well when it can check a clear right answer, but that breaks down when the best answer is unknown.

The paper’s method, called RiVER, lets the model write several programs, runs them on the same hidden tests, and rewards the programs that perform better than the others.

The key trick is that RiVER does not trust raw scores directly, because some test cases naturally produce much bigger numbers and can distort training. Instead, it ranks programs within each test case, gives extra weight to the best one, and still gives smaller graded feedback to other valid programs.

The authors trained models on 12 AtCoder Heuristic Contest tasks, and RiVER improved both score-based contest performance and normal pass-or-fail coding benchmarks.

----

Link: arxiv.org/abs/2606.27369

Title: "Reinforcement Learning without Ground-Truth Solutions can Improve LLMs"

![photo](https://pbs.twimg.com/media/HLzUx10boAA7-4Q.jpg)
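
The per-test ranking idea described in the post can be illustrated with a rough sketch. This is not the paper's actual formula: the function name `rank_rewards`, the `top_bonus` value, the linear rank scale, the tie handling, and the "higher score is better" convention are all assumptions made for illustration. The post only states that programs are ranked within each test case, the best one gets extra weight, and other valid programs get smaller graded feedback.

```python
from typing import Optional, Sequence


def rank_rewards(
    scores: Sequence[Sequence[Optional[float]]],
    top_bonus: float = 0.5,  # hypothetical extra weight for the best program
) -> list[float]:
    """Turn raw scores into rank-based rewards, one reward per program.

    scores[p][t] is program p's raw score on test case t, or None if the
    program was invalid on that test. Higher is assumed to be better.
    """
    n_programs = len(scores)
    n_tests = len(scores[0])
    totals = [0.0] * n_programs

    for t in range(n_tests):
        # Only valid programs take part in the ranking for this test case.
        valid = [(scores[p][t], p) for p in range(n_programs)
                 if scores[p][t] is not None]
        if not valid:
            continue
        best = max(s for s, _ in valid)
        for s, p in valid:
            # Graded feedback: fraction of valid programs matched or beaten.
            matched_or_beaten = sum(1 for other, _ in valid if other <= s)
            reward = matched_or_beaten / len(valid)
            # Extra weight for the top program on this test case.
            if s == best:
                reward += top_bonus
            totals[p] += reward

    # Invalid runs contribute 0, so they lower a program's average.
    return [total / n_tests for total in totals]


# Three candidate programs, two test cases with very different score scales.
raw = [
    [10.0, 1_000_000.0],
    [12.0, 900_000.0],
    [None, 950_000.0],  # invalid on the first test case
]
rewards = rank_rewards(raw)

# Rescaling one test case's scores leaves the rewards unchanged.
rescaled = [[row[0], row[1] * 1000] for row in raw]
rescaled_rewards = rank_rewards(rescaled)
```

Because each reward depends only on the ordering within a test case, the second test case's much larger numbers cannot dominate the first, which is the distortion the post says raw scores would cause.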
#TechBreakthrough #Models #Research