AI Curated Updates
Smart score: 60
A Study of LLM Hallucination Rates in Document Q&A
Why the AI recommends this
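The study reports concrete hallucination-rate figures and shows that long contexts make hallucination markedly worse. The original paper is worth reading for the full experimental setup.
Key takeaways
A 172-billion-token study measured how often LLMs hallucinate in document Q&A. The best model fabricated answers 1.19% of the time at 32K context, strong models typically landed at 5% to 7%, the mid-range model came in at about 25%, and at 200K context every tested model fabricated at least 10% of the time.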
Full text
Rohan Paul (@rohanpaul_ai) reposted a post by Gary Marcus (@GaryMarcus):
Crazy how many times people have told me over the last five years that a solution to hallucinations was right around the corner — and yet here we still are.
> **Quoted original post by Rohan Paul (@rohanpaul_ai):**
> This study tests how often LLMs invent answers when they should rely only on supplied documents.
> The problem is that companies often use LLMs to answer questions from documents and they assume document-based LLM systems are safer because the model is given source material.
> This study shows that no model fully avoided fabrication, because even the best model made up answers 1.19% of the time at 32K context.
> For strong models, a more normal best-case rate was around 5% to 7%, while the middle model fabricated about 25% of answers to questions about facts that did not exist.
> Longer context made the problem much worse, and at 200K context every tested model fabricated at least 10% of the time.
> Shows that hallucination is not just a failure to retrieve the right sentence.
> A model can be good at finding real facts and still be too willing to answer when the requested fact is absent.
> ----
> Link – arxiv.org/abs/2603.08274
> Title: "How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms"
> https://x.com/rohanpaul_ai/status/2070193617554264542
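
The fabrication rate described in the quoted post can be illustrated with a minimal sketch: it is the share of questions about absent facts that a model answers instead of declining. The `Record` format, the `is_abstention` helper, and the refusal phrases below are hypothetical assumptions for illustration. The post does not describe how the study actually scores responses.

```python
from dataclasses import dataclass

# Hypothetical refusal phrases; the study's real scoring rules are not given in the post.
ABSTENTION_MARKERS = (
    "not in the document",
    "not mentioned",
    "cannot find",
    "no information",
)


@dataclass
class Record:
    question: str
    answerable: bool  # True if the supplied documents contain the requested fact
    response: str     # the model's reply


def is_abstention(response: str) -> bool:
    """Treat a reply as a refusal if it contains any assumed marker phrase."""
    text = response.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)


def fabrication_rate(records: list[Record]) -> float:
    """Fraction of unanswerable questions that received a non-refusal answer."""
    unanswerable = [r for r in records if not r.answerable]
    if not unanswerable:
        return 0.0
    fabricated = sum(1 for r in unanswerable if not is_abstention(r.response))
    return fabricated / len(unanswerable)


# Made-up sample data: one answerable question and four unanswerable ones.
sample = [
    Record("Who signed the contract?", True, "Alice Chen signed it."),
    Record("What was the 2019 budget?", False, "That is not mentioned in the files."),
    Record("Who audited the report?", False, "The report was audited by Deloitte."),
    Record("When did the merger close?", False, "I cannot find that date."),
    Record("What is the penalty clause?", False, "There is no information on this."),
]

rate = fabrication_rate(sample)
print(f"Fabrication rate on unanswerable questions: {rate:.2%}")
```

Answerable questions are excluded from the denominator on purpose. This matches the post's point that a model can be good at finding real facts and still be too willing to answer when the requested fact is absent, so retrieval accuracy and fabrication rate need to be measured separately.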