
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Source: Anthropic-research

Published: 2023-10-05

Indexed: 2026-05-21

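Why AI recommends it

Worth a close look for its method and case studies on decomposing 512 neurons into more than 4,000 features, which gives interpretability research a reproducible analysis framework.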

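Key takeaways

In the paper "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", the Transformer Circuits team proposes that the unit of analysis for a language model may not be the individual neuron but the feature, formed as a linear combination of neuron activations. The authors say they have built a mechanism, applied to a small transformer model, that decomposes a layer of 512 neurons into more than 4,000 features. These features correspond to distinct patterns such as DNA sequences, legal language, HTTP requests, Hebrew text, and nutrition statements. The paper argues that many model properties are invisible when looking only at the activations of individual neurons.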
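To make the decomposition concrete, here is a minimal sketch of one common form of dictionary learning: a sparse autoencoder that maps 512 neuron activations onto a larger set of non-negative features and back. It is an illustration under stated assumptions, not the authors' implementation. The dictionary size of 4096, the L1 coefficient, the random untrained weights, and all function names are hypothetical, and random data stands in for recorded activations.

```python
import numpy as np

rng = np.random.default_rng(0)

N_NEURONS = 512    # width of the layer being decomposed (from the summary above)
N_FEATURES = 4096  # assumed dictionary size; the summary only says "more than 4,000"

# Randomly initialised, untrained parameters. Names are illustrative only.
W_enc = rng.normal(0.0, 0.02, size=(N_NEURONS, N_FEATURES))
b_enc = np.zeros(N_FEATURES)
W_dec = rng.normal(0.0, 0.02, size=(N_FEATURES, N_NEURONS))
b_dec = np.zeros(N_NEURONS)


def encode(acts):
    """Map neuron activations (batch, 512) to non-negative feature activations."""
    return np.maximum(0.0, (acts - b_dec) @ W_enc + b_enc)


def decode(feats):
    """Rebuild neuron activations as a linear combination of dictionary rows."""
    return feats @ W_dec + b_dec


def loss(acts, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    feats = encode(acts)
    recon = decode(feats)
    mse = np.mean(np.sum((acts - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(feats), axis=1))
    return mse + l1_coeff * sparsity


# Random data standing in for activations recorded from a real model.
acts = rng.normal(size=(8, N_NEURONS))
feats = encode(acts)
print(feats.shape, decode(feats).shape, float(loss(acts)))
```

In a trained dictionary of this kind, each row of W_dec would be one feature direction in neuron space, and only a few entries of feats would be nonzero for any given input. That sparsity is what allows a single feature to track a pattern that is spread across many neurons.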

#ResearchBreakthrough #LLM #TechBreakthrough
