Research · 2026-05-12
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
Source: arXiv cs.AI
arXiv:2605.08513v1 (cross-listed)

Abstract: Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons, which gate whether harmful knowledge is expressed, and concept neurons, which encode the harmful knowledge itself. By targeting a single neuron...
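The abstract is truncated, so the paper's actual intervention is not shown here. As a rough illustration of what "targeting a single neuron" can mean mechanistically, the sketch below zeroes out one hidden unit's activation in a toy MLP block using a PyTorch forward hook. Everything in it (the `MLPBlock` module, the `ablate_neuron` helper, and the choice of `neuron_idx=3` as a hypothetical "refusal neuron") is an assumption for demonstration, not the authors' method.

```python
import torch
import torch.nn as nn

# Toy transformer-style MLP block standing in for one layer of an LLM.
# (Illustrative only; not the architecture from the paper.)
class MLPBlock(nn.Module):
    def __init__(self, d_model=16, d_hidden=64):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

def ablate_neuron(block, neuron_idx):
    """Register a forward hook that zeroes one hidden unit's activation.

    Returning a tensor from a forward hook replaces the module's output,
    so every downstream computation sees the ablated activation.
    """
    def hook(module, inputs, output):
        output = output.clone()
        output[..., neuron_idx] = 0.0
        return output
    return block.act.register_forward_hook(hook)

if __name__ == "__main__":
    torch.manual_seed(0)
    block = MLPBlock()
    x = torch.randn(1, 4, 16)  # (batch, seq_len, d_model)

    baseline = block(x)
    handle = ablate_neuron(block, neuron_idx=3)  # hypothetical "refusal neuron"
    ablated = block(x)
    handle.remove()  # restore normal behavior

    print("max |delta| after ablating one neuron:",
          (baseline - ablated).abs().max().item())
```

In practice, work in this vein typically applies such hooks to a specific layer of a pretrained model and compares refusal behavior before and after the intervention; the toy block above only demonstrates the hook mechanics.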
Tags: arxiv, papers, safety