Research · 2026-05-12
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
Source: arXiv cs.AI
arXiv:2605.08513v1 (cross-listed)

Abstract: Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons, which gate whether harmful knowledge is expressed, and concept neurons, which encode the harmful knowledge itself. By targeting a single neuron...
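The abstract is truncated, so the paper's actual intervention is not shown here. As a rough illustration of what "targeting a single neuron" can mean mechanistically, the sketch below zeroes out one hidden unit's activation in a toy MLP block using a PyTorch forward hook. Everything in it (the `MLPBlock` module, the `ablate_neuron` helper, and the choice of `neuron_idx=3` as a hypothetical "refusal neuron") is an assumption for demonstration, not the authors' method.

```python
import torch
import torch.nn as nn

# Toy transformer-style MLP block standing in for one layer of an LLM.
# (Illustrative only; not the architecture from the paper.)
class MLPBlock(nn.Module):
    def __init__(self, d_model=16, d_hidden=64):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

def ablate_neuron(block, neuron_idx):
    """Register a forward hook that zeroes one hidden unit's activation.

    Returning a tensor from a forward hook replaces the module's output,
    so every downstream computation sees the ablated activation.
    """
    def hook(module, inputs, output):
        output = output.clone()
        output[..., neuron_idx] = 0.0
        return output
    return block.act.register_forward_hook(hook)

if __name__ == "__main__":
    torch.manual_seed(0)
    block = MLPBlock()
    x = torch.randn(1, 4, 16)  # (batch, seq_len, d_model)

    baseline = block(x)
    handle = ablate_neuron(block, neuron_idx=3)  # hypothetical "refusal neuron"
    ablated = block(x)
    handle.remove()  # restore normal behavior

    print("max |delta| after ablating one neuron:",
          (baseline - ablated).abs().max().item())
```

In practice, work in this vein typically applies such hooks to a specific layer of a pretrained model and compares refusal behavior before and after the intervention; the toy block above only demonstrates the hook mechanics.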
Tags: arxiv, papers, safety