Research2026-04-22
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Source: Arxiv CS.AI
arXiv:2604.18901v1 Announce Type: cross Abstract: Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families...
arxivpapers