Research2026-05-12
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Source: Arxiv CS.AI
arXiv:2605.10633v1 Announce Type: cross Abstract: Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space,...
arxivpapers