Research · 2026-05-12

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

arXiv:2605.10633v1 (announce type: cross)

Abstract: Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space,...

Read the original article on arXiv (cs.AI)
