Research · 2026-05-01
Characterizing the Consistency of the Emergent Misalignment Persona
Source: arXiv cs.AI
arXiv:2604.28082v1 (announce type: new)
Abstract: Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in...
Tags: arxiv, papers