BeClaude
Research 2026-05-01

Characterizing the Consistency of the Emergent Misalignment Persona

Source: arXiv cs.AI

arXiv:2604.28082v1. Abstract: Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in...
