Research 2026-05-14
Tracing Persona Vectors Through LLM Pretraining
Source: arXiv cs.AI
arXiv:2605.13329v1
Announce Type: cross
Abstract: How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or...