Research | 2026-05-12
Do Linear Probes Generalize Better in Persona Coordinates?
Source: arXiv cs.AI
arXiv:2605.09391v1 (Announce Type: new)
Abstract: It is becoming increasingly necessary to have monitors check for harmful behaviors during language model interactions, but text-only monitoring has not been sufficient. This is because models sometimes exhibit strategic deception and sandbagging,...