Research 2026-05-12

Do Linear Probes Generalize Better in Persona Coordinates?

arXiv:2605.09391v1. Abstract: It is becoming increasingly necessary to have monitors check for harmful behaviors during language model interactions, but text-only monitoring has not been sufficient. This is because models sometimes exhibit strategic deception and sandbagging,...

Read Original Article on Arxiv CS.AI
