Research | 2026-05-11
Steered LLM Activations are Non-Surjective
Source: arXiv cs.AI
arXiv:2604.09839v2
Announce Type: replace
Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating...
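As a rough illustration of what "modifying model activations" can mean in practice, the sketch below adds a scaled direction to a toy matrix of hidden states. It is not taken from the paper: the additive form, the function name `steer`, the strength parameter `alpha`, and the random stand-in activations are all assumptions made for illustration.

```python
import numpy as np


def steer(hidden, direction, alpha):
    """Add a scaled, unit-norm steering direction to every token's activation.

    Hypothetical helper: additive steering is one common form, assumed here.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy activations: 4 tokens, 8 hidden dims
direction = rng.normal(size=8)     # stand-in for a learned steering vector

steered = steer(hidden, direction, alpha=3.0)

# Projection of the change onto the steering direction, one value per token.
unit = direction / np.linalg.norm(direction)
shift = (steered - hidden) @ unit
```

In a real model the same addition would typically be applied to one layer's activations during the forward pass; every token's activation then moves by `alpha` along the chosen direction, whether or not any input text could have produced the resulting vector.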
arxivpapers