Research 2026-05-06
Minimizing Collateral Damage in Activation Steering
Source: arXiv cs.AI
arXiv:2605.01167v1 Announce Type: cross
Abstract: Activation steering is a method for controlling the behavior of a Large Language Model (LLM) by intervening in its internal representations to increase their alignment with a specific target feature direction. However, standard interventions, such as vector...
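The abstract is cut off before it names the standard interventions, so the following is only a generic illustration of what "intervening in internal representations to increase alignment with a target feature direction" can look like. It is not the paper's method. It assumes a simple additive shift of a hidden-state vector along a unit-normalized direction; the function names (`steer`, `alignment`), the toy vectors, and the strength parameter `alpha` are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length; reject the zero vector."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def alignment(hidden, direction):
    """Scalar projection of the hidden state onto the unit feature direction."""
    return dot(hidden, normalize(direction))


def steer(hidden, direction, alpha):
    """Assumed additive intervention: shift hidden by alpha along the unit direction."""
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]


# Toy values standing in for one token's hidden state and a feature direction.
hidden = [1.0, 2.0, 2.0]
direction = [0.0, 3.0, 4.0]

steered = steer(hidden, direction, alpha=2.0)
before = alignment(hidden, direction)
after = alignment(steered, direction)
print(before, after)
```

Under this assumption, the projection onto the target direction grows by exactly `alpha`, while components orthogonal to the direction (the first coordinate here) are left unchanged.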
arxiv papers