Research2026-05-06

Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

arXiv:2602.10437v3 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning...

Read Original Article on Arxiv CS.AI

arxivpapersrl