BeClaude
Research 2026-05-06

Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

Source: arXiv cs.AI

arXiv:2602.10437v3. Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning...

arxiv papers rl