Policy | 2026-05-12
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
Source: arXiv cs.AI
arXiv:2605.10194v1 (announce type: new)
Abstract: On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on...