Policy · 2026-05-12

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

arXiv:2605.10194v1

Abstract: On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on...

Read Original Article on Arxiv CS.AI
