2026-05-12

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Source: Arxiv CS.AI

arXiv:2605.10889v1 Announce Type: cross

Abstract: On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the...
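To make the abstract's phrase "dense, per-token supervision" concrete, here is a minimal sketch of one common form of on-policy distillation loss: the reverse KL divergence between the student's and teacher's next-token distributions, computed at every position of a student-sampled sequence. This is an illustration under assumptions, not the paper's method; the toy logits and function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary row."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each token position.

    Each row of logits is the model's next-token distribution at one
    position of a sequence sampled from the student (hence "on-policy").
    The result is one loss value per token, i.e. dense supervision.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s = softmax(s_row)
        p_t = softmax(t_row)
        kl = sum(ps * (math.log(ps) - math.log(pt))
                 for ps, pt in zip(p_s, p_t) if ps > 0)
        losses.append(kl)
    return losses

# Toy example: a 3-token sequence over a 4-word vocabulary (made-up logits).
student = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
teacher = [[2.0, 0.5, 0.1, -1.0], [0.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]
losses = per_token_reverse_kl(student, teacher)
```

The first position has identical student and teacher logits, so its loss is ~0; positions where the distributions diverge contribute a positive per-token penalty, which is what distinguishes this dense signal from a single sequence-level reward.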
