2026-04-28

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Source: Arxiv CS.AI

arXiv:2603.25562v2 Announce Type: replace-cross

Abstract: On-policy distillation (OPD) is increasingly used in LLM post-training because it can leverage a teacher model to provide dense supervision on student rollouts. The standard implementation, however, usually reduces distribution matching to a...
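The abstract describes OPD as dense teacher supervision on the student's own rollouts. A common way this is instantiated (an assumption here, since the abstract is truncated and does not specify the paper's objective) is a per-token reverse KL between the student's and teacher's next-token distributions, scored on tokens the student sampled. A minimal NumPy sketch of that loss:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl_per_token(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher) over the vocabulary.

    Both inputs have shape (num_tokens, vocab_size): logits from the
    student and teacher scoring the SAME student-sampled rollout.
    """
    p_s = softmax(student_logits)
    log_p_s = np.log(p_s)
    log_p_t = np.log(softmax(teacher_logits))
    return (p_s * (log_p_s - log_p_t)).sum(axis=-1)

# Toy rollout: 4 tokens over a vocabulary of 5. In practice the logits
# come from forward passes of both models on the student's rollout.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 5))
teacher = rng.normal(size=(4, 5))
loss = reverse_kl_per_token(student, teacher).mean()
```

Averaging the per-token KL gives a dense training signal at every position of the rollout, in contrast to sequence-level rewards; this sketch omits the sampling loop and gradient step.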
