2026-05-14
Teacher-Guided Policy Optimization for LLM Distillation
Source: Arxiv CS.AI
arXiv:2605.13230v1 Announce Type: cross
Abstract: The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical...
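For context on the Reverse KL objective the abstract refers to: RKL takes the expectation under the *student* distribution, which is what makes it suitable for on-policy distillation (the student's own samples are scored against the teacher). The sketch below is purely illustrative of the standard RKL definition, not the paper's method; the toy distributions are invented for the example.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """Reverse KL divergence D_KL(student || teacher).

    The sum is weighted by the student's probabilities, so the
    objective is mode-seeking: the student is penalized most for
    placing mass where the teacher places little.
    """
    return sum(q * math.log(q / p)
               for q, p in zip(student_probs, teacher_probs)
               if q > 0)

# Toy next-token distributions over a 3-token vocabulary
# (hypothetical numbers, for illustration only).
student = [0.7, 0.2, 0.1]
teacher = [0.6, 0.3, 0.1]
print(round(reverse_kl(student, teacher), 4))  # → 0.0268
```

Note that RKL is asymmetric: swapping the arguments gives the forward KL used in classical (off-policy) distillation, which is mass-covering rather than mode-seeking.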