BeClaude
Policy, 2026-07-01

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

Originally published by Arxiv CS.AI

arXiv:2606.30923v1. Announce Type: cross. Abstract: Imitation Learning is a natural framework for learning in sequential decision-making systems and has emerged as the dominant paradigm through which we understand language model training. A central puzzle is that, while in theory offline IL can be...

The Limits of Behavior Cloning in Noisy Environments

A new arXiv paper (2606.30923v1) challenges a foundational assumption in imitation learning (IL): that offline behavior cloning (BC) is sufficient for training effective sequential decision-making systems, including large language models. The authors show that when expert demonstrations contain noise, whether from suboptimal human choices, labeling errors, or inherent stochasticity, offline BC degrades significantly. They further prove that on-policy distillation, in which the learner generates its own trajectories and compares them against a reference policy, is optimal for handling noisy expert feedback.

This is a technical but important result. Behavior cloning treats demonstrations as ground truth: it minimizes the divergence between the learner's actions and the expert's recorded actions, on the states the expert visited. If those demonstrations are imperfect, the learner inherits the noise. On-policy distillation, by contrast, lets the learner act on its own, then corrects its mistakes relative to a cleaner reference signal on the states it actually reaches. Repeating that loop is what filters out the noise.
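To make the contrast concrete, here is a minimal Python sketch of the two objectives on a toy tabular problem. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the two-state environment, the demo data, and the choice of per-step KL(learner || reference) as the distillation loss are all hypothetical. What it shows is where each loss is evaluated. BC scores the learner on logged state-action pairs, while the on-policy loss scores it on states reached by the learner's own rollouts.

```python
import math
import random

def bc_loss(policy, demos):
    """Offline BC: mean negative log-likelihood of logged (state, action) pairs."""
    return -sum(math.log(policy[s][a]) for s, a in demos) / len(demos)

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists (assumes q has full support)."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def on_policy_distill_loss(policy, reference, step, start, horizon, n_rollouts, rng):
    """Mean KL(learner || reference) over states visited by the learner's own rollouts."""
    total, visited = 0.0, 0
    for _ in range(n_rollouts):
        state = start
        for _ in range(horizon):
            total += kl(policy[state], reference[state])
            visited += 1
            # The learner, not the expert, chooses the action that determines the next state.
            action = rng.choices(range(len(policy[state])), weights=policy[state])[0]
            state = step(state, action)
    return total / visited

def step(state, action):
    """Hypothetical toy dynamics: the next state is simply the chosen action."""
    return action

# Hypothetical toy setup: two states, two actions.
learner = {0: [0.5, 0.5], 1: [0.5, 0.5]}
reference = {0: [0.9, 0.1], 1: [0.1, 0.9]}
noisy_demos = [(0, 0), (0, 0), (0, 1), (1, 1)]  # (0, 1) stands in for a noisy label

rng = random.Random(0)
print("BC loss:", bc_loss(learner, noisy_demos))
print("on-policy distillation loss:",
      on_policy_distill_loss(learner, reference, step, 0, 5, 20, rng))
```

Note that the BC loss never consults the reference policy and the on-policy loss never consults the demos. That separation is the design point of the sketch, and a real pipeline might combine the two.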

Why This Matters

The paper directly addresses a practical pain point: real-world expert data is almost never pristine. In robotics, human teleoperation introduces jitter. In autonomous driving, human drivers make occasional errors. In language model training, human raters disagree on what constitutes a "good" response, and RLHF pipelines often rely on noisy preference labels. The currently dominant approach, scraping massive datasets and applying BC, implicitly treats that data as ground truth. This work suggests that scaling alone won't fix the noise problem; the training algorithm itself must change.

For AI practitioners, the implication is that if your expert demonstrations contain systematic noise (and they almost certainly do), offline BC is leaving performance on the table. On-policy distillation is more computationally expensive because it requires online rollouts, but it offers a principled way to recover cleaner behavior. This is consistent with empirical reports from RLHF in which on-policy methods such as PPO can outperform offline methods such as DPO when human feedback is noisy.

Implications for AI Practitioners

First, audit your data for noise. If your demonstrations come from multiple annotators, varying skill levels, or automated processes, expect BC trained on them to degrade. Second, consider hybrid approaches: use BC for initial pretraining, then switch to on-policy distillation for fine-tuning. This loosely mirrors the common LLM recipe of pretraining on web text and then applying RLHF. Third, budget for compute. On-policy methods generate rollouts during training, which costs more than offline BC. However, the paper suggests this cost is justified when noise is non-negligible.
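One simple form the audit step could take is measuring how often annotators disagree on the same item. The sketch below assumes labels are stored per item as a list of categorical judgments. The data layout, the function names, and the example labels are hypothetical, and disagreement is only a proxy for noise, not a measure the paper prescribes.

```python
from collections import Counter

def majority_agreement(labels):
    """Fraction of one item's labels that match its most common label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)

def audit_label_noise(labels_by_item):
    """Summarize annotator disagreement over items that have at least two labels."""
    multi = {k: v for k, v in labels_by_item.items() if len(v) >= 2}
    if not multi:
        return {"items_checked": 0, "disagreement_rate": None, "mean_agreement": None}
    scores = [majority_agreement(v) for v in multi.values()]
    disputed = sum(1 for s in scores if s < 1.0)
    return {
        "items_checked": len(multi),
        "disagreement_rate": disputed / len(multi),
        "mean_agreement": sum(scores) / len(scores),
    }

# Hypothetical example: three prompts, each labeled by one or more raters.
labels = {
    "prompt-1": ["good", "good", "good"],
    "prompt-2": ["good", "bad"],
    "prompt-3": ["bad"],  # a single label cannot be checked for agreement
}
report = audit_label_noise(labels)
print(report)
```

A high disagreement rate does not say which label is wrong, only that the demonstrations should not be treated as ground truth without further checks.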

The paper does not claim that BC is useless; it remains a strong baseline for clean data. But for real-world systems where noise is inevitable, the optimality of on-policy distillation provides a rigorous justification for the shift toward interactive, self-correcting training paradigms.

Key Takeaways

  • Offline behavior cloning is provably suboptimal when expert demonstrations contain noise, as it propagates errors into the learner's policy.
  • On-policy distillation, where the learner generates its own trajectories and compares against a reference, is optimal for handling noisy expert feedback.
  • Practitioners should audit their demonstration data for noise and consider hybrid training pipelines that combine offline BC with on-policy fine-tuning.
  • The computational cost of on-policy methods is justified when data quality is imperfect, offering a principled path to more robust imitation learning.
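The hybrid pipeline in the takeaways (BC first, then on-policy fine-tuning) can be sketched on a single-state toy problem. This is a hedged illustration, not the paper's method. It assumes a softmax policy over two actions, a reference policy that is cleaner than the demonstrations, and an exact gradient of KL(learner || reference) in place of sampled rollouts, which hides the rollout cost a real system would pay. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def bc_step(logits, demo_freqs, lr):
    """One gradient step on the BC loss (cross-entropy to the demo action frequencies)."""
    p = softmax(logits)
    return [z - lr * (pi - di) for z, pi, di in zip(logits, p, demo_freqs)]

def distill_step(logits, reference, lr):
    """One exact gradient step on KL(learner || reference) with respect to the logits."""
    p = softmax(logits)
    div = kl(p, reference)
    grads = [pi * (math.log(pi) - math.log(qi) - div) for pi, qi in zip(p, reference)]
    return [z - lr * g for z, g in zip(logits, grads)]

def hybrid_train(demo_freqs, reference, bc_steps=500, distill_steps=500, lr=0.5):
    """Phase 1: fit the demos with BC. Phase 2: fine-tune against the reference."""
    logits = [0.0] * len(demo_freqs)
    for _ in range(bc_steps):
        logits = bc_step(logits, demo_freqs, lr)
    after_bc = softmax(logits)
    for _ in range(distill_steps):
        logits = distill_step(logits, reference, lr)
    return after_bc, softmax(logits)

# Hypothetical numbers: noisy demos pick action 0 only 60% of the time,
# while the assumed reference policy picks it 90% of the time.
demo_freqs = [0.6, 0.4]
reference = [0.9, 0.1]
after_bc, after_distill = hybrid_train(demo_freqs, reference)
print("policy after BC phase:     ", after_bc)
print("policy after distill phase:", after_distill)
```

In this toy, the BC phase reproduces the noisy demo frequencies and the second phase moves the policy toward the reference. Whether a sufficiently clean reference is available in practice is the key assumption.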