Policy 2026-05-14
Delightful Distributed Policy Gradient
Source: arXiv cs.AI
arXiv:2603.20521v2 Announce Type: replace-cross
Abstract: Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but...
arxivpapers