Policy 2026-04-30
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Source: arXiv cs.AI
arXiv:2603.19470v2 (Announce Type: replace-cross)
Abstract: Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. The distribution gap between the inference and updated policies grows...