BeClaude
Policy · 2026-04-30

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Source: arXiv cs.AI

arXiv:2603.19470v2. Abstract: Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. The distribution gap between the inference and updated policies grows...

arxiv, papers