Policy 2026-05-11
How Log-Barrier Helps Exploration in Policy Optimization
Source: arXiv cs.AI
arXiv:2603.15001v2 Announce Type: replace-cross Abstract: Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process,...
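For readers unfamiliar with the Stochastic Gradient Bandit (SGB) algorithm named in the abstract, the following is a minimal sketch of its commonly described form: a softmax policy over arm logits, updated by a sampled policy-gradient step with a constant learning rate. It is an illustration under stated assumptions, not the paper's method, and it does not include the log-barrier term from the title. The function names (`softmax`, `run_sgb`), the Bernoulli reward model, and the values of `eta`, `steps`, and `seed` are all hypothetical choices made for the example.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def run_sgb(mean_rewards, eta=0.1, steps=5000, seed=0):
    """Run a softmax stochastic gradient bandit with constant step size eta.

    Assumption: rewards are Bernoulli with the given means (illustrative only).
    Returns the final policy as a list of arm probabilities.
    """
    rng = random.Random(seed)
    k = len(mean_rewards)
    theta = [0.0] * k  # one logit per arm, uniform initial policy

    for _ in range(steps):
        pi = softmax(theta)
        # Sample an arm from the current policy.
        arm = rng.choices(range(k), weights=pi)[0]
        # Draw a Bernoulli reward for the sampled arm.
        reward = 1.0 if rng.random() < mean_rewards[arm] else 0.0
        # Sampled gradient of expected reward w.r.t. the logits:
        # reward * (1[a == arm] - pi[a]) for each arm a.
        for a in range(k):
            indicator = 1.0 if a == arm else 0.0
            theta[a] += eta * reward * (indicator - pi[a])

    return softmax(theta)


if __name__ == "__main__":
    policy = run_sgb([0.2, 0.5, 0.8])
    print([round(p, 3) for p in policy])
```

The update leaves the sum of the logits unchanged at every step, so only their differences, and hence the policy, evolve. The learning rate is held constant throughout, which is the setting the abstract refers to.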
arxiv, papers