Policy · 2026-05-11

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

arXiv:2605.07331v1 (announce type: cross)

Abstract: Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy...

Read the original article on arXiv (cs.AI)
