Policy 2026-05-11

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Source: Arxiv CS.AI

arXiv:2602.10693v3 | Announce Type: replace-cross

Abstract: Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an...
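To make the problem concrete, the sketch below illustrates the naive sequence-level importance ratio the abstract alludes to: under off-policy updates, per-token probability ratios between the current policy and the stale rollout policy compound multiplicatively over a sequence, which can produce exploding or vanishing weights. This is a generic illustration of importance sampling for LLM RL, not the paper's VESPO method; the function name and the clipping stabilizer are assumptions for the example.

```python
import math

def sequence_importance_ratio(logp_new, logp_old, clip=10.0):
    """Clipped sequence-level importance ratio
    prod_t pi_new(a_t | s_t) / pi_old(a_t | s_t),
    computed in log space for numerical stability.

    logp_new / logp_old: per-token log-probabilities of the sampled
    tokens under the current policy and the (stale) rollout policy.
    Clipping is a common, simple stabilizer; it is not the variational
    correction proposed in the paper.
    """
    log_ratio = sum(n - o for n, o in zip(logp_new, logp_old))
    return min(math.exp(log_ratio), clip)

# Small numeric example: the new policy is slightly more confident on
# every token, so small per-token shifts compound over the sequence.
logp_old = [-2.0, -1.5, -3.0]
logp_new = [-1.8, -1.4, -2.9]
ratio = sequence_importance_ratio(logp_new, logp_old)
print(round(ratio, 4))  # exp(0.2 + 0.1 + 0.1) ~= 1.4918
```

For realistic generation lengths (hundreds or thousands of tokens), even tiny per-token mismatches drive this product toward 0 or infinity, which is why naive importance sampling is unstable and motivates sequence-level corrections like the one the paper proposes.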
