
Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Source: Arxiv CS.AI

arXiv:2604.28123v2 Announce Type: replace-cross

Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations, followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional...
