Policy | 2026-05-06
Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
Source: Arxiv CS.AI
arXiv:2604.28123v2 Announce Type: replace-cross Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional...
Tags: arxiv, papers, multimodal