BeClaude
Policy 2026-05-01

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Source: Arxiv CS.AI

arXiv:2604.28123v1 (announce type: cross)

Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that...
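The two-stage recipe the abstract describes can be illustrated with a deliberately tiny toy, not the paper's PRISM method: stage one does cross-entropy (SFT) updates toward a demonstrated action, and stage two does on-policy REINFORCE updates weighted by a verifiable 0/1 reward (RLVR). The tabular softmax policy, the verifier, and all hyperparameters here are illustrative assumptions, not from the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(logits, demo, lr=0.5):
    """One cross-entropy gradient step toward the demonstrated action (SFT)."""
    probs = softmax(logits)
    return [z - lr * (p - (1.0 if a == demo else 0.0))
            for a, (z, p) in enumerate(zip(logits, probs))]

def rlvr_step(logits, verifier, lr=0.5, rng=random):
    """Sample on-policy, score with a verifiable reward, REINFORCE update (RLVR)."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = verifier(action)  # verifiable reward: 1 if correct, else 0
    return [z + lr * reward * ((1.0 if a == action else 0.0) - p)
            for a, (z, p) in enumerate(zip(logits, probs))]

random.seed(0)
logits = [0.0, 0.0, 0.0]
# Stage 1: SFT on curated demonstrations (demonstrated action is 2).
for _ in range(50):
    logits = sft_step(logits, demo=2)
# Stage 2: RLVR with a verifier that rewards only the correct action.
for _ in range(200):
    logits = rlvr_step(logits, verifier=lambda a: 1 if a == 2 else 0)
print(round(softmax(logits)[2], 3))
```

In this toy, stage one already concentrates the policy on the demonstrated action, and stage two sharpens it further; the distributional-drift concern the abstract raises is about SFT on a real LMM, where demonstrations come from a different distribution than the model's own outputs.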

arxiv papers rl multimodal