Research 2026-04-30

Test-Time Safety Alignment

arXiv:2604.26167v1 Announce Type: cross Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion...

Read Original Article on Arxiv CS.AI

arxiv papers safety