Research · 2026-05-07
Self-Mined Hardness for Safety Fine-Tuning
Source: arXiv cs.AI
arXiv:2605.03226v1 (announce type: cross)
Abstract: Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the...
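The abstract describes a self-mined hardness score: sample rollouts from the target model for each candidate prompt and count how often a judge flags them as harmful. Below is a minimal sketch of that scoring loop. The callables `sample_rollout` and `judge_harmful` are hypothetical placeholders (the paper's actual sampler and judge are not given here), and since the abstract is truncated, the assumption that fine-tuning selects the top-scoring prompts is an illustrative guess, not the paper's stated procedure.

```python
# Sketch of the self-mined hardness scoring described in the abstract.
# `sample_rollout` and `judge_harmful` are hypothetical stand-ins for the
# target model's sampler and the harmfulness judge.
from typing import Callable, List, Tuple


def hardness_score(
    prompt: str,
    sample_rollout: Callable[[str], str],
    judge_harmful: Callable[[str, str], bool],
    n_rollouts: int = 16,
) -> float:
    """Fraction of the model's own rollouts judged harmful for this prompt."""
    harmful = sum(
        judge_harmful(prompt, sample_rollout(prompt)) for _ in range(n_rollouts)
    )
    return harmful / n_rollouts


def mine_hard_prompts(
    candidates: List[str],
    sample_rollout: Callable[[str], str],
    judge_harmful: Callable[[str, str], bool],
    top_k: int = 100,
) -> List[Tuple[str, float]]:
    """Rank candidate prompts by hardness and keep the top k.

    Selecting the hardest prompts for fine-tuning is an assumption made
    for illustration; the truncated abstract does not specify the
    selection rule.
    """
    scored = [
        (p, hardness_score(p, sample_rollout, judge_harmful)) for p in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In this reading, hardness is purely a property of the target model's own behavior: no external adversarial dataset is curated, and the same prompt can score differently across models or checkpoints.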
Tags: arxiv, papers, safety, fine-tuning