Research · 2026-05-07
Self-Mined Hardness for Safety Fine-Tuning
Source: arXiv cs.AI
arXiv:2605.03226v1 (announce type: cross)
Abstract: Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the...
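The abstract describes a self-mined hardness score: sample rollouts from the target model for each candidate prompt and count how often a judge flags them as harmful. Below is a minimal sketch of that scoring loop. The callables `sample_rollout` and `judge_harmful` are hypothetical placeholders (the paper's actual sampler and judge are not given here), and since the abstract is truncated, the assumption that fine-tuning selects the top-scoring prompts is an illustrative guess, not the paper's stated procedure.

```python
# Sketch of the self-mined hardness scoring described in the abstract.
# `sample_rollout` and `judge_harmful` are hypothetical stand-ins for the
# target model's sampler and the harmfulness judge.
from typing import Callable, List, Tuple


def hardness_score(
    prompt: str,
    sample_rollout: Callable[[str], str],
    judge_harmful: Callable[[str, str], bool],
    n_rollouts: int = 16,
) -> float:
    """Fraction of the model's own rollouts judged harmful for this prompt."""
    harmful = sum(
        judge_harmful(prompt, sample_rollout(prompt)) for _ in range(n_rollouts)
    )
    return harmful / n_rollouts


def mine_hard_prompts(
    candidates: List[str],
    sample_rollout: Callable[[str], str],
    judge_harmful: Callable[[str, str], bool],
    top_k: int = 100,
) -> List[Tuple[str, float]]:
    """Rank candidate prompts by hardness and keep the top k.

    Selecting the hardest prompts for fine-tuning is an assumption made
    for illustration; the truncated abstract does not specify the
    selection rule.
    """
    scored = [
        (p, hardness_score(p, sample_rollout, judge_harmful)) for p in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In this reading, hardness is purely a property of the target model's own behavior: no external adversarial dataset is curated, and the same prompt can score differently across models or checkpoints.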
Tags: arxiv, papers, safety, fine-tuning