Research · 2026-04-22
Benchmarking Misuse Mitigation Against Covert Adversaries
Source: Arxiv CS.AI
arXiv:2506.06414v2 · Announce Type: replace-cross

Abstract: Existing language model safety evaluations focus on overt attacks and low-stakes tasks. In reality, an attacker can easily subvert existing safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because...
Tags: arxiv, papers, benchmark