Research2026-05-14
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
Source: Arxiv CS.AI
arXiv:2605.12673v1 Announce Type: new Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in...
arxivpapersagentsbenchmark