Research2026-04-20

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

arXiv:2602.05523v2 Announce Type: replace-cross Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agent robustness and generalisation across...

Read Original Article on Arxiv CS.AI

arxivpapersagents