Research · 2026-05-08

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

Source: Arxiv CS.AI

arXiv:2605.06213v1 (Announce Type: new)

Abstract: Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to every model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the...
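The ceiling and floor effects the abstract refers to can be made concrete with a minimal sketch. The scores, model names, and thresholds below are hypothetical, not taken from the paper; the point is only that a fixed item set stops separating models once their scores saturate at either end of the scale.

```python
# Minimal sketch (hypothetical numbers): why fixed benchmarks produce
# ceiling and floor effects. When every model sees the same item set,
# scores saturate near 1.0 or 0.0 and stop discriminating between models.

def effect(accuracy: float, ceiling: float = 0.95, floor: float = 0.05) -> str:
    """Classify a benchmark score as 'ceiling', 'floor', or 'informative'."""
    if accuracy >= ceiling:
        return "ceiling"      # benchmark too easy: capability gap is masked
    if accuracy <= floor:
        return "floor"        # benchmark too hard: capability gap is masked
    return "informative"

# Hypothetical accuracies of four models on one fixed benchmark.
scores = {"model_a": 0.97, "model_b": 0.99, "model_c": 0.52, "model_d": 0.03}

for name, acc in scores.items():
    print(f"{name}: {acc:.2f} -> {effect(acc)}")
```

Here model_a and model_b both hit the ceiling, so the fixed benchmark cannot tell them apart even if one is genuinely stronger; a dynamic evaluation would instead adapt item difficulty toward the region where scores remain informative.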

Tags: arxiv, papers, benchmark