Research · 2026-05-08

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

Source: Arxiv CS.AI

arXiv:2605.06213v1 (Announce Type: new)

Abstract: Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to every model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the...
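The ceiling and floor effects the abstract refers to can be made concrete with a minimal sketch. The scores, model names, and thresholds below are hypothetical, not taken from the paper; the point is only that a fixed item set stops separating models once their scores saturate at either end of the scale.

```python
# Minimal sketch (hypothetical numbers): why fixed benchmarks produce
# ceiling and floor effects. When every model sees the same item set,
# scores saturate near 1.0 or 0.0 and stop discriminating between models.

def effect(accuracy: float, ceiling: float = 0.95, floor: float = 0.05) -> str:
    """Classify a benchmark score as 'ceiling', 'floor', or 'informative'."""
    if accuracy >= ceiling:
        return "ceiling"      # benchmark too easy: capability gap is masked
    if accuracy <= floor:
        return "floor"        # benchmark too hard: capability gap is masked
    return "informative"

# Hypothetical accuracies of four models on one fixed benchmark.
scores = {"model_a": 0.97, "model_b": 0.99, "model_c": 0.52, "model_d": 0.03}

for name, acc in scores.items():
    print(f"{name}: {acc:.2f} -> {effect(acc)}")
```

Here model_a and model_b both hit the ceiling, so the fixed benchmark cannot tell them apart even if one is genuinely stronger; a dynamic evaluation would instead adapt item difficulty toward the region where scores remain informative.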

Tags: arxiv, papers, benchmark