BeClaude
Research | 2026-05-12

Interactive Benchmarks

Source: arXiv cs.AI

arXiv:2603.04737v2

Abstract: Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of...

arxiv, papers, benchmark