Research 2026-05-06
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Source: arXiv cs.AI
arXiv:2604.06132v2 (Announce Type: replace)
Abstract: Large language models are increasingly deployed as autonomous agents for multi-step workflows in real-world software environments. However, existing agent benchmarks are limited by trajectory-opaque grading, underspecified safety and robustness...
arxiv papers agents