Research | 2026-04-30

Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex

arXiv:2604.14858v2. Abstract: As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and...

Read the original article on arXiv cs.AI

Tags: arxiv, papers, benchmark, safety