Research 2026-05-14

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

arXiv:2604.02022v3. Abstract: Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain...

Source: arXiv cs.AI

arxiv papers agents benchmark safety