Research | 2026-05-14
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
Source: arXiv cs.AI
arXiv:2604.02022v3 (Announce Type: replace)
Abstract: Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain...
Tags: arxiv, papers, agents, benchmark, safety