Research | 2026-05-06

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

arXiv:2605.01847v1. Abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through...

Read the original article on arXiv cs.AI

arxiv, papers, agents, benchmark