Research 2026-07-03

GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

Originally published by arXiv CS.AI

arXiv:2606.22737v2 (announce type: replace). Abstract: Before letting an agent operate over real context, can you prove it used the right evidence? GroundEval turns that question into a deterministic test of what the agent searched, fetched, cited, and was permitted to access. In one case study, two...

The GroundEval Shift: From Opinion to Audit in Agent Evaluation

The research community has long wrestled with a basic tension in evaluating AI agents: how do you know an agent actually used the right information, rather than just producing a plausible answer? The new preprint, GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation, proposes a departure from the dominant paradigm. Instead of asking a large language model to judge whether an agent’s output is correct, GroundEval asks a deterministic question: what did the agent search, fetch, and cite, and was it permitted to access that evidence?

The core idea is simple. Current evaluation methods for stateful agents (those that perform multi-step actions such as web searches, database queries, or file retrievals) rely heavily on LLM-as-judge, in which a second model scores the first model’s reasoning. That introduces its own problems: judge models can hallucinate, exhibit position bias, and cannot reliably verify factual grounding. GroundEval replaces this subjective assessment with a verifiable trace. By instrumenting the agent’s environment to log every search, fetch, and citation, the method produces a reproducible record of whether the agent’s decisions were grounded in evidence it was allowed to access.
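To make the idea concrete, here is a minimal sketch of a trace-based check. It is not the paper’s implementation: the TraceEvent type, the check_grounding function, the three check names, and the document ids are all hypothetical, chosen only to illustrate how a logged trace can be tested with plain set comparisons instead of a judge model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One logged agent action (hypothetical schema, not from the paper)."""
    kind: str    # "search", "fetch", or "cite"
    target: str  # query string or document id


def check_grounding(trace, permitted, required):
    """Run deterministic pass/fail checks over a logged trace.

    permitted: document ids the agent was allowed to access (assumed input)
    required:  document ids a correct answer must cite (assumed input)
    """
    fetched = {e.target for e in trace if e.kind == "fetch"}
    cited = {e.target for e in trace if e.kind == "cite"}
    return {
        # The agent fetched nothing outside its permissions.
        "fetched_only_permitted": fetched <= permitted,
        # Every citation points at something the agent actually fetched.
        "cited_only_fetched": cited <= fetched,
        # Every required piece of evidence was cited.
        "cited_all_required": required <= cited,
    }


trace = [
    TraceEvent("search", "refund policy"),
    TraceEvent("fetch", "doc-1"),
    TraceEvent("cite", "doc-1"),
]
result = check_grounding(trace, permitted={"doc-1", "doc-2"}, required={"doc-1"})
print(result)
```

Because the checks are set comparisons over a fixed log, the same trace always yields the same verdict, which is the property that separates this style of test from a sampled judge-model score.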

Why This Matters for AI Practitioners

For anyone building production agents, whether for customer support, legal research, or medical triage, this shift is significant. LLM-as-judge evaluation is brittle and expensive: it requires careful prompt engineering, often fails on edge cases, and provides no guarantee that the evaluation itself is correct. GroundEval offers a concrete alternative: a test that runs deterministically, can be audited by a human, and scales without the cost of additional model inference.

The case study in the paper illustrates why this matters in practice. Two agents, ostensibly performing the same task, produced different outputs. An LLM judge might have scored them similarly based on surface-level coherence. GroundEval, by contrast, could pinpoint that one agent never accessed a critical database, while the other fetched but ignored a key document. This granularity is valuable for debugging and compliance.
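The two failure modes described here (evidence never accessed versus evidence fetched but not used) can be told apart mechanically from a trace. The sketch below is illustrative only: the diagnose function, the status labels, the tuple-based trace format, and the two example agents are assumptions, not data or code from the paper.

```python
def diagnose(trace, required):
    """Classify each required document by how far the agent got with it.

    trace:    list of (kind, target) tuples, kind in {"search", "fetch", "cite"}
    required: document ids the task depends on (assumed to be known up front)
    """
    fetched = {target for kind, target in trace if kind == "fetch"}
    cited = {target for kind, target in trace if kind == "cite"}
    report = {}
    for doc in sorted(required):
        if doc not in fetched:
            report[doc] = "never_fetched"      # agent never accessed the source
        elif doc not in cited:
            report[doc] = "fetched_not_cited"  # agent retrieved it, then ignored it
        else:
            report[doc] = "grounded"
    return report


required = {"doc-1", "doc-2"}

# Hypothetical agent A skips doc-1 entirely.
agent_a = [("search", "contract terms"), ("fetch", "doc-2"), ("cite", "doc-2")]

# Hypothetical agent B fetches doc-1 but never cites it.
agent_b = [
    ("search", "contract terms"),
    ("fetch", "doc-1"),
    ("fetch", "doc-2"),
    ("cite", "doc-2"),
]

report_a = diagnose(agent_a, required)
report_b = diagnose(agent_b, required)
print(report_a)
print(report_b)
```

Both agents might produce a fluent answer citing doc-2, yet the reports differ on doc-1, which points to a retrieval problem in one case and a citation problem in the other.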

Implications for the Field

This work challenges the assumption that evaluation must be as complex as the system being evaluated. It suggests that for many agent tasks, the ground truth is not the output’s eloquence but the agent’s procedural fidelity. For practitioners, this means rethinking evaluation pipelines: instrumenting agents for traceability becomes as important as training them for accuracy. It also raises a strategic question: as agents become more autonomous, will regulators demand deterministic proof of reasoning, rather than probabilistic confidence scores?

Key Takeaways

Deterministic verification replaces subjective judgment: GroundEval checks what evidence an agent actually searched, fetched, and cited, avoiding the hallucination and bias risks of LLM-as-judge.
Traceability is the new accuracy metric: For stateful agents, the ability to audit every search, fetch, and citation is more valuable than a single output score.
Cost and reliability improve: Deterministic evaluation reduces reliance on expensive judge models and provides verifiable, human-auditable results.
Practical debugging becomes possible: Practitioners can pinpoint exactly where an agent failed to access or apply evidence, enabling targeted fixes rather than vague retraining.

