Research | 2026-06-24

Legal Reasoning Is Not Lawyering: Rethinking Legal Benchmarks for Pro Se Access to Justice

arXiv:2606.23716v1. Abstract: Legal AI benchmark research frequently invokes the assumption that large language models can improve access to justice, including for people who cannot access lawyers in order to understand and exercise their legal rights. We argue that current...

The Gap Between Legal Reasoning and Lawyering

A new arXiv preprint (2606.23716v1) challenges a foundational assumption in legal AI research: that better model performance on legal reasoning benchmarks automatically translates into better access to justice for pro se litigants. The authors argue that current benchmarks conflate “legal reasoning” with the full spectrum of “lawyering” tasks, and that this conflation has significant practical consequences.

What the Research Actually Claims

The paper’s core insight is deceptively simple. Legal reasoning benchmarks typically test a model’s ability to parse statutes, apply rules to facts, or answer multiple-choice bar-style questions. But pro se individuals—people representing themselves without lawyers—face a different set of challenges. They need help navigating procedural rules, completing court forms, understanding jurisdiction, managing deadlines, and communicating effectively with judges and opposing parties. These are process-oriented, contextual, and often emotionally charged tasks that current benchmarks do not capture.

The authors do not claim that LLMs are useless for access to justice. Rather, they argue that the field is measuring the wrong thing. A model that scores 90% on a legal reasoning dataset may still fail catastrophically when a user asks, “How do I file a motion to dismiss in small claims court in Texas?”—a question that requires procedural knowledge, local rule awareness, and plain-language explanation, not abstract reasoning about contract law.

Why This Matters for AI Practitioners

For developers building legal AI tools, this paper is a necessary corrective. The temptation to optimize for benchmark scores is strong—they are quantifiable, publishable, and easy to compare. But if those benchmarks do not reflect real user needs, optimization becomes a form of misalignment.

Practitioners should consider three concrete implications:

First, benchmark design must be user-centered. Instead of testing only legal reasoning, researchers should create datasets that simulate common pro se tasks: form completion, deadline calculation, plain-language summarization of court rules, and triage of legal issues. These tasks are messier to evaluate but far more relevant to the people the tools are meant to serve.

Second, deployment risk is higher than benchmark scores suggest. A model that passes a legal reasoning test may still give procedurally incorrect advice—for example, telling a user to file a document in the wrong court or by the wrong method. Such errors can cause real harm, including case dismissal or missed deadlines.

Third, evaluation must include non-lawyer feedback. The paper implicitly calls for involving actual pro se litigants in testing. What a lawyer considers a good explanation may be incomprehensible to someone without legal training. Measuring user comprehension and task completion is more informative than measuring accuracy against a legal expert’s answer key.
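One way to picture the difference between answer-key accuracy and user-centered measures is a small scoring sketch. The Python below is an illustration only, not the paper's method: the `TrialResult` schema, its field names, and the `summarize` helper are all assumptions made for this example. It reports three rates per task type, so that an output a lawyer would mark correct can still show up as a failure when the user did not understand it or could not finish the task.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TrialResult:
    """One non-expert participant attempting one pro se task with model help.

    Hypothetical schema for illustration; the paper does not prescribe it.
    """
    task_type: str            # e.g. "form_completion", "deadline_calculation"
    matches_answer_key: bool  # model output agreed with a lawyer-written key
    user_understood: bool     # participant passed a short comprehension check
    task_completed: bool      # participant finished the task correctly


def summarize(results):
    """Return per-task-type rates for each measure, as fractions in [0, 1]."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.task_type].append(r)

    summary = {}
    for task_type, rows in grouped.items():
        n = len(rows)
        summary[task_type] = {
            "n": n,
            "answer_key_accuracy": sum(r.matches_answer_key for r in rows) / n,
            "comprehension_rate": sum(r.user_understood for r in rows) / n,
            "completion_rate": sum(r.task_completed for r in rows) / n,
        }
    return summary


if __name__ == "__main__":
    # Made-up placeholder records, only to show the shape of the output.
    demo = [
        TrialResult("form_completion", True, False, False),
        TrialResult("form_completion", True, True, True),
        TrialResult("deadline_calculation", True, True, True),
    ]
    for task_type, stats in summarize(demo).items():
        print(task_type, stats)
```

Keeping the three rates separate, rather than folding them into one score, is a deliberate choice in this sketch: a gap between answer-key accuracy and completion rate for a given task type is exactly the kind of signal a single benchmark number would hide.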

Key Takeaways

Current legal AI benchmarks overemphasize abstract reasoning and underrepresent the procedural, contextual tasks that pro se litigants actually need help with.
High performance on legal reasoning datasets does not guarantee useful or safe performance in real-world access-to-justice applications.
AI practitioners should develop task-specific benchmarks for form completion, procedural guidance, and plain-language explanation, tested with non-expert users.
Misalignment between benchmark metrics and user needs creates deployment risks that could undermine the very goal of improving access to justice.

Source: arXiv cs.AI
