BeClaude
Research | 2026-07-01

Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code

Originally published by arXiv cs.AI

arXiv:2606.30689v1 Announce Type: cross Abstract: Spec-Driven Development (SDD) frameworks guide Large Language Model (LLM)-powered code generation through formal specifications, yet they differ fundamentally in how they enforce traceability between requirements and generated code. This paper...

A New Lens on LLM Code Generation: Traceability as a Quality Signal

The paper "Citation Discipline in Spec-Driven Development" tackles a subtle but critical problem in AI-assisted software engineering: how to ensure that code generated by large language models (LLMs) actually adheres to the specification it was given. The authors present a cross-model empirical study of "output determinism" and "automated hallucination detection" within Spec-Driven Development (SDD) frameworks.

At its core, the research introduces an evaluation dimension it calls citation discipline. Just as academic papers cite sources to support claims, the authors argue that LLM-generated code should be able to "cite" the specific requirements it implements. When a model cannot reliably map its output back to the original spec, that output is a candidate hallucination. The study likely compares multiple LLMs (e.g., GPT-4, Claude, Llama) across different SDD frameworks to measure how consistently they produce code that can be traced to formal requirements.

Why this matters. The software industry is rapidly adopting AI code generation, but trust remains the primary barrier to production deployment. Current evaluation metrics (pass@k, functional correctness, test coverage) measure what the code does, but not why it does it. This paper shifts the focus from output correctness to output attribution. If a developer cannot determine which requirement a line of code satisfies, they cannot confidently debug, audit, or maintain that code. This is especially critical in regulated industries (finance, healthcare, aerospace), where traceability is often a regulatory requirement rather than just a best practice.

The concept of "output determinism" is also significant. LLM output is non-deterministic under typical sampling settings: the same prompt can yield different outputs. The paper likely quantifies how much variation exists across runs and models, and whether SDD frameworks reduce this variance. For practitioners, this means understanding that "reproducible code generation" is not guaranteed; it must be engineered through disciplined specification practices.
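One simple way to put a number on run-to-run variation is sketched below, assuming the outputs of repeated runs of the same prompt have already been collected as strings. The `modal_share` measure (the fraction of runs that produced the most common output) is an illustrative choice, not the metric the paper defines, and the whitespace normalization is likewise an assumption.

```python
import hashlib
from collections import Counter


def normalize(code: str) -> str:
    """Drop trailing whitespace and blank lines so cosmetic noise is ignored."""
    lines = [line.rstrip() for line in code.splitlines()]
    return "\n".join(line for line in lines if line)


def determinism_report(outputs: list) -> dict:
    """Summarize how much a set of repeated generations agree with each other."""
    if not outputs:
        raise ValueError("need at least one output")
    digests = [
        hashlib.sha256(normalize(o).encode("utf-8")).hexdigest() for o in outputs
    ]
    counts = Counter(digests)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "runs": len(outputs),
        "distinct": len(counts),                         # 1 means fully deterministic
        "modal_share": most_common_count / len(outputs),  # 1.0 means fully deterministic
    }


if __name__ == "__main__":
    runs = ["x = 1\n", "x = 1   \n\n", "x = 2\n", "x = 1"]
    print(determinism_report(runs))
```

Exact-match hashing is deliberately strict: two outputs that differ only in a variable name count as different. A looser comparison (for example on parsed syntax trees) would be a different design choice with different trade-offs.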

Implications for AI practitioners. First, this research suggests that teams should adopt SDD frameworks that enforce bidirectional traceability between requirements and generated code. Tools like LangChain's structured output, Anthropic's tool use, or custom parser-based validators could be extended to include citation metadata. Second, automated hallucination detection becomes more actionable: instead of relying on vague heuristics, teams can flag any code block that fails to cite its originating requirement. Third, the cross-model comparison provides a practical benchmark—practitioners can choose models not just on raw coding ability, but on citation discipline as a quality filter.
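The bidirectional check described above might look like the following sketch. It assumes the model is asked to return JSON of the form `{"blocks": [{"name", "code", "implements"}]}`. That schema, the `CodeBlock` type, and both helper functions are hypothetical and are not part of LangChain, Anthropic's tool use, or any framework named in this article.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeBlock:
    name: str
    code: str
    implements: tuple  # requirement IDs the model claims this block satisfies


def parse_blocks(raw: str) -> list:
    """Parse the assumed JSON response shape into CodeBlock records."""
    return [
        CodeBlock(b["name"], b["code"], tuple(b.get("implements", [])))
        for b in json.loads(raw)["blocks"]
    ]


def traceability_gaps(blocks: list, spec_ids: set) -> dict:
    """Check both directions: code -> requirement and requirement -> code."""
    cited = {req for block in blocks for req in block.implements}
    return {
        # Blocks with no citation to any requirement that exists in the spec.
        "uncited_blocks": sorted(
            b.name for b in blocks if not set(b.implements) & spec_ids
        ),
        # Requirements that no block claims to implement.
        "unimplemented_requirements": sorted(spec_ids - cited),
    }


if __name__ == "__main__":
    response = json.dumps({"blocks": [
        {"name": "login", "code": "def login(): ...", "implements": ["REQ-1"]},
        {"name": "helper", "code": "def helper(): ...", "implements": []},
    ]})
    print(traceability_gaps(parse_blocks(response), {"REQ-1", "REQ-2"}))
```

Carrying citations as structured metadata rather than free text keeps the check mechanical: a block that cites nothing valid can be flagged for review, and a requirement that nothing cites shows up as a coverage gap.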

The paper implicitly warns against treating LLMs as black-box code generators. The future of AI-assisted development lies not in generating more code, but in generating accountable code. This research provides a framework for measuring that accountability.

Key Takeaways

  • Traceability is a measurable quality metric. Citation discipline offers a concrete way to evaluate whether LLM-generated code truly implements specified requirements, reducing blind trust in outputs.
  • Hallucination detection can be automated. By requiring code to cite its spec origins, teams can systematically flag untraceable outputs as potential hallucinations, moving beyond manual review.
  • Model selection should include determinism and attribution. Practitioners should benchmark models not just on pass rates, but on how consistently they produce spec-traceable code across multiple runs.
  • SDD frameworks need citation-aware tooling. Existing LLM orchestration tools should be extended to enforce and validate requirement-to-code citations as a first-class feature.