LLMs Show Promise but Inconsistency on Scrum Certification Exams
Two new studies evaluate how well large language models answer Scrum certification-style questions. They report high accuracy alongside unstable answers and recurring error patterns, which raises concerns for AI-assisted exam preparation.
What Happened
Two recent preprints on arXiv examine the performance of large language models (LLMs) on questions similar to those found in Scrum certification exams, such as the Professional Scrum Master (PSM). The first study, "Comparing Large Language Models on Scrum Certification-Style Questions: Accuracy, Stability, and Error Patterns," benchmarks multiple LLMs on a curated set of Scrum questions, measuring not only accuracy but also the consistency of answers across repeated queries. The second study, "Prompting GPT-5 on Scrum Certification Questions: An Empirical Accuracy Study," focuses specifically on GPT-5, analyzing its accuracy and the types of errors it makes when answering Scrum knowledge questions.
Both studies find that LLMs can achieve high accuracy on these domain-specific questions, often above 80%. However, they also reveal significant instability: the same model may give different answers to the same question when asked multiple times. The error patterns suggest that models struggle with nuanced or context-dependent aspects of Scrum, such as telling apart the responsibilities of the Scrum Master and the Product Owner in certain scenarios.
Why It Matters
As AI tools become more integrated into professional training and certification preparation, understanding their limitations is crucial. Scrum certifications are widely recognized in software development and project management, and many practitioners use LLMs to study or even simulate exam conditions. If models are inconsistent or systematically wrong on certain topics, users may develop incorrect knowledge or overestimate their readiness.
The findings also highlight a broader challenge: LLMs can appear competent on standardized tests but lack the deep, contextual understanding required for real-world application. In Agile environments, where interpretation and adaptation are key, relying on an LLM's answer without critical thinking could lead to poor decisions.
Implications for AI Practitioners
For developers and trainers building AI-assisted learning tools, these studies underscore the need for:
- Robust evaluation beyond accuracy: Metrics like stability (consistency across runs) and error type analysis are essential to gauge true reliability.
- Domain-specific fine-tuning: General-purpose LLMs may need additional training on Scrum-specific materials to reduce errors and improve consistency.
- User guidance: Learners using these tools should be warned that LLM answers may vary between runs and should be cross-checked against official sources.
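To make the first point concrete, here is a minimal Python sketch of how accuracy and stability could be measured side by side when the same questions are put to a model several times. The article does not give the studies' exact metric definitions, so the three metrics below, the function name `evaluate_runs`, and the sample answers are all illustrative assumptions, not the authors' method or data.

```python
from collections import Counter


def evaluate_runs(answers_by_question, answer_key):
    """Summarize repeated model answers per question.

    answers_by_question: {question_id: [answer from run 1, run 2, ...]}
    answer_key:          {question_id: correct answer}
    The metric definitions are assumptions made for illustration.
    """
    n_answers = 0
    n_correct = 0
    n_stable = 0
    n_majority_correct = 0

    for qid, answers in answers_by_question.items():
        correct = answer_key[qid]
        n_answers += len(answers)
        n_correct += sum(a == correct for a in answers)

        counts = Counter(answers)
        # A question counts as stable when every run gave the same answer.
        if len(counts) == 1:
            n_stable += 1
        # Majority vote: the most frequent answer across runs.
        majority_answer, _ = counts.most_common(1)[0]
        if majority_answer == correct:
            n_majority_correct += 1

    n_questions = len(answers_by_question)
    return {
        "mean_accuracy": n_correct / n_answers,
        "stable_fraction": n_stable / n_questions,
        "majority_accuracy": n_majority_correct / n_questions,
    }


# Hypothetical data: four questions, three runs each.
answers = {
    "q1": ["B", "B", "B"],  # stable and correct
    "q2": ["A", "C", "A"],  # unstable, majority correct
    "q3": ["D", "D", "D"],  # stable but wrong
    "q4": ["A", "B", "B"],  # unstable, majority correct
}
key = {"q1": "B", "q2": "A", "q3": "C", "q4": "B"}

result = evaluate_runs(answers, key)
print(result)
```

In this made-up sample, question `q3` is answered identically on every run yet is wrong each time, which is why stability and accuracy are worth reporting as separate numbers: a consistent model is not necessarily a correct one.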
Key Takeaways
- LLMs achieve high accuracy (often >80%) on Scrum certification-style questions but show significant answer instability across repeated queries.
- Error patterns indicate struggles with nuanced or context-dependent Scrum concepts, which could mislead learners.
- AI practitioners should evaluate models on stability and error types, not just accuracy, when deploying for exam preparation.
- Users should verify LLM-generated answers against the official Scrum Guide and treat AI as a supplementary tool, not a primary source of truth.