Research | 2026-05-11
An Interpretable and Scalable Framework for Evaluating Large Language Models
Source: Arxiv CS.AI
arXiv:2605.07046v1 (Announce Type: cross)
Abstract: Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response...
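The abstract is truncated after "Item Response...", so the paper's actual model is not reproduced here. As a minimal sketch of the contrast the abstract draws, the snippet below compares plain average accuracy with a standard two-parameter logistic (2PL) Item Response Theory model, which gives each benchmark item its own difficulty and discrimination instead of weighting all items equally; all parameter values and names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: assumes a standard 2PL IRT item model, not the
# paper's specific framework. Item parameters and the ability value are made up.
import numpy as np


def average_accuracy(responses: np.ndarray) -> float:
    """Standard benchmark score: mean over all (resample, item) correctness flags."""
    return float(responses.mean())


def irt_2pl_prob(theta: float, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """P(correct on item j) under the 2PL model for a model with latent ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


rng = np.random.default_rng(0)
a = np.array([1.2, 0.8, 1.5, 1.0, 0.6])   # item discrimination (hypothetical)
b = np.array([-1.0, 0.0, 0.5, 1.5, 2.0])  # item difficulty (hypothetical)
theta = 0.7                                # latent ability of the evaluated LLM

p = irt_2pl_prob(theta, a, b)
# 20 stochastic resamples per item, reflecting the non-determinism of LLM outputs.
responses = rng.binomial(1, p, size=(20, 5))

print("average accuracy:", average_accuracy(responses))
print("per-item success probabilities:", np.round(p, 2))
```

Under this kind of model, two benchmarks with the same average accuracy can imply very different ability estimates once item difficulty and output stochasticity are accounted for, which is the gap the abstract points to.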