Research | 2026-05-11
An Interpretable and Scalable Framework for Evaluating Large Language Models
Source: Arxiv CS.AI
arXiv:2605.07046v1 (Announce Type: cross)
Abstract: Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response...
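The abstract is truncated after "Item Response...", so the paper's actual model is not reproduced here. As a minimal sketch of the contrast the abstract draws, the snippet below compares plain average accuracy with a standard two-parameter logistic (2PL) Item Response Theory model, which gives each benchmark item its own difficulty and discrimination instead of weighting all items equally; all parameter values and names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: assumes a standard 2PL IRT item model, not the
# paper's specific framework. Item parameters and the ability value are made up.
import numpy as np


def average_accuracy(responses: np.ndarray) -> float:
    """Standard benchmark score: mean over all (resample, item) correctness flags."""
    return float(responses.mean())


def irt_2pl_prob(theta: float, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """P(correct on item j) under the 2PL model for a model with latent ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


rng = np.random.default_rng(0)
a = np.array([1.2, 0.8, 1.5, 1.0, 0.6])   # item discrimination (hypothetical)
b = np.array([-1.0, 0.0, 0.5, 1.5, 2.0])  # item difficulty (hypothetical)
theta = 0.7                                # latent ability of the evaluated LLM

p = irt_2pl_prob(theta, a, b)
# 20 stochastic resamples per item, reflecting the non-determinism of LLM outputs.
responses = rng.binomial(1, p, size=(20, 5))

print("average accuracy:", average_accuracy(responses))
print("per-item success probabilities:", np.round(p, 2))
```

Under this kind of model, two benchmarks with the same average accuracy can imply very different ability estimates once item difficulty and output stochasticity are accounted for, which is the gap the abstract points to.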