Research 2026-04-23
MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models
Source: arXiv cs.AI
arXiv:2604.19809v1 Announce Type: new
Abstract: We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000...
arxiv papers benchmark