Research 2026-04-23

MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

arXiv:2604.19809v1. Abstract: We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000...

Source: arXiv cs.AI

arxiv, papers, benchmark