MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

Publisher summary (verbatim)

arXiv:2604.19809v1 (announce type: cross)

Abstract: We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,00
