Research 2026-05-11

Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

arXiv:2605.07342v1 Announce Type: cross

Abstract: Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol...

Read the original article on arXiv cs.AI
