Research · 2026-04-17

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Source: Arxiv CS.AI

arXiv:2506.03610v3 (announce type: replace). Abstract: Large Language Model (LLM) agents are reshaping the game industry by enabling more intelligent and human-preferable characters. Yet current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across...

Tags: arxiv, papers, agents, benchmark