Research | 2026-06-30

StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

Originally published by arXiv cs.AI

arXiv:2507.07445v3. Abstract: Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on...

What Happened

Researchers have released StarDojo, a benchmark framework that evaluates multimodal LLMs within the farming simulation game Stardew Valley. Unlike conventional benchmarks that test isolated skills like object recognition or text comprehension, StarDojo places AI agents in a persistent, open-ended environment where they must simultaneously manage production tasks (farming, mining, resource gathering) and social interactions (building relationships with NPCs, participating in community events). The benchmark tracks long-horizon behaviors—such as planning crop cycles across seasons or maintaining friendships over weeks of in-game time—rather than single-shot task completion.

Why It Matters

Current AI evaluation is dominated by narrow, static benchmarks. A model might ace a visual question-answering dataset or a coding challenge, yet fail catastrophically when asked to coordinate multiple objectives in a dynamic world. StarDojo addresses this gap by demanding that agents exhibit what the authors call production-living competence: the ability to balance economic productivity with social and emotional intelligence.

This matters because the next generation of AI assistants, whether for robotics, virtual worlds, or real-world planning, will need to operate in environments where goals are not pre-defined and where trade-offs are unavoidable. Should an agent prioritize harvesting crops before the season ends and they die, or spend the day marking a villager's birthday to build friendship? StarDojo forces models to make such decisions, revealing whether they can handle the messy, multi-objective reality that humans navigate daily.
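One way to picture such a trade-off is as a weighted score over competing objectives. The sketch below is a toy illustration only: the action names, payoff numbers, and weights are invented for the example, and nothing here reflects StarDojo's actual interface or scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    gold: float        # hypothetical economic payoff
    friendship: float  # hypothetical social payoff


def choose(actions, w_gold, w_friendship):
    """Return the action with the highest weighted sum of payoffs."""
    return max(actions, key=lambda a: w_gold * a.gold + w_friendship * a.friendship)


# Made-up payoffs for two competing uses of the same in-game day.
ACTIONS = [
    Action("harvest_crops", gold=300.0, friendship=0.0),
    Action("celebrate_birthday", gold=-50.0, friendship=80.0),
]

# With equal weights the harvest wins (300 vs. 30).
print(choose(ACTIONS, w_gold=1.0, w_friendship=1.0).name)
# Weighting friendship five times as heavily flips the choice (300 vs. 350).
print(choose(ACTIONS, w_gold=1.0, w_friendship=5.0).name)
```

The point of the sketch is that the "right" action depends entirely on how the objectives are weighted, and an open-ended environment gives the agent no such weights up front.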

The choice of Stardew Valley is particularly astute. The game is complex enough to require genuine planning and memory, yet constrained enough to allow reproducible evaluation. It also introduces a social dimension rarely tested in AI benchmarks, which typically ignore the fact that human environments are fundamentally social.

Implications for AI Practitioners

For developers building agentic systems, StarDojo highlights several critical gaps in current models. First, long-term memory and planning remain weak points. Many LLMs can generate a reasonable sequence of actions for a single day, but fail to maintain coherent strategies across a full in-game season. Second, social reasoning is often brittle. Models may correctly identify a villager's favorite gift but still misjudge when and how often to give it, a kind of judgment that human players pick up quickly.

Practitioners should view StarDojo as a stress test for their agent architectures. If a model cannot manage a simplified farm-and-social simulation, it is unlikely to handle real-world tasks like coordinating a team project or managing a household budget. The benchmark also suggests that future systems may need dedicated modules for goal persistence, social modeling, and value-based decision-making, capabilities that current end-to-end LLMs appear to handle poorly.
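As a rough illustration of what a goal-persistence module might look like, the sketch below keeps multi-day goals in a small store that outlives any single day's planning call. The class names, method names, and example goals are all hypothetical, and the design is an assumption for illustration, not something described in the StarDojo paper.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    description: str
    deadline_day: int   # last in-game day on which the goal is still useful
    done: bool = False


@dataclass
class GoalStore:
    goals: list = field(default_factory=list)

    def add(self, description, deadline_day):
        self.goals.append(Goal(description, deadline_day))

    def complete(self, description):
        # Mark every goal with a matching description as finished.
        for goal in self.goals:
            if goal.description == description:
                goal.done = True

    def active(self, today):
        """Unfinished goals whose deadline has not passed, most urgent first."""
        live = [g for g in self.goals if not g.done and g.deadline_day >= today]
        return sorted(live, key=lambda g: g.deadline_day)


store = GoalStore()
store.add("befriend a villager", deadline_day=56)
store.add("harvest parsnips", deadline_day=28)
store.add("clear the field", deadline_day=3)
store.complete("clear the field")

# On day 10 the store still surfaces both open goals, nearest deadline first.
print([g.description for g in store.active(today=10)])
```

A planner that consults `store.active(today)` at the start of each in-game day would be reminded of commitments made weeks earlier, which is the kind of persistence a model planning one day at a time tends to lose.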

Finally, StarDojo represents a shift toward ecological validity in AI evaluation. Benchmarks that mimic real-world complexity will become essential as AI moves from chat interfaces to autonomous action. Practitioners should expect more such environments—from cooking simulations to city management games—as the field matures.

Key Takeaways

StarDojo jointly evaluates production and social behaviors in an open-ended, persistent Stardew Valley environment, a combination that, per the abstract, existing benchmarks rarely test.
It reveals that current multimodal LLMs struggle with long-horizon planning, multi-objective trade-offs, and nuanced social reasoning.
For AI practitioners, the benchmark underscores the need for agent architectures with robust memory, goal management, and social modeling.
The trend toward ecologically valid, game-based benchmarks will likely accelerate as autonomous agents move toward real-world deployment.


Tags: arxiv, papers, agents, benchmark, multimodal