Research · 2026-06-18

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

arXiv:2606.18847v1. Abstract: To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while...

The Memory Gap in Embodied AI

A new arXiv preprint, "WorldLines," tackles a blind spot in AI research: how do we build and evaluate embodied agents that can operate coherently over long time horizons in real homes? The paper introduces a benchmark and modeling framework specifically for "stateful" agents, meaning systems that must remember not just conversation history but also the physical state of the world and their own past interactions with it.

What Happened

The researchers identified that existing long-term memory benchmarks for AI are overwhelmingly focused on language-centric tasks like retrieval-augmented question answering or dialogue history. These benchmarks test whether an agent can recall a fact from a conversation three hours ago, but they miss a fundamentally different challenge: remembering that the coffee machine was left on, that the user mentioned a repair appointment next Tuesday, or that the living room rug was moved yesterday. WorldLines fills this gap by creating a benchmark that requires agents to maintain and update a dynamic model of both user preferences and physical world states across extended, multi-session interactions.

Why It Matters

This work addresses a gap that holds back embodied AI deployment. Current systems like home robots or smart home assistants operate in near-stateless fashion: they respond to immediate queries or commands but lack persistent situational awareness. A robot that cannot remember it already checked the mail will check it again. An assistant that forgets you asked it to monitor the plant watering schedule cannot do that job. WorldLines formalizes this as a distinct research problem, separate from both short-term task completion and long-term language memory.

Three implications stand out. First, it shifts the evaluation question from "can you answer a question?" to "can you maintain a coherent world model across interruptions, time gaps, and changing contexts?" Second, it suggests that current large language models, even with long context windows, are poorly suited to this task: they lack built-in mechanisms for updating beliefs about physical states without explicit retraining or complex retrieval pipelines. Third, it suggests that future embodied agents will need hybrid architectures: neural memory for language, symbolic state tracking for physical objects, and probabilistic reasoning for user routines.
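To make the shift in evaluation concrete, here is a minimal Python sketch of scoring an agent on world-state accuracy across sessions rather than on a single answer. It is an illustration only, not the metric used in the paper: the `state_accuracy` function, the `run` helper, and the toy session data are all hypothetical names and values chosen for this example.

```python
# Hypothetical illustration: compare an agent that resets its beliefs every
# session with one that carries them over. Not the WorldLines metric.

def state_accuracy(believed: dict, actual: dict) -> float:
    """Fraction of true world-state entries the agent currently believes."""
    if not actual:
        return 1.0
    correct = sum(1 for key, value in actual.items() if believed.get(key) == value)
    return correct / len(actual)


# Each session: (changes the agent observed, true world state at session end).
sessions = [
    ({"coffee_machine": "on", "rug": "living room"},
     {"coffee_machine": "on", "rug": "living room"}),
    ({"rug": "hallway"},
     {"coffee_machine": "on", "rug": "hallway"}),
    ({"coffee_machine": "off"},
     {"coffee_machine": "off", "rug": "hallway"}),
]


def run(sessions, persistent: bool) -> list:
    """Score the agent after every session."""
    beliefs = {}
    scores = []
    for observed, actual in sessions:
        if not persistent:
            beliefs = {}          # a stateless agent forgets between sessions
        beliefs.update(observed)  # newer observations overwrite older beliefs
        scores.append(state_accuracy(beliefs, actual))
    return scores


stateless_scores = run(sessions, persistent=False)
stateful_scores = run(sessions, persistent=True)
print(stateless_scores)  # [1.0, 0.5, 0.5]
print(stateful_scores)   # [1.0, 1.0, 1.0]
```

The stateless agent scores well only in the session where it happens to observe everything. Its score drops as soon as part of the true state was set in an earlier session, which is the kind of failure a single-session evaluation cannot surface.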

Implications for AI Practitioners

For engineers building home robots, smart assistants, or any long-running autonomous system, WorldLines offers a concrete evaluation methodology. Practitioners should consider:

Memory architecture design: Separate short-term task memory from long-term world-state memory. A simple vector database for user preferences won't capture that the kitchen light is currently off.
State update mechanisms: Agents need explicit routines for belief revision. If the user says "I moved the vase," the agent must update its spatial model, not just log the conversation.
Benchmarking rigor: Current leaderboards for embodied AI focus on single-session tasks. WorldLines provides a template for evaluating multi-session competence, which is essential for real-world deployment.
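The first two points can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the architecture proposed in the paper: the `Belief` and `AgentMemory` classes, their fields, and the "newest timestamp wins" revision rule are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Belief:
    """One believed fact about the physical world (hypothetical structure)."""
    value: str
    updated_at: float  # time of the observation or user report
    source: str        # e.g. "observation" or "user_report"


@dataclass
class AgentMemory:
    # Language memory: an append-only log of what was said.
    dialogue_log: List[Tuple[float, str]] = field(default_factory=list)
    # World-state memory: current best guess, keyed by (entity, attribute).
    world_state: Dict[Tuple[str, str], Belief] = field(default_factory=dict)

    def log_utterance(self, t: float, text: str) -> None:
        self.dialogue_log.append((t, text))

    def revise(self, entity: str, attribute: str, value: str,
               t: float, source: str) -> bool:
        """Belief revision: accept the update unless a newer belief exists."""
        key = (entity, attribute)
        current = self.world_state.get(key)
        if current is not None and current.updated_at > t:
            return False  # stale information, keep the newer belief
        self.world_state[key] = Belief(value, t, source)
        return True

    def query(self, entity: str, attribute: str) -> Optional[str]:
        belief = self.world_state.get((entity, attribute))
        return belief.value if belief else None


memory = AgentMemory()
memory.revise("vase", "location", "hallway table", t=1.0, source="observation")

# The user reports a change: log the words AND update the spatial model.
memory.log_utterance(2.0, "I moved the vase to the bedroom.")
memory.revise("vase", "location", "bedroom", t=2.0, source="user_report")

memory.revise("kitchen_light", "power", "off", t=3.0, source="observation")
print(memory.query("vase", "location"))        # bedroom
print(memory.query("kitchen_light", "power"))  # off
```

Keeping the dialogue log and the world state in separate structures means a question like "where is the vase?" is answered from the current state, not by searching past conversation. A real system would also need to extract the update from the utterance and handle conflicting or uncertain reports, which this sketch leaves out.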

Key Takeaways

WorldLines introduces a benchmark and modeling framework for embodied agents that must maintain persistent memory of physical world states and user routines, not just conversation history.
Existing long-term memory evaluations are language-centric; this work targets the distinct challenge of stateful, real-world interaction over extended periods.
The research implies that current LLM-based agents are insufficient for long-horizon embodied tasks without specialized memory and state-tracking architectures.
For AI practitioners, the work highlights the need to separate language memory from physical state memory and to design explicit belief-update mechanisms for deployed agents.
