Research | 2026-07-02

EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards

Originally published by arXiv cs.AI

arXiv:2607.00218v1. Abstract: Vision-language models (VLMs) are now proposed as runtime safety guards for embodied agents in homes and factories. A deployable guard must catch genuinely unsafe situations while avoiding unnecessary intervention on routine but superficially...

The Blind Spot in Embodied AI Safety

EgoSafetyBench, a new benchmark, addresses a gap in how vision-language models (VLMs) are evaluated as runtime safety monitors for physical robots. The core problem is simple but dangerous: current VLMs are trained mostly on internet-scale data, which looks very different from the first-person, egocentric video streams a robot produces while navigating a home or factory floor. A VLM that captions a photo of a kitchen perfectly may still fail to recognize that a robot arm is about to crush a child’s hand, because the visual perspective, motion blur, and contextual cues all differ.

EgoSafetyBench provides a diagnostic dataset of egocentric video clips, each labeled for whether the situation is genuinely unsafe (e.g., a person stumbling near machinery) or merely a routine but superficially concerning action (e.g., a person reaching for a tool). This distinction is the crux of the safety challenge. A guard that over-intervenes—halting the robot for every human movement—becomes unusable. A guard that under-intervenes fails at its primary job.
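The two failure modes described above can be tracked as two separate rates. The following is a minimal sketch, not the benchmark's actual scoring code: `ClipResult`, `guard_rates`, and the field names are hypothetical, and the assumption is simply that each clip carries a binary ground-truth label and a binary guard decision.

```python
from dataclasses import dataclass


@dataclass
class ClipResult:
    is_unsafe: bool   # hypothetical ground-truth label: genuinely unsafe or benign look-alike
    intervened: bool  # hypothetical guard decision: did it halt the robot?


def guard_rates(results):
    """Return (miss_rate, over_intervention_rate) for a list of ClipResult."""
    unsafe = [r for r in results if r.is_unsafe]
    benign = [r for r in results if not r.is_unsafe]
    # Under-intervention: unsafe clips the guard let through.
    missed = sum(1 for r in unsafe if not r.intervened)
    # Over-intervention: benign clips on which the guard halted the robot.
    false_alarms = sum(1 for r in benign if r.intervened)
    miss_rate = missed / len(unsafe) if unsafe else 0.0
    over_rate = false_alarms / len(benign) if benign else 0.0
    return miss_rate, over_rate


# A guard that halts on every clip never misses, but always over-intervenes.
always_stop = [
    ClipResult(is_unsafe=True, intervened=True),
    ClipResult(is_unsafe=False, intervened=True),
    ClipResult(is_unsafe=False, intervened=True),
]
print(guard_rates(always_stop))  # (0.0, 1.0)
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a trivially cautious guard from looking good.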

Why This Matters for Deployment

The stakes for embodied AI safety are uniquely high. Unlike a chatbot, whose worst output is harmful text, a robot with a faulty safety guard can cause physical injury or property damage. The industry has largely focused on pre-deployment safety (training data curation, red-teaming) and post-hoc monitoring (logging failures). EgoSafetyBench targets the runtime layer: the split-second decisions a VLM must make while the robot is acting.

This benchmark reveals a fundamental limitation of current VLMs: they lack robust temporal and spatial reasoning in egocentric contexts. A model might correctly identify a knife as dangerous in a static image, but fail to recognize that a person’s hand moving toward the knife at high speed requires immediate intervention. The benchmark forces models to reason about sequences and proximity, not just object presence.

Implications for AI Practitioners

For teams building embodied systems, this work is a wake-up call. First, it highlights that a VLM’s performance on standard benchmarks (e.g., VQA, captioning) is a poor predictor of its utility as a runtime safety guard. Practitioners should treat EgoSafetyBench as a mandatory evaluation step before deploying any VLM-based guard in a physical environment.

Second, the benchmark’s structure suggests that fine-tuning on egocentric data is necessary but not sufficient. The model must learn to distinguish between dangerous and merely unusual actions—a nuanced judgment that requires understanding human intent and typical workflows. This may require new training paradigms, such as contrastive learning between safe and unsafe egocentric sequences.
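One common form of contrastive objective is a triplet margin loss, sketched below under stated assumptions. It is an illustration, not a training recipe from the paper. It assumes some video encoder has already mapped each egocentric clip to an embedding vector; `triplet_loss` and the example vectors are hypothetical. The anchor and positive come from the same class (for example, two safe clips) and the negative from the opposite class.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is; positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings: a safe anchor, a nearby safe clip, a distant unsafe clip.
safe_anchor = [0.0, 0.0]
safe_other = [0.0, 1.0]
unsafe_clip = [3.0, 4.0]

print(triplet_loss(safe_anchor, safe_other, unsafe_clip))  # 0.0 (already well separated)
print(triplet_loss(safe_anchor, unsafe_clip, safe_other))  # 5.0 (roles swapped, so penalized)
```

The loss only pushes embeddings apart when safe and unsafe sequences sit too close together, which matches the stated goal of separating dangerous actions from merely unusual ones.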

Third, the work implies a need for hierarchical safety architectures. A VLM guard should not be the sole safety layer; it should be complemented by low-level hardware limiters (e.g., torque limits, speed governors) and high-level task planners. EgoSafetyBench provides a way to measure where the VLM layer adds value versus where it introduces new failure modes.
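The layering described above can be sketched as a single control step. This is a minimal illustration under assumptions, not an architecture from the paper: `Command`, `safe_step`, the limit values, and the units are hypothetical, and the VLM guard is stubbed as a callable that returns `True` when it judges the scene unsafe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Command:
    joint_speed: float  # hypothetical units: rad/s
    torque: float       # hypothetical units: N*m


MAX_SPEED = 1.0   # assumed hardware speed limit
MAX_TORQUE = 5.0  # assumed hardware torque limit
STOP = Command(joint_speed=0.0, torque=0.0)


def clamp(value: float, limit: float) -> float:
    """Restrict value to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def hardware_limiter(cmd: Command) -> Command:
    """Lowest layer: enforced regardless of what the upper layers decide."""
    return Command(clamp(cmd.joint_speed, MAX_SPEED), clamp(cmd.torque, MAX_TORQUE))


def safe_step(planned: Command, guard_says_unsafe: Callable[[], bool]) -> Command:
    """The planner proposes, the VLM guard may veto, hardware limits always apply."""
    cmd = STOP if guard_says_unsafe() else planned
    return hardware_limiter(cmd)


# The guard allows the motion, but the hardware layer still clamps it.
print(safe_step(Command(3.0, 9.0), lambda: False))
# The guard vetoes, so the robot stops.
print(safe_step(Command(3.0, 9.0), lambda: True))
```

Because the limiter runs after the guard, a missed detection by the VLM still results in a bounded command, which is the point of not making the guard the sole safety layer.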

Key Takeaways

EgoSafetyBench fills a critical evaluation gap by testing VLMs on egocentric video, which is fundamentally different from the third-person data they are typically trained on.
Runtime safety guards must avoid both false positives and false negatives, and current VLMs struggle with this balance in dynamic, first-person contexts.
Practitioners should not trust standard VLM benchmarks as proxies for embodied safety performance; dedicated egocentric safety evaluation is essential.
A multi-layered safety architecture (hardware limits + VLM guard + task planner) is likely necessary, and EgoSafetyBench can help quantify the contribution of the VLM layer.

