BeClaude
Research | 2026-05-12

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Source: arXiv cs.AI

arXiv:2605.10106v1 (announce type: cross)

Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this...

arxiv, papers, reasoning, agents