Research 2026-07-03

NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

Originally published by Arxiv CS.AI

arXiv:2607.01962v1 (cross-listed). Abstract: We study the challenging problem of novel view video synthesis from single images or monocular videos. Existing methods, which operate under the assumption that pre-trained video models lack native novel view synthesis capability and enforce view...

What Happened

Researchers have introduced NeoMap, a training-free framework for synthesizing novel views from single images or monocular videos. The core insight is that existing pre-trained video diffusion models already possess latent capabilities for novel view synthesis, but these capabilities remain dormant without proper conditioning. NeoMap unlocks this ability by mapping input images into a geometry-aware latent space that guides the video model’s generation process, bypassing the need for explicit 3D reconstruction or fine-tuning on multi-view datasets.

The method operates by extracting depth-aware features from the input image or video, then using these features to condition a pre-trained video diffusion model during inference. This allows the model to generate consistent novel views without requiring any additional training data or model parameter updates. The approach works for both static scenes from a single image and dynamic scenes from monocular video, producing temporally coherent video outputs.
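The summary above does not say how NeoMap computes or injects its depth-aware features, so the following is not the paper's method. It is only a minimal sketch, in Python with NumPy, of the standard geometry that makes depth useful for view change: with a per-pixel depth map and a pinhole camera, each source pixel can be lifted to 3D, moved into a target camera, and projected back. The function name `reproject_pixels`, the shared intrinsics matrix `K`, and the assumption of strictly positive depth are all illustrative choices.

```python
import numpy as np


def reproject_pixels(depth, K, R, t):
    """Map every source pixel to its location in a target camera.

    Illustrative only; not taken from the NeoMap paper.

    depth: (H, W) depth along the source camera's z-axis, assumed > 0.
    K:     (3, 3) pinhole intrinsics, assumed identical for both views.
    R, t:  (3, 3) rotation and (3,) translation taking source-camera
           coordinates to target-camera coordinates.
    Returns (uv, z): (H, W, 2) target pixel coordinates and (H, W) target depth.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Unproject: X_src = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_src = rays * depth[..., None]

    # Rigid motion into the target camera frame.
    pts_tgt = pts_src @ R.T + t

    # Project back to pixels: [u', v', 1]^T ~ K X_tgt
    proj = pts_tgt @ K.T
    z = proj[..., 2]
    uv = proj[..., :2] / z[..., None]
    return uv, z
```

For a pure sideways camera shift, nearer pixels move further than distant ones (parallax). A warped image built from such coordinates has holes wherever the new view sees surfaces hidden in the original, which is the kind of gap a generative video model would have to fill.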

Why It Matters

This development addresses a fundamental bottleneck in 3D vision and content creation: generating plausible new viewpoints from limited visual input. Traditional approaches either require expensive 3D reconstruction pipelines (NeRF, Gaussian Splatting) or extensive multi-view training data. NeoMap’s training-free paradigm offers several significant advantages:

First, it removes the training cost. Practitioners can use existing video models without the GPU-hours typically required for fine-tuning or training specialized architectures, although inference itself still has to be paid for. Second, because the base model is left unmodified, the approach should in principle carry over to other pre-trained video diffusion models, meaning improvements in base models can translate into better novel view synthesis. Third, the method handles both static and dynamic scenes, bridging a gap that often required separate specialized systems.

For industries like virtual production, e-commerce visualization, and robotics simulation, this could lower the barrier to generating 3D-consistent content. A single product photo could yield multiple viewing angles; a short drone video could generate fly-around perspectives.

Implications for AI Practitioners

Immediate applicability: Teams working with existing video diffusion models (Stable Video Diffusion, Sora-like models) can likely integrate NeoMap’s conditioning approach without retraining. The method’s architecture-agnostic design suggests it could be adapted to newer models as they emerge.

Trade-offs to consider: Training-free methods typically sacrifice some quality compared to specialized trained models. Practitioners should evaluate whether the convenience outweighs potential artifacts in their use case. The method also inherits biases and limitations from the underlying video model, including temporal consistency issues and object hallucination.

Research direction: This work supports the hypothesis that large generative models encode geometric priors more richly than previously assumed. Practitioners exploring model capabilities should consider probing pre-trained models for latent skills before building specialized training pipelines.

Deployment considerations: For real-time applications, the inference-time cost of depth estimation and latent mapping may still be non-trivial. Optimization for edge devices or latency-sensitive workflows remains an open challenge.
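One way to check the deployment concern in practice is to time each inference stage separately and see where the latency goes. The sketch below uses only the Python standard library. The helper names `time_stage` and `profile_pipeline`, and the stage names in the usage note, are hypothetical; nothing here comes from NeoMap's code.

```python
import time
from statistics import median


def time_stage(fn, repeats=5):
    """Return the median wall-clock seconds of calling fn() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def profile_pipeline(stages, repeats=5):
    """Time each stage and report its share of the summed latency.

    stages: dict mapping a stage name to a zero-argument callable.
    Returns a dict mapping each name to (seconds, fraction_of_total).
    """
    timings = {name: time_stage(fn, repeats) for name, fn in stages.items()}
    total = sum(timings.values())
    return {
        name: (secs, secs / total if total > 0 else 0.0)
        for name, secs in timings.items()
    }
```

A call might look like `profile_pipeline({"depth_estimation": run_depth, "diffusion_sampling": run_sampler})`, where both callables are placeholders for a team's own code. GPU work is often launched asynchronously, so each callable should block until its result is ready (in PyTorch, for example, by calling `torch.cuda.synchronize()` before returning), otherwise the measured times will be too short.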

Key Takeaways

NeoMap achieves novel view synthesis from single images or monocular video without any training, using only inference-time conditioning of pre-trained video models
The method reduces computational requirements dramatically compared to NeRF or multi-view training approaches, making 3D-consistent generation more accessible
Practitioners can likely integrate this approach with existing video diffusion pipelines without retraining, though quality trade-offs exist versus specialized models
The work demonstrates that pre-trained video models harbor latent 3D understanding, encouraging further exploration of emergent capabilities in foundation models
