🏛️Pantheon360:
Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
CVPR 2026
- 1University of Southern California
- 2National Yang Ming Chiao Tung University
- 3Cornell University
- 4Bosch Research
*Equal contribution †Equal advising
Abstract
Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial–temporal consistency. These constraints remain challenging for perspective video generators because of their limited field of view (FoV): a narrow FoV forces long or multi-view trajectories, which amplify cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution, since panoramic coverage simplifies trajectory design and provides strong global context for maintaining coherence. We introduce Pantheon360, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Beyond single-image generation, Pantheon360 is the first video diffusion model to support 360° interpolation, enabling seamless chaining of video segments to produce extended, coherent long-form videos. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.
Motivation
Motivation for Using 360° Images for Generation. Left: When traversing to the back of the room, 360° anchor frames provide complete scene context, enabling accurate generation of occluded regions. In contrast, perspective anchor frames have a limited field of view and must hallucinate unseen areas, leading to significant artifacts. Right: Generating 360° outputs in a single pass ensures global coherence and cross-view consistency. Our method maintains consistent object structures (red boxes highlight the same door/cabinet viewed from different angles), while GEN3C's perspective-based generation produces geometrically inconsistent results across views.
Our proposed framework
Given sparse 360° input frames, we first crop them into perspective views and reconstruct a 3D point cloud cache using foundation models (e.g., PI3, VGGT). We then render this cache along the target camera trajectory to produce a geometry-only equirectangular video V_geo, which is encoded into latent space and concatenated with noised latents for geometric conditioning. Simultaneously, CLIP features extracted from 8 perspective crops of the first frame provide semantic conditioning via cross-attention. Our fine-tuned video diffusion model leverages both geometric and semantic conditions to generate temporally consistent, photorealistic 360° videos with precise trajectory control. For interpolation tasks, we employ dual-anchor latent fusion to blend information from both start and end frames, ensuring smooth transitions between distant viewpoints.
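The first step above, cropping an equirectangular frame into perspective views, can be sketched as follows. This is an illustrative sketch, not the project's code: the function names `perspective_crop` and `eight_view_crops`, the 90° field of view, the nearest-neighbour lookup, and the 45° yaw spacing for the eight crops are all assumptions made for the example.

```python
import math

def perspective_crop(pano, yaw_deg, pitch_deg=0.0, fov_deg=90.0, size=64):
    """Sample a square pinhole view from an equirectangular image.

    pano is a list of rows (H x W). Nearest-neighbour lookup keeps the sketch short.
    Camera frame: +z forward, +x right, +y up.
    """
    height, width = len(pano), len(pano[0])
    focal = 0.5 * size / math.tan(math.radians(fov_deg) / 2.0)
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    crop = []
    for row in range(size):
        out_row = []
        for col in range(size):
            # Ray through the pixel centre, expressed in the camera frame.
            x = col + 0.5 - size / 2.0
            y = -(row + 0.5 - size / 2.0)
            z = focal
            # Pitch about the x-axis (positive looks up), then yaw about the y-axis.
            y, z = (y * math.cos(pitch) + z * math.sin(pitch),
                    -y * math.sin(pitch) + z * math.cos(pitch))
            x, z = (x * math.cos(yaw) + z * math.sin(yaw),
                    -x * math.sin(yaw) + z * math.cos(yaw))
            norm = math.sqrt(x * x + y * y + z * z)
            lon = math.atan2(x, z)                              # longitude in [-pi, pi]
            lat = math.asin(max(-1.0, min(1.0, y / norm)))      # latitude in [-pi/2, pi/2]
            # Map spherical angles to equirectangular pixel indices.
            u = int((lon / (2.0 * math.pi) + 0.5) * width) % width
            v = min(height - 1, max(0, int((0.5 - lat / math.pi) * height)))
            out_row.append(pano[v][u])
        crop.append(out_row)
    return crop

def eight_view_crops(pano, fov_deg=90.0, size=64):
    """Eight crops at 45-degree yaw steps (the spacing is an assumption)."""
    return [perspective_crop(pano, yaw_deg=45.0 * k, fov_deg=fov_deg, size=size)
            for k in range(8)]
```

With a 90° field of view and 45° steps, neighbouring crops overlap by half their width, which is one plausible way to give a reconstruction or feature model shared content between views. The overlap actually used by the method is not stated on this page.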
Novel View Synthesis on Google Maps Street View
Our method produces geometrically accurate renderings across different viewing angles with consistent structures. GEN3C suffers from ghosting artifacts, geometric distortions, and inter-view inconsistencies.
3D Point Cloud Reconstruction Quality
We reconstruct 3D point clouds from generated videos using PI3. Our method yields dense, structurally coherent reconstructions (right), while GEN3C produces sparse, fragmented results (left), demonstrating our superior 3D consistency.
Two-View 360° Interpolation
User-defined camera trajectories between sparse 360° anchors
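As a rough illustration of what a user-defined trajectory between two anchors could look like, the sketch below produces a straight-line path with shortest-arc yaw interpolation. The pose layout `(x, y, z, yaw_deg)` and the function name `interpolate_path` are hypothetical; the page does not specify how trajectories are parameterized.

```python
def interpolate_path(start, end, num_frames):
    """Return num_frames poses from start to end, inclusive.

    Each pose is (x, y, z, yaw_deg). Positions are interpolated linearly and
    yaw takes the shortest way around the circle.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    # Signed shortest yaw difference in [-180, 180).
    dyaw = (end[3] - start[3] + 180.0) % 360.0 - 180.0
    path = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        position = tuple(start[k] + t * (end[k] - start[k]) for k in range(3))
        yaw = (start[3] + t * dyaw) % 360.0
        path.append(position + (yaw,))
    return path

# Example: five poses between two anchors whose headings straddle 0 degrees.
path = interpolate_path((0.0, 0.0, 0.0, 350.0), (2.0, 0.0, 4.0, 10.0), 5)
```

Any other curve (a spline, an arc around an object) could be substituted, since the path only has to start at one anchor pose and end at the other.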
Extended Trajectory Generation
Chain multiple panoramas with precise camera path control
Single-View Camera Exploration
Navigate scenes with user-defined camera motion
Stabilization via Trajectory Replanning
Redefine smooth camera paths from shaky 360° footage
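One simple way to obtain a smooth target path from shaky camera positions is a moving-average filter, sketched below. This is an assumption for illustration only: the page does not describe how the replanned path is computed, and `smooth_path` and its `radius` parameter are hypothetical names.

```python
def smooth_path(positions, radius=2):
    """Moving-average filter over (x, y, z) samples.

    The averaging window is clamped at both ends of the sequence, so the
    output has the same length as the input.
    """
    smoothed = []
    for i in range(len(positions)):
        lo = max(0, i - radius)
        hi = min(len(positions), i + radius + 1)
        window = positions[lo:hi]
        smoothed.append(tuple(sum(p[k] for p in window) / len(window)
                              for k in range(3)))
    return smoothed

# Example: forward motion along x with alternating vertical jitter.
shaky = [(float(i), 0.3 if i % 2 else -0.3, 0.0) for i in range(10)]
steady = smooth_path(shaky, radius=2)
```

The smoothed positions would then serve as the target trajectory along which the video is regenerated.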
Qualitative Comparisons
Visual comparison with state-of-the-art perspective video generation methods across different views