Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion (CVPR 2026)

¹University of Southern California
²National Yang Ming Chiao Tung University
³Cornell University
⁴Bosch Research

*Equal contribution †Equal advising

Abstract

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial–temporal consistency. These constraints remain challenging for perspective video generators because of their limited field of view (FoV): a narrow FoV forces long or multi-view trajectories, which amplify cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution, since panoramic coverage simplifies trajectory design and provides strong global context for maintaining coherence. We introduce Pantheon360, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Beyond single-image generation, Pantheon360 is the first video diffusion model to support 360° interpolation, enabling seamless chaining of video segments into extended, coherent long-form videos. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

Motivation

Motivation for Using 360° Images for Generation. Left: When traversing to the back of the room, 360° anchor frames provide complete scene context, enabling accurate generation of occluded regions. In contrast, perspective anchor frames have a limited field of view and must hallucinate unseen areas, leading to significant artifacts. Right: Generating 360° outputs in a single pass ensures global coherence and cross-view consistency. Our method maintains consistent object structures (red boxes highlight the same door/cabinet viewed from different angles), while GEN3C's perspective-based generation produces geometrically inconsistent results across views.

Our proposed framework

Given sparse 360° input frames, we first crop them into perspective views and reconstruct a 3D point cloud cache using foundation models (e.g., PI3, VGGT). We then render this cache along the target camera trajectory to produce a geometry-only equirectangular video V_geo, which is encoded into latent space and concatenated with the noised latents for geometric conditioning. Simultaneously, CLIP features extracted from 8 perspective crops of the first frame provide semantic conditioning via cross-attention. Our fine-tuned video diffusion model leverages both the geometric and the semantic conditions to generate temporally consistent, photorealistic 360° videos with precise trajectory control. For interpolation tasks, we employ dual-anchor latent fusion to blend information from both the start and end frames, ensuring smooth transitions between distant viewpoints.
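To make the cache-rendering step concrete, the following is a minimal NumPy sketch of how a colored point cloud could be splatted into one geometry-only equirectangular frame with a nearest-point depth test. It is an illustration under stated assumptions, not the authors' renderer: the function name `render_equirect`, the camera convention (+x right, +y up, +z forward), the one-pixel splat size, and the default resolution are all hypothetical choices made for this example.

```python
import numpy as np

def render_equirect(points, colors, cam_pos, world_to_cam, height=256, width=512):
    """Splat a colored point cloud into one equirectangular frame.

    points:       (N, 3) world-space positions
    colors:       (N, 3) uint8 RGB values
    cam_pos:      (3,)   camera center in world space
    world_to_cam: (3, 3) rotation from world to camera axes
    Assumed camera convention: +x right, +y up, +z forward.
    Returns (image, depth); pixels hit by no point stay black with depth = inf.
    """
    # Move points into the camera frame and compute their distance to the camera.
    p = (points - cam_pos) @ world_to_cam.T
    r = np.linalg.norm(p, axis=1)
    keep = r > 1e-6                      # drop points sitting on the camera center
    p, r, c = p[keep], r[keep], colors[keep]

    # Spherical angles: longitude 0 looks forward, latitude +pi/2 looks up.
    lon = np.arctan2(p[:, 0], p[:, 2])
    lat = np.arcsin(np.clip(p[:, 1] / r, -1.0, 1.0))

    # Equirectangular mapping: longitude -> column, latitude -> row.
    u = ((lon / (2 * np.pi) + 0.5) * width).astype(int) % width
    v = np.clip(((0.5 - lat / np.pi) * height).astype(int), 0, height - 1)
    flat = v * width + u

    # Depth test: sort by pixel, then by distance, and keep the first
    # (nearest) point that lands on each pixel.
    order = np.lexsort((r, flat))
    pix, first = np.unique(flat[order], return_index=True)

    image = np.zeros((height * width, 3), dtype=np.uint8)
    depth = np.full(height * width, np.inf)
    image[pix] = c[order][first]
    depth[pix] = r[order][first]
    return image.reshape(height, width, 3), depth.reshape(height, width)
```

Calling this function once per pose along a target trajectory would yield a sequence of frames playing the role of V_geo. Holes where no point projects are left empty, which is consistent with the description above of the diffusion model handling texture refinement while the cache supplies geometry.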

Novel View Synthesis on Google Maps Street View

Our method produces geometrically accurate renderings across different viewing angles with consistent structures. GEN3C suffers from ghosting artifacts, geometric distortions, and inter-view inconsistencies.

3D Point Cloud Reconstruction Quality

We reconstruct 3D point clouds from generated videos using PI3. Our method yields dense, structurally coherent reconstructions (right), while GEN3C produces sparse, fragmented results (left), demonstrating our superior 3D consistency.

Two-View 360° Interpolation

User-defined camera trajectories between sparse 360° anchors

Extended Trajectory Generation

Chain multiple panoramas with precise camera path control

Single-View Camera Exploration

Navigate scenes with user-defined camera motion

Stabilization via Trajectory Replanning

Redefine smooth camera paths from shaky 360° footage
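The page does not describe how the smooth path is chosen. As one simple illustration of trajectory replanning, camera centers estimated from shaky footage could be low-pass filtered before the video is regenerated along the new path. The sketch below Gaussian-smooths camera positions only; the function name `smooth_path` and the `sigma` parameter are hypothetical, and camera rotations would need separate handling.

```python
import numpy as np

def smooth_path(positions, sigma=2.0):
    """Gaussian-smooth a (T, D) array of camera centers along the time axis."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()               # weights sum to 1, so constant paths are unchanged

    # Edge-pad so the output keeps length T and the endpoints are not pulled toward zero.
    padded = np.pad(positions, ((radius, radius), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid")
         for k in range(positions.shape[1])],
        axis=1,
    )
```

A larger `sigma` removes more jitter but also rounds off intentional turns, so in practice the value would be tuned to the footage.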

Qualitative Comparisons

Visual comparison with state-of-the-art perspective video generation methods across different views

Citation

@inproceedings{chen2026pantheon360,
    title={Pantheon360: Taming Digital Twin Generation via 3D-Aware 360 Video Diffusion},
    author={Ting-Hsuan Chen and Ying-Huan Chen and Tao Tu and Jie-Ying Lee and Cho-Ying Wu and Fangzhou Lin and Hengyuan Zhang and David Paz and Xinyu Huang and Yuliang Guo and Yu-Lun Liu and Yue Wang and Liu Ren},
    booktitle={CVPR},
    year={2026}
}