NaRCan:
Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
NeurIPS 2024

Abstract

We propose a video editing framework, NaRCan, which integrates a hybrid deformation field and a diffusion prior to generate high-quality natural canonical images that represent the input video. Our approach uses homography to model global motion and multi-layer perceptrons (MLPs) to capture local residual deformations, improving the model's ability to handle complex video dynamics. By introducing a diffusion prior from the early stages of training, our model ensures that the generated images retain a high-quality natural appearance, making the produced canonical images suitable for various downstream video editing tasks, a capability current canonical-based methods do not achieve. Furthermore, we incorporate low-rank adaptation (LoRA) fine-tuning and introduce a noise and diffusion prior update scheduling technique that accelerates training by a factor of 14. Extensive experiments show that our method outperforms existing approaches on various video editing tasks and produces coherent, high-quality edited video sequences.
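As a concrete illustration of the hybrid deformation field, below is a minimal PyTorch sketch: a per-frame homography models global motion, and a small MLP adds local residual deformations. The module names, layer sizes, and coordinate conventions are our own illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),   # input: (x, y, t)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),              # output: (dx, dy) local residual
        )

    def forward(self, xyt):
        return self.net(xyt)

def deform(points, t, homography, residual_mlp):
    """Map frame-t pixel coordinates (N, 2) to canonical coordinates (N, 2).

    homography: a (3, 3) matrix estimated for frame t (global motion).
    """
    ones = torch.ones_like(points[:, :1])
    homog = torch.cat([points, ones], dim=1)    # homogeneous coordinates
    warped = (homography @ homog.T).T
    warped = warped[:, :2] / warped[:, 2:3]     # dehomogenize
    xyt = torch.cat([points, torch.full_like(ones, float(t))], dim=1)
    return warped + residual_mlp(xyt)           # add local non-rigid residual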

Video


NaRCan = Natural Refined Canonical Image with Diffusion Prior


Video representation with diffusion prior. Given an RGB video, we can represent it with a canonical image. However, a training process that optimizes only reconstruction quality can produce an unnatural canonical image, which causes problems for downstream tasks such as prompt-based video editing. In the bottom example, if the bear is distorted in the canonical image, an image editor such as ControlNet may not recognize it and could introduce an irrelevant object instead. In this paper, we propose injecting the diffusion prior from a LoRA fine-tuned diffusion model into the training pipeline to constrain the canonical image to be natural. Our method facilitates several downstream tasks, such as (a) video editing, (b) dynamic segmentation, and (c) video style transfer.


Our proposed framework

Given an input video sequence, our method represents the video with a natural canonical image, a crucial representation for versatile downstream applications. (a) First, we fine-tune the LoRA weights of a pre-trained latent diffusion model on the input frames. (b) Second, we represent the video with a canonical MLP and a deformation field, which consists of homography estimation for global motion and a residual deformation MLP for non-rigid residual deformations. Relying entirely on the reconstruction loss, the canonical MLP often fails to produce a natural canonical image, causing problems for downstream applications; for example, image-to-image translation methods such as ControlNet may not recognize that there is a train in the canonical image. (c) Therefore, we leverage the fine-tuned latent diffusion model to regularize and correct the unnatural canonical image into a natural one. Specifically, we carefully design a noise schedule that corresponds to the frame reconstruction process. (d) The natural, artifact-free canonical image can then serve various downstream tasks such as video style transfer, dynamic segmentation, and editing, such as adding handwritten characters.
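To make step (b) concrete, here is a minimal sketch of one reconstruction step, reusing the deform helper sketched earlier. Positional encodings and batching details are omitted, and the canonical_mlp interface (canonical (x, y) in, RGB out) is an assumption for illustration.

import torch.nn.functional as F

def reconstruction_step(frame_rgb, coords, t, homography, residual_mlp, canonical_mlp):
    """frame_rgb: (N, 3) ground-truth colors at the sampled pixel coords."""
    canon_xy = deform(coords, t, homography, residual_mlp)  # frame -> canonical
    pred_rgb = canonical_mlp(canon_xy)                      # query canonical image
    return F.mse_loss(pred_rgb, frame_rgb)                  # reconstruction loss

During training, the diffusion-prior term from step (c) would be added to this loss on the schedule described below.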


Diffusion prior for canonical image refinement

(Left) Without a diffusion prior regularizing the canonical image, the training process relies only on frame reconstruction and can sacrifice the faithfulness of the canonical image. (Right) Our fine-tuned diffusion prior effectively corrects the canonical image to faithfully represent the input frames, resulting in natural canonical images.
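One simple way to realize this regularization is sketched below under our own assumptions: perturb the current canonical image with noise, let the LoRA fine-tuned model denoise it, and pull the canonical image toward the refined result. The denoise function is a stand-in for the fine-tuned latent diffusion model; the paper's exact loss formulation may differ.

import torch
import torch.nn.functional as F

def diffusion_prior_loss(canonical_img, denoise, noise_strength):
    """canonical_img: (1, 3, H, W) current render of the canonical MLP."""
    noisy = canonical_img + noise_strength * torch.randn_like(canonical_img)
    with torch.no_grad():
        refined = denoise(noisy, noise_strength)  # natural-image target from the prior
    return F.mse_loss(canonical_img, refined)     # pull the canonical toward it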


Noise and diffusion prior update scheduling


Early in training, before the deformation fields converge and before the diffusion prior is introduced, the model fits only object outlines, so complex non-rigid objects leave unnatural elements in the canonical image. Once the diffusion prior is introduced with high noise strength and a high update frequency, the model learns to generate natural, high-quality images and converges; the noise strength and update frequency then decrease accordingly. Notably, this update scheduling cuts training time from 4.8 hours to 20 minutes.
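An illustrative schedule is sketched below; the warm-up length, decay shape, and constants are assumptions for illustration, not the paper's exact values.

def prior_schedule(step, start=500, total=5000):
    """Return (noise_strength, update_every) for a given training step."""
    if step < start:                     # warm-up: fit outlines without the prior
        return 0.0, None
    progress = (step - start) / (total - start)
    noise_strength = 0.8 * (1 - progress) + 0.1 * progress  # decays 0.8 -> 0.1
    update_every = int(1 + 9 * progress)                    # every step -> every 10
    return noise_strength, update_every

In the training loop, the diffusion prior loss would then be applied only on steps where step % update_every == 0, with the returned noise strength.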


Text-guided video-to-video translation

Panels: edited video, canonical image, edited canonical image.

Text prompt: A Van Gogh-style cartoon bear walking in the forest.
Baseline method (left) vs. NaRCan (right).
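Conceptually, editing reduces to editing the canonical image once and re-rendering every frame through the learned deformation. Below is a hedged sketch, assuming the deformation field has been baked into per-frame canonical-space sampling grids; edit_image is a placeholder for any image-to-image editor such as ControlNet.

import torch
import torch.nn.functional as F

def edit_video(canonical_img, grids, edit_image, prompt):
    """canonical_img: (1, 3, H, W); grids: per-frame (1, H, W, 2) sampling
    grids holding each pixel's canonical coordinates in [-1, 1]."""
    edited = edit_image(canonical_img, prompt)  # edit once, in canonical space
    frames = [F.grid_sample(edited, g, align_corners=True) for g in grids]
    return torch.cat(frames, dim=0)             # (T, 3, H, W) edited video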


Adding handwritten characters

Panels: edited video, canonical image, edited canonical image.

Baseline method (left) vs. NaRCan (right).


Segmentation-based tracking

Panels: edited video, canonical image, segmentation mask.

Baseline method (left) vs. NaRCan (right). The segmentation mask is obtained with the Segment Anything Model (SAM) on the learned canonical image and propagated to the whole sequence.
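A hedged sketch of this propagation, under the same sampling-grid assumption as the editing sketch above: SAM produces a mask on the canonical image (outside this snippet), and each frame looks the mask up at its pixels' canonical coordinates.

import torch
import torch.nn.functional as F

def propagate_mask(canonical_mask, grids):
    """canonical_mask: (1, 1, H, W) binary mask from SAM on the canonical
    image; grids: per-frame (1, H, W, 2) canonical-coordinate grids."""
    masks = [F.grid_sample(canonical_mask.float(), g,
                           mode='nearest', align_corners=True) > 0.5
             for g in grids]
    return torch.cat(masks, dim=0)              # (T, 1, H, W) per-frame masks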


User study

Our method achieves the highest user preference ratio across all three aspects, compared with MeDM, CoDeF, and Hashing-nvd.


Citation

Acknowledgements

This research was funded by the National Science and Technology Council, Taiwan, under Grant NSTC 112-2222-E-A49-004-MY2. The authors are grateful to Google, NVIDIA, and MediaTek Inc. for their generous donations. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.

The website template was borrowed from Michaël Gharbi and Ref-NeRF.