Explorative Inbetweening of Time and Space

Enabling bounded generation with a pre-trained image-to-video model,
without any tuning or optimization


1Max Planck Institute for Intelligent Systems, Tuebingen, Germany
2Adobe   3University of California San Diego

Bounded generation in three scenarios: 1) Generating subject motion when the two bound images capture a moving subject. 2) Synthesizing camera motion from two images of a static scene taken from different viewpoints. 3) Producing a video loop by using the same image for both bounds. We propose a new sampling strategy, called Time Reversal Fusion, that preserves the inherent generalization of an image-to-video model while steering the video generation toward an exact ending frame.



Identical Bound (Video Looping, Cinemagraph)

Top row: Two input frames. Bottom row: Our generated results.

View Bounds (Camera Motion Generation)

Top row: Two input frames. Bottom row: Our generated results.

Dynamic Bounds (Subject / Scene Motion Generation)

Top row: Two input frames. Bottom row: Our generated results.

Abstract

We introduce bounded generation as a generalized task that controls video generation to synthesize arbitrary camera and subject motion given only a start and an end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. We achieve this with a new sampling strategy, which we call Time Reversal Fusion, that fuses the temporally forward and backward denoising paths conditioned on the start and end frame, respectively. The fused path yields a video that smoothly connects the two frames, producing faithful inbetweening of subject motion, novel views of static scenes, and seamless video looping when the two bounding frames are identical. We curate a diverse evaluation dataset of image pairs and compare against the closest existing methods. We find that Time Reversal Fusion outperforms related work on all subtasks, exhibiting the ability to generate complex motions and 3D-consistent views guided by bounded frames.

Method
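The fusion idea from the abstract can be illustrated with a toy sketch. This is a hypothetical, simplified rendition and not the actual implementation: the real method operates on the latents of a pre-trained image-to-video diffusion model, and `denoise_fn`, the uniform 0.5 fusion weight, and the array shapes below are all illustrative assumptions. The core structure is the same, though: at each denoising step, one path denoises forward conditioned on the start frame, a second path denoises the time-reversed latents conditioned on the end frame, and the two paths are fused.

```python
import numpy as np

def time_reversal_fusion(denoise_fn, z, cond_start, cond_end, timesteps):
    """Toy sketch of Time Reversal Fusion (illustrative, not the paper's code).

    z          : noisy video latents of shape (T, ...), time along axis 0
    denoise_fn : hypothetical one-step denoiser taking (latents, condition, t)
    cond_start : conditioning derived from the start frame
    cond_end   : conditioning derived from the end frame
    """
    for t in timesteps:
        # Forward path: denoise as usual, conditioned on the start frame.
        fwd = denoise_fn(z, cond_start, t)
        # Backward path: flip the latents in time, denoise conditioned on the
        # end frame (which is now the "first" frame), then flip back.
        bwd = denoise_fn(z[::-1], cond_end, t)[::-1]
        # Fuse the two paths; a plain average stands in for the paper's
        # fusion rule, which is more involved.
        z = 0.5 * (fwd + bwd)
    return z
```

Because the backward path sees the end frame as its starting condition, the fused trajectory is pulled toward both bounds at once, which is what lets the sampler hit an exact ending frame without any fine-tuning.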

Comparison

Identical Bound: Input (left), Text2cinemagraph (middle), Ours (right).

View Bound: Input (left), Du et al. (middle), Ours (right).

Dynamic Bound: Input (left), FILM-Net (middle), Ours (right).

Acknowledgment

The authors would like to thank Tsvetelina Alexiadis, Taylor McConnell, and Tomasz Niewiadomski for their great help with the perceptual user study. Special thanks also go to Liang Wendong, Zhen Liu, Weiyang Liu, Zhanghao Sun, Yuliang Xiu, Yao Feng, and Yandong Wen for their proofreading and insightful discussions.