Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

🗂️ Project Page

📄 Paper

Main idea

Fig 1. Temporal consistency for batch generation

An LDM is first pre-trained on images only → then, the image generator is turned into a video generator by introducing a temporal dimension into the latent-space diffusion model and fine-tuning it on encoded image sequences, i.e., videos. The resulting model can generate high-resolution videos that are multiple minutes long.

Pipeline

Image diffusion → video diffusion

Additional temporal neural network layers are introduced. They are interleaved with the existing spatial layers and learn to align individual frames in a temporally consistent manner. These additional temporal layers define the video-aware temporal backbone of the model.

Fig 2. Temporal layers inserted into Diffusion Model

The spatial layers interpret the video as a batch of independent images (by shifting the temporal axis into the batch dimension), and for each temporal mixing layer the activations are reshaped back to video dimensions as follows:

Fig 3. Batch rearrangement for passing into temporal and spatial layers

In practice, we implement two different kinds of temporal mixing layers (see the sketch after this list):

  • temporal attention
  • residual blocks based on 3D convolutions
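A minimal PyTorch-style sketch of how the batch↔time reshaping from Fig 3 and the two temporal mixing layers could look; the module names, normalization choices, and the use of `einops` are illustrative assumptions, not the authors' exact implementation:

```python
import torch.nn as nn
from einops import rearrange


class TemporalAttention(nn.Module):
    """Self-attention over the time axis, applied independently per spatial location."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x, num_frames):
        # x: (B*T, C, H, W) -- the spatial layers see the video as a batch of images
        h = rearrange(x, "(b t) c h w -> (b h w) t c", t=num_frames)
        n = self.norm(h)
        h = h + self.attn(n, n, n)[0]
        return rearrange(h, "(b h w) t c -> (b t) c h w", h=x.shape[-2], w=x.shape[-1])


class Temporal3DConvBlock(nn.Module):
    """Residual block mixing information across time with 3D convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(8, channels),  # assumes channels divisible by 8
            nn.SiLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x, num_frames):
        # reshape the flattened batch back into an explicit video tensor (B, C, T, H, W)
        h = rearrange(x, "(b t) c h w -> b c t h w", t=num_frames)
        h = h + self.block(h)
        return rearrange(h, "b c t h w -> (b t) c h w")
```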

The video-aware temporal backbone is then trained using the same noise schedule as the underlying image model, and, importantly, the authors fix the spatial layers and only optimize the temporal layers.
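A hedged sketch of this training setup, assuming the temporal modules can be identified by name (the `"temporal"` naming convention and the hypothetical `video_unet` are assumptions for illustration):

```python
import torch


def temporal_parameters(model):
    """Yield only the parameters of the inserted temporal layers; freeze the rest."""
    for name, param in model.named_parameters():
        if "temporal" in name:           # assumption: temporal modules carry this name
            param.requires_grad = True
            yield param
        else:
            param.requires_grad = False  # pretrained spatial weights stay fixed


# `video_unet` is a hypothetical UNet with interleaved spatial and temporal layers
# optimizer = torch.optim.AdamW(temporal_parameters(video_unet), lr=1e-4)
```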

Temporal autoencoder

  • Problem: we still have the LDM autoencoder, which was trained only on images, and as a result, it produces flickering artifacts when encoding and decoding a temporally coherent sequence of images.
  • Solution: additional temporal layers are added to the autoencoder’s decoder, which is fine-tuned on video data with a (patch-wise) temporal discriminator built from 3D convolutions (a minimal sketch of such a discriminator follows Fig 4).
Fig 4. Autoencoder decoder for temporal consistency
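A minimal sketch of what such a patch-wise 3D-convolutional discriminator could look like; the channel widths, depth, and kernel sizes are illustrative assumptions rather than the authors' exact architecture:

```python
import torch.nn as nn


class Temporal3DPatchDiscriminator(nn.Module):
    """Patch-wise video discriminator built from 3D convolutions.

    Outputs a grid of real/fake logits, one per spatio-temporal patch,
    instead of a single score for the whole clip.
    """
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        channels = [in_channels, base, base * 2, base * 4]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [
                nn.Conv3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            ]
        layers += [nn.Conv3d(channels[-1], 1, kernel_size=3, padding=1)]  # patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, video):
        # video: (B, C, T, H, W) -- decoded frames stacked along the time axis
        return self.net(video)
```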

What about the long-term generation?

The authors train the models as prediction models given a number of (first) S context frames. This is implemented by introducing a temporal binary mask which masks out the T − S frames the model has to predict, where T is the total sequence length. This mask and the encoded video frames are fed into the model for conditioning. The frames are encoded with the LDM’s image encoder E, multiplied by the mask, and then fed (channel-wise concatenated with the mask) into the temporal layers after being processed with a learned downsampling operation.
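A sketch of how this masked conditioning could be assembled; the tensor shapes, the `encoder` and `downsample` callables, and the helper name are assumptions for illustration:

```python
import torch


def build_prediction_conditioning(frames, num_context, encoder, downsample):
    """frames: (B, T, C, H, W) clip; num_context: S leading context frames;
    encoder: the image LDM encoder E; downsample: a learned downsampling op."""
    B, T = frames.shape[:2]

    # binary temporal mask: 1 for the S context frames, 0 for the T - S frames to predict
    mask = torch.zeros(B, T, 1, 1, 1, device=frames.device)
    mask[:, :num_context] = 1.0

    # encode every frame with E and zero out the frames that have to be predicted
    latents = torch.stack([encoder(frames[:, t]) for t in range(T)], dim=1)
    masked_latents = latents * mask

    # channel-wise concatenate the mask with the masked latents, then downsample;
    # the result conditions the temporal layers
    cond = torch.cat([masked_latents,
                      mask.expand(-1, -1, 1, *latents.shape[-2:])], dim=2)
    return downsample(cond)
```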

During inference, for generating long videos, the sampling process may be applied iteratively, reusing the latest predictions as a new context.
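In pseudocode, this autoregressive rollout could look as follows (a minimal sketch; `sample_fn`, `initial_context`, and the chunking are hypothetical):

```python
def generate_long_video(sample_fn, initial_context, num_chunks, context_len):
    """Roll out a long video chunk by chunk, reusing the latest frames as context."""
    frames = list(initial_context)           # the first S context frames (e.g. latents)
    for _ in range(num_chunks):
        context = frames[-context_len:]      # latest predictions become the new context
        new_frames = sample_fn(context)      # model predicts the remaining T - S frames
        frames.extend(new_frames)
    return frames
```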

Temporal Interpolation for High Frame Rates

Besides high spatial resolution, we also need high temporal resolution, i.e., a high frame rate. This is achieved with a two-step synthesis process. The first step was already described above: it generates key frames with large semantic changes.

For the second step, an additional model is introduced to interpolate between the given key frames.
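A minimal sketch of this second step, assuming an interpolation model conditioned on the two surrounding key frames (the `interpolate_fn` signature and the interpolation factor are illustrative assumptions):

```python
def upsample_frame_rate(key_frames, interpolate_fn, frames_between=3):
    """Fill in frames between consecutive key frames to increase the frame rate."""
    video = []
    for prev_kf, next_kf in zip(key_frames[:-1], key_frames[1:]):
        video.append(prev_kf)
        # the interpolation model sees both surrounding key frames as context
        video.extend(interpolate_fn(prev_kf, next_kf, num_frames=frames_between))
    video.append(key_frames[-1])
    return video
```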

The training approach for the prediction and interpolation models is inspired by recent related works.

Temporal Fine-tuning of SR Models

Additionally, the authors use a diffusion model (DM) upscaler to increase the spatial resolution of the video outputs by 4×.

Since upsampling video frames independently would result in poor temporal consistency, the authors also make this SR model video-aware.

Since the upscaler operates locally, the authors conduct all upscaler training efficiently on patches only and later apply the model convolutionally (see the sketch below).
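A sketch of the patch-based training idea, assuming paired low-/high-resolution clips (the function name, shapes, and 4× scale default are illustrative assumptions); because the upscaler is fully convolutional, a model trained on such crops can later be applied to full-resolution frames:

```python
import torch


def sample_training_patch(lr_frames, hr_frames, patch=64, scale=4):
    """lr_frames: (B, T, C, h, w); hr_frames: (B, T, C, h*scale, w*scale).
    Returns aligned low-/high-resolution crops for upscaler training."""
    h, w = lr_frames.shape[-2:]
    top = torch.randint(0, h - patch + 1, (1,)).item()
    left = torch.randint(0, w - patch + 1, (1,)).item()
    lr_crop = lr_frames[..., top:top + patch, left:left + patch]
    hr_crop = hr_frames[..., top * scale:(top + patch) * scale,
                        left * scale:(left + patch) * scale]
    return lr_crop, hr_crop
```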

The final model's structure:

Fig 5. The final video-diffusion model

Implementation details

Dataset:

  • An in-house dataset of real driving scene (RDS) videos (683,060 videos of 8 seconds each at resolution 512 × 1024 (H × W) and frame rate up to 30 fps)
  • WebVid-10M (used for the image LDM → video LDM fine-tuning) → 10.7M video-caption pairs with a total of 52K video hours, resolution 320 × 512
  • Mountain Biking dataset

Compared with: Long Video GAN (LVG), CogVideo, MagicVideo, Make-A-Video, GODIVA, NÜWA

Metrics: FID, FVD, CLIP-SIM, Inception Score (IS), human evaluation

Pros and cons

  • Pros: an LDM pretrained on a large-scale image dataset can be reused; the generated videos are high-resolution and temporally coherent.
  • Limitations: the model is fairly large, and the temporal consistency is still not perfect.

Results:

Fig 6. Turtle swimming in the ocean
Fig 7. A horse galloping through van Gogh's Starry Night
Fig 8. Beer pouring into glass, low angle video shot
Fig 9. Training images for DreamBooth
Fig 10. A sks frog playing a guitar in a band
Fig 11. Comparison and ablation study
Fig 12. Comparison metrics
Fig 13. Comparison metrics
Fig 14. Effects of video fine-tuning of the decoder



Report Page