Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
🗂️ Project Page
📄 Paper
Main idea
An LDM is first pre-trained on images only → then the image generator is turned into a video generator by introducing a temporal dimension into the latent-space diffusion model and fine-tuning it on encoded image sequences, i.e., videos. The model can generate multiple-minute-long videos at high resolution.
Pipeline
Image diffusion → video diffusion
Additional temporal neural network layers are introduced. They are interleaved with the existing spatial layers and learn to align individual frames in a temporally consistent manner. These temporal layers define the video-aware temporal backbone of the model.

The spatial layers interpret the video as a batch of independent images (by shifting the temporal axis into the batch dimension), and before each temporal mixing layer the latents are reshaped back to video dimensions, as sketched below.
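A minimal PyTorch/einops sketch of one such block. The learned per-layer blend `alpha` follows the paper's description (at `alpha = 1` the model falls back to the pretrained image LDM), but the concrete conv layers and the `SpatioTemporalBlock` name are illustrative stand-ins, not the paper's exact architecture:

```python
import torch
from torch import nn
from einops import rearrange

class SpatioTemporalBlock(nn.Module):
    """Illustrative stand-in for one U-Net block: pretrained spatial layer
    interleaved with a new temporal mixing layer (3D conv over time)."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)  # pretrained image layer
        self.temporal = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))  # mixes along time only
        self.alpha = nn.Parameter(torch.ones(1))  # learned blend; alpha = 1 recovers the image model

    def forward(self, z: torch.Tensor, t: int) -> torch.Tensor:
        # z: ((b*t), c, h, w) -- frames stacked into the batch dim,
        # so the spatial layer sees them as independent images
        z = self.spatial(z)
        # reshape to video dimensions for temporal mixing, then back
        z_vid = rearrange(z, "(b t) c h w -> b c t h w", t=t)
        z_vid = self.temporal(z_vid)
        z_temp = rearrange(z_vid, "b c t h w -> (b t) c h w")
        return self.alpha * z + (1 - self.alpha) * z_temp
```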

In practice, two different kinds of temporal mixing layers are implemented:
- temporal attention
- residual blocks based on 3D convolutions
The video-aware temporal backbone is then trained using the same noise schedule as the underlying image model, and, importantly, the authors fix the spatial layers and only optimize the temporal layers.
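A hedged sketch of that training setup, assuming the temporal modules carry `temporal`/`alpha` names as in the block above (`video_unet` and the learning rate are placeholders):

```python
import torch

# Freeze the pretrained spatial layers; optimize only the new temporal layers and blends.
for name, param in video_unet.named_parameters():
    param.requires_grad = "temporal" in name or "alpha" in name

optimizer = torch.optim.AdamW(
    [p for p in video_unet.parameters() if p.requires_grad], lr=1e-4  # lr is an assumption
)
```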
Temporal autoencoder
- Problem: the LDM autoencoder was trained only on images, so it produces flickering artifacts when encoding and decoding a temporally coherent sequence of frames.
- Solution: add temporal layers to the autoencoder's decoder and fine-tune it on video data against a (patch-wise) temporal discriminator built from 3D convolutions, as in the sketch below.
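A minimal sketch of such a discriminator; the exact architecture (channel widths, depth) is an assumption, not taken from the paper:

```python
import torch
from torch import nn

class TemporalPatchDiscriminator(nn.Module):
    """Hypothetical patch-wise video discriminator built from 3D convolutions:
    outputs one real/fake logit per spatio-temporal patch."""
    def __init__(self, in_channels: int = 3, base: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(base, base * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(base * 2, 1, 3, padding=1),  # per-patch logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (b, c, t, h, w) -- a decoded video clip
        return self.net(x)
```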

What about long-term generation?
The authors also train the models as prediction models conditioned on a number S of (first) context frames. This is implemented with a temporal binary mask that masks out the T − S frames the model has to predict, where T is the total sequence length. The mask and the masked encoded video frames are fed into the model for conditioning: the frames are encoded with the LDM's image encoder E, multiplied by the mask, and then fed (channel-wise concatenated with the mask) into the temporal layers after a learned downsampling operation, as in the sketch below.
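A small sketch of the masked-conditioning construction; the helper name and tensor layout are assumptions:

```python
import torch

def prediction_conditioning(latents: torch.Tensor, num_context: int) -> torch.Tensor:
    """Hypothetical helper: latents are frames already encoded with the image
    encoder E, shaped (b, t, c, h, w); num_context is S."""
    b, t, c, h, w = latents.shape
    mask = torch.zeros(b, t, 1, 1, 1, device=latents.device)
    mask[:, :num_context] = 1.0          # keep the first S frames, hide the T - S to predict
    masked = latents * mask              # zero out the frames the model must predict
    # channel-wise concatenation of the masked latents with the mask itself
    cond = torch.cat([masked, mask.expand(b, t, 1, h, w)], dim=2)
    # in the paper, this is then processed by a learned downsampling operation
    # before being fed into the temporal layers
    return cond
```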
During inference, for generating long videos, the sampling process may be applied iteratively, reusing the latest predictions as a new context.
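A hedged sketch of that rollout; `sample_prediction` stands in for the trained prediction model's sampler, and `context` holds the S most recent latent frames:

```python
import torch

num_chunks = 4  # how many times to extend the video (an arbitrary example value)
chunks = [context]                                # context: (b, S, c, h, w)
for _ in range(num_chunks):
    new_frames = sample_prediction(context)       # predict the next T - S latent frames
    chunks.append(new_frames)
    context = new_frames[:, -context.shape[1]:]   # latest predictions become the new context
video_latents = torch.cat(chunks, dim=1)          # decode with the video-fine-tuned decoder afterwards
```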
Temporal Interpolation for High Frame Rates
Besides high spatial resolution, we also need high temporal resolution. This is achieved with a two-step synthesis process. The first step, described above, generates key frames with large semantic changes but at a low frame rate.
For the second step, an additional model is introduced to interpolate between the given key frames, reusing the same masking-based conditioning mechanism (now masking the frames to be interpolated); see the sketch below.
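A hypothetical end-to-end rollout of the two-step scheme; the sampler names are stand-ins for the trained key-frame LDM and interpolation LDM:

```python
key_frames = sample_keyframes(prompt)        # sparse key frames, large semantic changes (rate T)
video = sample_interpolation(key_frames)     # fill 3 frames between each pair: T -> 4T
video = sample_interpolation(video)          # apply twice for higher frame rates: 4T -> 16T
```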
The training approach for the prediction and interpolation models is inspired by recent works, such as:
- Flexible Diffusion Modeling of Long Videos
- Diffusion Models for Video Prediction and Infilling
- MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation
Temporal Fine-tuning of SR Models
Additionally, the authors use a diffusion model (DM) upsampler to upscale the video outputs by 4×.
Since upsampling video frames independently would result in poor temporal consistency, the authors also make this SR model video-aware.
Since the upscaler operates locally, the authors conduct all upscaler training efficiently on patches only and later apply the model convolutionally, as illustrated below.
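A toy illustration of the "train on patches, apply convolutionally" pattern: a fully convolutional network accepts any input size, so weights trained on small spatio-temporal crops transfer to full-resolution videos. The tiny net is a stand-in, not the paper's SR diffusion U-Net:

```python
import torch
from torch import nn

sr_net = nn.Sequential(
    nn.Conv3d(3, 32, 3, padding=1), nn.SiLU(),
    nn.Conv3d(32, 3, 3, padding=1),
)
patch = torch.randn(1, 3, 8, 64, 64)    # training-time input: a small local patch
full = torch.randn(1, 3, 8, 128, 256)   # inference-time input: the full video
# same weights work at both sizes, with shapes preserved
assert sr_net(patch).shape == patch.shape and sr_net(full).shape == full.shape
```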
The final model's structure:

Implementation details
Datasets:
- An in-house dataset of real driving scene (RDS) videos (683,060 videos of 8 seconds each at resolution 512 × 1024 (H × W) and frame rate up to 30 fps)
- WebVid-10M (used to turn the image LDM into a video LDM): 10.7M video-caption pairs totaling 52K video hours, resolution 320 × 512
- Mountain Biking dataset
Compared with: Long Video GAN (LVG), CogVideo, MagicVideo, Make-A-Video, GODIVA, NÜWA
Metrics: FID, FVD, CLIP-SIM, video Inception Score (IS), human evaluation
Pros and cons
- Pros: an LDM pre-trained on a large image dataset can be reused; the generated videos are high-resolution and temporally coherent.
- Limitations: the model is fairly large, and temporal consistency is still not perfect.
Results:
