TryOnDiffusion: A Tale of Two UNets

https://t.me/reading_ai, @AfeliaN

🗂️ Project Page

📄 Paper

🗓 Date: 14 Jun 2023

Main idea

Fig 1. Main idea
  • Motivation: the main challenge is to preserve garment details while handling significant variation in body pose and shape.
  • Solution: TryOnDiffusion, a diffusion-based architecture that unifies two UNets. As input, TryOnDiffusion takes two images: a target person image and an image of a garment worn by another person. As output, it produces the target person wearing that garment. It works at high resolution (1024×1024) and handles diverse body shapes while preserving garment details.

Pipeline

Fig 2. Pipeline

The goal: given an image of a person I_p and an image of a different person wearing a garment, generate an image of the first person wearing that garment.

The model was trained on paired data: two images I_p and I_g of the same person wearing the same garment, but in two different poses.

  1. Preprocessing step:

  • segment the garment using a parsing map
  • generate a clothing-agnostic RGB image, which removes the original clothing but retains the person's identity.

As a result, the model takes the following conditioning inputs (see the sketch after this list):

  • clothing-agnostic RGB image
  • 2D keypoints of the source image
  • 2D keypoints of the target image
  • image of the segmented garment
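
A minimal sketch of how these conditioning inputs could be bundled in code; all field names are hypothetical (the paper's code is not public):

```python
from dataclasses import dataclass

import torch


@dataclass
class TryOnConditioning:
    """Bundle of try-on conditioning inputs (field names are my own)."""
    agnostic_rgb: torch.Tensor  # clothing-agnostic person image, (B, 3, H, W)
    source_pose: torch.Tensor   # 2D keypoints of the source image, (B, J, 2)
    target_pose: torch.Tensor   # 2D keypoints of the target image, (B, J, 2)
    garment_rgb: torch.Tensor   # segmented garment image, (B, 3, H, W)
```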

2. As the model, the authors use cascaded diffusion models (a data-flow sketch follows this list).

  • The base diffusion model is parameterized as a 128×128 Parallel-UNet: it takes the conditioning inputs prepared in the previous step and returns a 128×128 result.
Fig 3. 128x128 Parallel-UNet
  • The second diffusion model (128×128 → 256×256 SR) takes as conditioning the result of the previous model plus the try-on conditioning inputs at 256×256 resolution. During training, the ground-truth image downsampled to 128×128 is used as the low-resolution input.
Fig 4. 256x256 Parallel-UNet
  • The final SR diffusion model (256×256 → 1024×1024) is parameterized as the Efficient-UNet introduced in Imagen. This stage is a pure super-resolution model, with no try-on conditioning.
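
A minimal sketch of the cascade's data flow at inference time; `sample` stands for any diffusion sampler and the stage models are passed in as callables, so only the wiring between stages follows the paper:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def tryon_cascade(sample, base_unet, sr_256, sr_1024, cond):
    # `sample(model, shape, **kwargs)` is any diffusion sampler (e.g. DDPM).
    # Stage 1: 128x128 Parallel-UNet conditioned on all try-on inputs.
    x_128 = sample(base_unet, shape=(1, 3, 128, 128), cond=cond)

    # Stage 2: 128 -> 256 SR Parallel-UNet, conditioned on the stage-1 output
    # and the try-on inputs at 256x256. During training, the ground-truth
    # image downsampled to 128x128 takes the place of x_128.
    x_256 = sample(sr_256, shape=(1, 3, 256, 256), cond=cond,
                   low_res=F.interpolate(x_128, size=256, mode="bilinear"))

    # Stage 3: 256 -> 1024 Efficient-UNet, pure SR, no try-on conditioning.
    return sample(sr_1024, shape=(1, 3, 1024, 1024),
                  low_res=F.interpolate(x_256, size=1024, mode="bilinear"))
```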

3. More about Parallel-UNet

In this part we discuss the conditioning. As the authors found, channel-wise concatenation cannot handle complex transformations such as garment warping.

Fig 5. Cross-attention vs concatenation
Fig 6. Cross-attention vs concatenation

To deal with this problem, the authors propose a cross-attention mechanism: the queries are the flattened features of the noisy image, while the keys and values are the flattened features of the segmented garment (a sketch follows).
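
A minimal sketch of this garment cross-attention, assuming standard multi-head scaled dot-product attention (the paper's exact layer configuration is not published):

```python
import torch
import torch.nn as nn


class GarmentCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, person_feats, garment_feats):
        # person_feats:  (B, C, H, W) features of the noisy person image
        # garment_feats: (B, C, h, w) features of the segmented garment
        b, c, h, w = person_feats.shape
        q = person_feats.flatten(2).transpose(1, 2)    # (B, H*W, C) -> queries
        kv = garment_feats.flatten(2).transpose(1, 2)  # (B, h*w, C) -> keys/values
        out, _ = self.attn(q, kv, kv)                  # attend person -> garment
        return out.transpose(1, 2).reshape(b, c, h, w)
```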

  • Instead of warping the garment to the target body and then blending it with the target person, the authors combine the two operations into a single pass.
  • The person-UNet takes the clothing-agnostic RGB and the noisy image as input (directly concatenated).
  • The garment-UNet takes the segmented garment as input and fuses it via cross-attention. To save model parameters, the authors stop the garment-UNet early, after its 32×32 upsampling block, where the final cross-attention module in the person-UNet takes place (see the sketch below).
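
A minimal sketch of how the two UNets could be wired together; the sub-module interfaces are hypothetical, only the fusion pattern (input concatenation for the person branch, cross-attention from the garment branch) follows the paper:

```python
import torch
import torch.nn as nn


class ParallelUNet(nn.Module):
    def __init__(self, person_unet: nn.Module, garment_unet: nn.Module):
        super().__init__()
        self.person_unet = person_unet    # full denoising UNet with cross-attn blocks
        self.garment_unet = garment_unet  # encoder stopped after the 32x32 level

    def forward(self, z_t, agnostic_rgb, garment_rgb, poses, t):
        # Person branch: the noisy image z_t and the clothing-agnostic RGB
        # are concatenated channel-wise at the input.
        x = torch.cat([z_t, agnostic_rgb], dim=1)
        # Garment branch: encodes the segmented garment; its feature maps
        # serve as keys/values for cross-attention inside the person branch.
        garment_feats = self.garment_unet(garment_rgb)
        # The person UNet predicts the denoised image, attending to the
        # garment features at matching resolutions (hypothetical interface).
        return self.person_unet(x, t, context=garment_feats, pose=poses)
```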

Implementation details

Dataset:

  • train: a collected paired dataset of 4 million samples
  • test: 6K collected unpaired samples and the VITON-HD dataset

Compared with: TryOnGAN, SDAFN, HR-VITON

Fig 7. Comparisons (images)
Fig 8. Comparisons (images)

Metrics: FID, KID, and a user study (a sketch of the automated metrics follows the figures).

Fig 9. Metrics
Fig 10. User studies
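
A minimal sketch of computing FID and KID with torchmetrics (my choice of library; the paper does not specify an implementation):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)

# real / fake are uint8 image batches of shape (N, 3, H, W) in [0, 255];
# random tensors here stand in for real and generated try-on images.
real = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)

for imgs, is_real in [(real, True), (fake, False)]:
    fid.update(imgs, real=is_real)
    kid.update(imgs, real=is_real)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```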

Pros and cons

Pros: quite impressive results, high-resolution output

Limitations:

  • Representing identity via the clothing-agnostic RGB is not ideal, since it may preserve only part of the identity (e.g., tattoos or specific muscle structure won't be visible in this representation).
  • Train and test datasets mostly have clean, uniform backgrounds, so it is unknown how the method performs with more complex backgrounds.
  • This work focuses on upper-body clothing; the authors have not experimented with full-body try-on.
Fig 11. Failure cases

Results

Fig 12. Some results

