In-Context Learning Unlocked for Diffusion Models
Project Page
Paper
GitHub
Date: 2023
Main idea
- Motivation: Diffusion models require a substantial amount of data and extensive training time to learn new tasks. The recently published ControlNet partly addresses this, but the model still has to be finetuned for each individual task.
- Solution: Prompt Diffusion, a framework that enables in-context learning in diffusion models. Given a pair of task-specific example images and text guidance, the model automatically understands the underlying task and performs it on a new query image, following the text guidance.

The model was trained jointly on six vision-language tasks:
1. three forward tasks (image processing tasks):
   - images to depth maps,
   - images to HED maps,
   - images to segmentation maps;
2. three inverse tasks:
   - depth maps to images,
   - HED maps to images,
   - segmentation maps to images.
The model was then additionally evaluated on unseen tasks and on image editing.
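To make the forward/inverse framing concrete, here is a minimal sketch of what one in-context prompt might contain; the class and field names are hypothetical, not the authors' actual interface.

```python
from dataclasses import dataclass
import torch

@dataclass
class VisionPrompt:
    """One task-specific example pair plus a new query (hypothetical names)."""
    example_source: torch.Tensor  # (3, H, W): an image (forward task) or a map (inverse task)
    example_target: torch.Tensor  # (3, H, W): the paired output demonstrating the task
    query: torch.Tensor           # (3, H, W): new input to transform the same way
    text: str                     # text guidance for the generation

# A forward task (image -> depth map) and its inverse differ only in which
# side of the example pair plays source vs. target; the model infers the
# task from the pair itself.
```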
Pipeline

The model takes both text and image inputs.
For the image inputs, the authors concatenate the example pair of images along the channel dimension, then project the concatenated pair and the query image into embeddings of equal dimensions via two independent stacks of convolutional layers. The sum of the two embeddings is fed into the ControlNet branch.
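Below is a minimal PyTorch sketch of this image-conditioning path. The layer widths, depths, and embedding channel count are placeholders, not the paper's exact architecture; only the structure (channel-wise concatenation, two independent conv stacks, summed embeddings) follows the description above.

```python
import torch
import torch.nn as nn

def conv_stack(in_ch: int, out_ch: int) -> nn.Sequential:
    # Strided convolutions projecting an image-space input down to the
    # spatial resolution and channel count expected by the ControlNet branch.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
        nn.Conv2d(128, out_ch, 3, stride=2, padding=1),
    )

class PromptImageEncoder(nn.Module):
    def __init__(self, embed_ch: int = 320):  # 320 is a placeholder width
        super().__init__()
        # The example pair (two RGB images) concatenated along channels -> 6 channels.
        self.pair_encoder = conv_stack(6, embed_ch)
        # The query image goes through an independent stack.
        self.query_encoder = conv_stack(3, embed_ch)

    def forward(self, example_src, example_tgt, query):
        pair = torch.cat([example_src, example_tgt], dim=1)  # (B, 6, H, W)
        # The two embeddings are summed into a single conditioning map
        # that is fed into the ControlNet branch.
        return self.pair_encoder(pair) + self.query_encoder(query)
```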
Implementation details
Datasets: the InstructPix2Pix dataset, with depth and normal maps from MiDaS, segmentation maps from Uniformer, Canny edges, and HED maps from an HED boundary detector.
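As an illustration of how such conditioning signals are produced, here is a hedged sketch for the simplest one, Canny edges, via OpenCV; the thresholds and file name are illustrative, and the depth, segmentation, and HED maps come from the pretrained MiDaS, Uniformer, and HED models in the same spirit.

```python
import cv2
import numpy as np

def canny_condition(image_bgr: np.ndarray, low: int = 100, high: int = 200) -> np.ndarray:
    # Thresholds are illustrative defaults, not values from the paper.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)        # (H, W) uint8 edge mask
    return np.stack([edges] * 3, axis=-1)     # replicate to 3 channels

# img = cv2.imread("photo.jpg")               # hypothetical input path
# cond = canny_condition(img)                 # one half of a training pair
```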
Tasks: the same three forward and three inverse tasks listed above.
Compared with: ControlNet
Pros and cons
- Pros: One of the advantages of this model is its ability to perform new tasks without the need for additional tuning. Additionally, the model offers relatively fast image editing capabilities.
- Limitations: I am not sure, but one potential limitation is how well the model preserves a person's identity during editing. For example, here we can see changes in the woman's appearance, and I suspect that editing a specific person would lose even more identity features. Unfortunately, I didn't test this, so if you have examples, feel free to share them with me :)

Results