Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond.
https://t.me/reading_ai, @AfeliaN
🗂️ Project Page
📄 Paper
📎 GitHub
🗓 Date: 26 Apr 2023
Main idea
- Motivation: Despite the significant success of diffusion models, using negative prompts has its own limitations, particularly when the main and negative prompts overlap semantically. Additionally, current solutions for generating 3D assets from a given text or image often suffer from the Janus (multi-head) problem.
- Solution: Perp-Neg, a new algorithm that leverages the geometrical properties of the score space to address the shortcomings of the current negative prompts algorithm.
Pipeline
The problem of semantic overlap
In practice, the positive and negative prompts used to condition diffusion models often overlap. Ideally, the two prompts would be independent, but this is rarely achievable. The overlap can lead to undesired results, as shown in the image below: in the second row, the key concepts requested in the main text prompt ("armchair", "sunglasses", "crown", and "horse", respectively) are removed when those concepts also appear in the negative prompts.

To address this problem, the authors suggest using a perpendicular gradient; let us discuss what that means.
Recall that when the conditions $c_1$ and $c_2$ are independent, each of them contributes its own denoising score component.

But in the case of the overlap discussed above, the score component should be modified. Given the geometrical interpretation of $e_{\Theta}^i$, a natural solution is to take the component of $e_{\Theta}^2$ perpendicular to $e_{\Theta}^1$ as its independent part.

The most important property of the suggested perpendicular gradient is that the component along $e_{\Theta}^1$ is not affected by the additional prompt.
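This perpendicular projection can be sketched in a few lines of numpy. The function names, the classifier-free-guidance-style combination, and the weighting scheme below are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def perp_component(e_pos, e_neg):
    """Remove from e_neg the part parallel to e_pos, keeping only the
    component perpendicular to the positive score direction."""
    e_pos = np.asarray(e_pos, dtype=float)
    e_neg = np.asarray(e_neg, dtype=float)
    # Projection of e_neg onto e_pos, then subtract it.
    coeff = np.dot(e_neg, e_pos) / np.dot(e_pos, e_pos)
    return e_neg - coeff * e_pos

def perp_neg_score(e_uncond, e_pos, e_negs, w_pos=1.0, w_negs=None):
    """Guidance-style score update where each negative direction is first
    projected perpendicular to the positive one, so subtracting it cannot
    cancel the concept requested by the main prompt."""
    if w_negs is None:
        w_negs = [1.0] * len(e_negs)
    d_pos = np.asarray(e_pos, dtype=float) - e_uncond
    score = np.asarray(e_uncond, dtype=float) + w_pos * d_pos
    for e_neg, w in zip(e_negs, w_negs):
        d_neg = np.asarray(e_neg, dtype=float) - e_uncond
        score = score - w * perp_component(d_pos, d_neg)
    return score
```

By construction, the subtracted term is orthogonal to the positive direction, which is exactly the property stated above.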
The main sampling pipeline is illustrated in the image below:


3D
How can this be used in 3D? Several works build 3D assets by applying diffusion priors, such as DreamFusion, Magic3D, and others. All of them suffer from the so-called Janus (multi-faced) problem: for instance, when asked to generate a 3D sample of a person or animal, the model produces an object with multiple faces instead of a proper back view.

Some works, such as 3DFuse, try to mitigate this problem by providing additional depth information, but the results are still imperfect.
Another proposed solution is view-dependent prompting (e.g. appending "back view", "side view", or "overhead view" according to the camera position), but it also does not fully solve the problem.
So the authors propose the following method. First, they define txt_{back}, txt_{side}, and txt_{front} as the main text prompt appended with back, side, and front views, respectively.
Then simple view-dependent prompts are replaced with new sets of positive and negative prompts:

The authors also observed that increasing the weight of a negative prompt makes the algorithm focus more on avoiding that view, so the weight acts as a pose factor.
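A minimal sketch of view-dependent prompt selection might look like the following. The azimuth thresholds, prompt suffixes, and the convention that 0° is the front view are assumptions for illustration, not values from the paper:

```python
def view_prompts(main_prompt, azimuth_deg, side_thresh=45.0, back_thresh=135.0):
    """Pick a positive view prompt and the complementary negative view
    prompts from the camera azimuth (0 deg = front, 180 deg = back)."""
    a = abs(azimuth_deg) % 360
    a = min(a, 360 - a)  # fold the angle into [0, 180]
    txt_front = f"{main_prompt}, front view"
    txt_side = f"{main_prompt}, side view"
    txt_back = f"{main_prompt}, back view"
    if a < side_thresh:      # camera roughly in front
        return txt_front, [txt_side, txt_back]
    elif a < back_thresh:    # camera roughly to the side
        return txt_side, [txt_front, txt_back]
    else:                    # camera roughly behind
        return txt_back, [txt_front, txt_side]
```

The views not matching the current camera pose become negative prompts, which is what discourages the extra faces.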
Additionally, to interpolate between the side and back views, the authors use the following embedding as the positive prompt:

and for negative:

To interpolate between the front and side views:

and the corresponding negative prompts:

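The general shape of such an interpolation can be sketched as a linear blend of view embeddings with negative-prompt weights that fade with the interpolation factor. This is an illustrative simplification, not the paper's exact formulas:

```python
import numpy as np

def interpolate_view_embedding(v_near, v_far, r):
    """Blend two view text embeddings with weight r in [0, 1]
    (r = 1 gives v_near, r = 0 gives v_far). A simple linear
    interpolation sketch; the paper's weighting may differ."""
    v_near = np.asarray(v_near, dtype=float)
    v_far = np.asarray(v_far, dtype=float)
    return r * v_near + (1.0 - r) * v_far

def interpolation_neg_weights(r, w_max=1.0):
    """Fade the negative-prompt weights with the interpolation factor,
    so the view being left behind is suppressed more strongly; an
    illustrative choice consistent with the pose-factor observation."""
    return {"near_view": w_max * (1.0 - r), "far_view": w_max * r}
```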
Additionally, the authors improve the SDS loss:

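Putting the pieces together, the Perp-Neg variant of the SDS gradient can be sketched as below. The function signature, guidance convention, and timestep weighting are assumptions; the paper's implementation operates on full noise-prediction tensors rather than toy vectors:

```python
import numpy as np

def perp_neg_sds_grad(eps_uncond, eps_pos, eps_negs, w_negs, noise,
                      guidance=7.5, w_t=1.0):
    """SDS-style gradient where the guided noise prediction uses the
    Perp-Neg combination: each negative direction is projected
    perpendicular to the positive one before being subtracted."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    d_pos = np.asarray(eps_pos, dtype=float) - eps_uncond
    eps_hat = eps_uncond + guidance * d_pos
    for eps_neg, w in zip(eps_negs, w_negs):
        d_neg = np.asarray(eps_neg, dtype=float) - eps_uncond
        # Component of the negative direction perpendicular to d_pos.
        proj = (d_neg @ d_pos) / (d_pos @ d_pos) * d_pos
        eps_hat = eps_hat - guidance * w * (d_neg - proj)
    # SDS gradient: timestep weight times (prediction - injected noise).
    return w_t * (eps_hat - np.asarray(noise, dtype=float))
```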
Implementation details
Experiment: semantically aligned 2D generation
Compared with: Stable Diffusion, Compositional Energy-Based Model (CEBM)
Results




Pros: this use of negative prompts improves results in both 2D and 3D. It does not require a complicated prompt-engineering process and alleviates the multi-head (Janus) problem in 3D generation.
Results
2D images

3D assets
