ATT3D: Amortized Text-To-3D Object Synthesis

https://t.me/reading_ai, @AfeliaN

πŸ—‚οΈ Project Page

πŸ“„ Paper

πŸ—“ Date: 6 Jun 2023

Main idea

Fig 1. Main idea: difference between existing methods and ATT3D
  • Motivation: existing text→3D models need to re-train the NeRF for each new text prompt (see Magic3D), which is time-consuming and computationally costly.
  • Solution: Amortized text-to-3D (ATT3D) - a unified model trained on many prompts simultaneously. It takes around 1 second on a single GPU to produce an accurate object. The main idea is a modulation technique applied to the Instant NGP encoding.

As a result, the model:

  • Generalizes to new prompts
  • Supports interpolation between prompts
  • Can amortize over settings other than text prompts

Pipeline

Fig 2. Pipeline

The main idea of this work is amortized optimization: using learning to predict solutions when we repeatedly solve similar instances of the same problem.

The text-to-3D (TT3D) process is split into two stages:

  • training: optimize one model offline to generate 3D objects for many different text prompts simultaneously. This amortizes optimization over the prompts by sharing work between similar instances.
  • cheap inference: the user-facing stage uses the amortized model in a simple feed-forward pass to quickly generate an object from text, with no further optimization required (a toy sketch of both stages follows this list).
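To make the two-stage split concrete, here is a toy, runnable PyTorch sketch. The text encoder, the amortized model, and the SDS loss are dummy stand-ins (a frozen embedding table, a linear layer, and a squared-norm loss); only the structure of the two stages mirrors the paper.

```python
import torch
import torch.nn as nn

# Dummy stand-ins: a frozen "text encoder", a tiny "amortized model",
# and a placeholder loss. Only the control flow mirrors ATT3D.
encoder = nn.Embedding(100, 32)                 # stand-in for CLIP
model = nn.Linear(32, 16)                       # stand-in for the amortized model
sds_loss = lambda out: out.pow(2).mean()        # stand-in for score distillation

opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1 (offline training): one model, many prompts, shared updates.
for step in range(1000):
    batch = torch.randint(0, 100, (8,))         # sample a batch of prompt ids
    with torch.no_grad():
        emb = encoder(batch)                    # text embeddings (frozen encoder)
    loss = sds_loss(model(emb))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 (cheap inference): a single feed-forward pass, no optimization.
with torch.no_grad():
    obj_params = model(encoder(torch.tensor([7])))
```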

Architecture

The resulting model consists of several parts:

  • Mapping network
  • NeRF
  • Spatial grid features with parameters w

The key idea is to compute a modulation from the text with the mapping network and apply it to the per-point encoding used by the NeRF. The algorithm can be summarized as follows: at each optimization step, the authors sample several prompts and look up their (potentially cached) text embeddings, which are used to compute the modulations. Camera poses and rendering conditions are also sampled. These are combined with the NeRF module to render images, to which the SDS loss is applied. A compact sketch of one such training step follows Fig 3.

Fig 3. Algorithm
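The per-step logic can be compressed into a short sketch. Everything here is an assumption-level stand-in: `render` replaces the Instant NGP volume renderer, the squared-norm loss replaces SDS guidance from a frozen diffusion model, camera sampling is omitted, and elementwise multiplication is just one plausible way to apply the modulation to the grid features.

```python
import torch
import torch.nn as nn

T, E, M, G = 64, 512, 32, 4096          # prompts, CLIP dim, modulation dim, grid cells

text_emb = torch.randn(T, E)            # cached text embeddings, one per prompt
mapping = nn.Sequential(nn.Linear(E, 256), nn.SiLU(), nn.Linear(256, M))
grid = nn.Parameter(torch.randn(G, M))  # spatial grid features w
nerf_head = nn.Sequential(nn.Linear(M, 64), nn.SiLU(), nn.Linear(64, 4))  # -> rgb + density

opt = torch.optim.Adam(
    list(mapping.parameters()) + [grid] + list(nerf_head.parameters()), lr=1e-3)

def render(features):                   # stand-in for Instant NGP volume rendering
    return features.mean(dim=1)         # pretend this produces an image

for step in range(100):
    idx = torch.randint(0, T, (4,))            # 1. sample a batch of prompts
    mod = mapping(text_emb[idx])               # 2. modulations from text embeddings
    cells = torch.randint(0, G, (4, 128))      # 3. sample points -> grid lookups
    feats = grid[cells] * mod[:, None, :]      # 4. modulate grid features per prompt
    img = render(nerf_head(feats))             # 5. "render" images
    loss = img.pow(2).mean()                   # 6. stand-in for the SDS loss
    opt.zero_grad(); loss.backward(); opt.step()
```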

In addition to amortizing optimization over many prompts, the authors show that it is possible to amortize over other variables, such as the guidance weight, regularizers, and data augmentation.

Fig 4. Interpolation over different parameters
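A minimal sketch of interpolation, reusing the hypothetical `text_emb` and `mapping` objects from the training-step snippet above: blend two cached text embeddings and map the result to a modulation. Amortizing over other settings (e.g., the guidance weight) works the same way, by appending them to the conditioning vector.

```python
alpha = 0.5                                   # interpolation weight
blend = alpha * text_emb[0] + (1 - alpha) * text_emb[1]
mod = mapping(blend)                          # modulation for the blended prompt
# rendering with `mod` yields an object "between" the two prompts
```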


Implementation details

Text Prompt datasets:

  • DreamFusion dataset (27 prompts from DreamFusion's main paper)
  • Compositional dataset: a compositional prompt set built by composing fragments with the template “a {animal} {activity} {theme}”. Using this template, the authors created a small pig-prompt dataset and a larger animal-prompt dataset (a small generation snippet follows the figures).
Fig 5. Different animal prompts
Fig 6. Pig prompts
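For illustration, such compositional prompts can be enumerated with itertools.product; the fragment lists below are made-up examples, not the paper's exact vocabulary.

```python
from itertools import product

animals = ["pig", "squirrel"]
activities = ["riding a motorcycle", "playing the guitar"]
themes = ["wearing a top hat", "made of wood"]

# Cartesian product of fragments -> 2 * 2 * 2 = 8 prompts
prompts = [f"a {a} {act} {th}" for a, act, th in product(animals, activities, themes)]
print(len(prompts), prompts[0])   # 8, "a pig riding a motorcycle wearing a top hat"
```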

Metrics:

  • Computational Cost: the number of rendered frames used for training, normalized by the number of prompts. Specifically, this is the number of optimization iterations times the batch size, divided by the total number of prompts in the dataset (a worked example follows the metric figures).
Fig 7. Computational budgets
  • Quality: CLIP R-precision, CLIP R-probability
Fig 8. Metrics
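A worked example of the cost metric with illustrative numbers (not the paper's actual budgets):

```python
# frames per prompt = iterations * batch_size / total number of prompts
iterations, batch_size, num_prompts = 10_000, 8, 2_400
frames_per_prompt = iterations * batch_size / num_prompts
print(frames_per_prompt)   # 33.33... rendered frames per prompt
```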

Models: Instant NGP (3D representation), CLIP (text embedding)

Mapping network options (sketched after the list):

  • concatenation option
  • hypernetwork option
  • attention option
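Minimal sketches of the three options, with assumed layer sizes and an assumed wiring into the NeRF; they illustrate the conditioning mechanisms, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

E, M = 512, 32                     # text-embedding dim, point-feature dim
emb = torch.randn(1, E)            # pooled text embedding
feat = torch.randn(1, M)           # a point's grid feature

# 1. Concatenation: append a projection of the text embedding to the feature.
proj = nn.Linear(E, M)
concat_in = torch.cat([feat, proj(emb)], dim=-1)      # input to the NeRF MLP

# 2. Hypernetwork: predict a NeRF layer's weights from the text embedding.
hyper = nn.Linear(E, M * M)
W = hyper(emb).view(M, M)
hyper_out = feat @ W.T                                # feature through predicted layer

# 3. Attention: the point feature attends over projected text tokens.
tokens = torch.randn(1, 8, M)                         # 8 projected text tokens
attn = nn.MultiheadAttention(M, num_heads=4, batch_first=True)
attn_out, _ = attn(feat.unsqueeze(1), tokens, tokens) # query = point feature
```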

Pros and cons

  • Pros: around 1 second per new prompt on a single GPU; generalization across different prompts and settings; the ability to interpolate.
  • Limitations: lack of diversity; problems with long text prompts; and, since there is no mesh optimization as in the Magic3D pipeline, some loss in quality.

Results

Fig 9. Comparison to per-prompt (one model per prompt) training
Fig 10. Comparison to per-prompt (one model per prompt) training
Fig 11. Some other results


Report Page