ATT3D: Amortized Text-To-3D Object Synthesis

https://t.me/reading_ai, @AfeliaN

πŸ—‚οΈ Project Page

πŸ“„ Paper

πŸ—“ Date: 6 Jun 2023

Main idea

Fig 1. Main idea: difference between existing methods and ATT3D
  • Motivation: existing text→3D models need to re-train the NeRF for each new text prompt (see Magic3D), which is time-consuming and computationally costly.
  • Solution: Amortized text-to-3D (ATT3D) - a unified model trained on many prompts simultaneously. It takes around 1 second on a single GPU to produce an accurate object. The main idea is a modulation technique applied to the Instant NGP encoding.

As a result, the model:

  • Generalizes to new prompts
  • Supports interpolation between prompts
  • Can amortize over settings other than text prompts

Pipeline

Fig 2. Pipeline

The main idea of this work is amortized optimization: using learning to predict solutions when we repeatedly solve similar instances of the same problem.

The text-to-3D (TT3D) process is split into two stages:

  • training: optimize one model offline to generate 3D objects for many different text prompts simultaneously. This amortizes optimization over the prompts by sharing work between similar instances.
  • cheap inference: the user-facing stage uses the amortized model in a simple feed-forward pass to quickly generate an object from text, with no further optimization required (a toy sketch of both stages follows this list).
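To make the two-stage split concrete, here is a toy, runnable PyTorch sketch. The text encoder, the amortized model, and the SDS loss are dummy stand-ins (a frozen embedding table, a linear layer, and a squared-norm loss); only the structure of the two stages mirrors the paper.

```python
import torch
import torch.nn as nn

# Dummy stand-ins: a frozen "text encoder", a tiny "amortized model",
# and a placeholder loss. Only the control flow mirrors ATT3D.
encoder = nn.Embedding(100, 32)                 # stand-in for CLIP
model = nn.Linear(32, 16)                       # stand-in for the amortized model
sds_loss = lambda out: out.pow(2).mean()        # stand-in for score distillation

opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1 (offline training): one model, many prompts, shared updates.
for step in range(1000):
    batch = torch.randint(0, 100, (8,))         # sample a batch of prompt ids
    with torch.no_grad():
        emb = encoder(batch)                    # text embeddings (frozen encoder)
    loss = sds_loss(model(emb))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 (cheap inference): a single feed-forward pass, no optimization.
with torch.no_grad():
    obj_params = model(encoder(torch.tensor([7])))
```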

Architecture

The resulting model consists of several parts:

  • Mapping network
  • NeRF
  • Spatial grid features with parameters w

The key idea is to compute a modulation from the text with the mapping network and apply it to the per-point encoding used by the NeRF. The algorithm can be summarized as follows: at each optimization step, the authors sample several prompts and look up their (potentially cached) text embeddings, which are used to compute the modulations. Camera poses and rendering conditions are also sampled. These are combined with the NeRF module to render images, to which the SDS loss is applied. A compact sketch of one such training step follows Fig 3.

Fig 3. Algorithm
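The per-step logic can be compressed into a short sketch. Everything here is an assumption-level stand-in: `render` replaces the Instant NGP volume renderer, the squared-norm loss replaces SDS guidance from a frozen diffusion model, camera sampling is omitted, and elementwise multiplication is just one plausible way to apply the modulation to the grid features.

```python
import torch
import torch.nn as nn

T, E, M, G = 64, 512, 32, 4096          # prompts, CLIP dim, modulation dim, grid cells

text_emb = torch.randn(T, E)            # cached text embeddings, one per prompt
mapping = nn.Sequential(nn.Linear(E, 256), nn.SiLU(), nn.Linear(256, M))
grid = nn.Parameter(torch.randn(G, M))  # spatial grid features w
nerf_head = nn.Sequential(nn.Linear(M, 64), nn.SiLU(), nn.Linear(64, 4))  # -> rgb + density

opt = torch.optim.Adam(
    list(mapping.parameters()) + [grid] + list(nerf_head.parameters()), lr=1e-3)

def render(features):                   # stand-in for Instant NGP volume rendering
    return features.mean(dim=1)         # pretend this produces an image

for step in range(100):
    idx = torch.randint(0, T, (4,))            # 1. sample a batch of prompts
    mod = mapping(text_emb[idx])               # 2. modulations from text embeddings
    cells = torch.randint(0, G, (4, 128))      # 3. sample points -> grid lookups
    feats = grid[cells] * mod[:, None, :]      # 4. modulate grid features per prompt
    img = render(nerf_head(feats))             # 5. "render" images
    loss = img.pow(2).mean()                   # 6. stand-in for the SDS loss
    opt.zero_grad(); loss.backward(); opt.step()
```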

In addition to amortizing optimization over many prompts, the authors show that it is possible to amortize over other variables, such as the guidance weight, regularizers, and data augmentation.

Fig 4. Interpolation over different parameters
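A minimal sketch of interpolation, reusing the hypothetical `text_emb` and `mapping` objects from the training-step snippet above: blend two cached text embeddings and map the result to a modulation. Amortizing over other settings (e.g., the guidance weight) works the same way, by appending them to the conditioning vector.

```python
alpha = 0.5                                   # interpolation weight
blend = alpha * text_emb[0] + (1 - alpha) * text_emb[1]
mod = mapping(blend)                          # modulation for the blended prompt
# rendering with `mod` yields an object "between" the two prompts
```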


Implementation details

Text Prompt datasets:

  • DreamFusion dataset (27 prompts from DreamFusion's main paper)
  • Compositional dataset: a compositional prompt set built by composing fragments with the template “a {animal} {activity} {theme}”. Using this template, the authors created a small pig-prompt dataset and a larger animal-prompt dataset (a small generation snippet follows the figures).
Fig 5. Different animal prompts
Fig 6. Pig prompts
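For illustration, such compositional prompts can be enumerated with itertools.product; the fragment lists below are made-up examples, not the paper's exact vocabulary.

```python
from itertools import product

animals = ["pig", "squirrel"]
activities = ["riding a motorcycle", "playing the guitar"]
themes = ["wearing a top hat", "made of wood"]

# Cartesian product of fragments -> 2 * 2 * 2 = 8 prompts
prompts = [f"a {a} {act} {th}" for a, act, th in product(animals, activities, themes)]
print(len(prompts), prompts[0])   # 8, "a pig riding a motorcycle wearing a top hat"
```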

Metrics:

  • Computational Cost: the number of rendered frames used for training, normalized by the number of prompts. Specifically, this is the number of optimization iterations times the batch size, divided by the total number of prompts in the dataset (a worked example follows the metric figures).
Fig 7. Computational budgets
  • Quality: CLIP R-precision, CLIP R-probability
Fig 8. Metrics
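A worked example of the cost metric with illustrative numbers (not the paper's actual budgets):

```python
# frames per prompt = iterations * batch_size / total number of prompts
iterations, batch_size, num_prompts = 10_000, 8, 2_400
frames_per_prompt = iterations * batch_size / num_prompts
print(frames_per_prompt)   # 33.33... rendered frames per prompt
```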

Models: Instant NGP (3D representation), CLIP (text embedding)

Mapping network options (sketched after the list):

  • concatenation option
  • hypernetwork option
  • attention option
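Minimal sketches of the three options, with assumed layer sizes and an assumed wiring into the NeRF; they illustrate the conditioning mechanisms, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

E, M = 512, 32                     # text-embedding dim, point-feature dim
emb = torch.randn(1, E)            # pooled text embedding
feat = torch.randn(1, M)           # a point's grid feature

# 1. Concatenation: append a projection of the text embedding to the feature.
proj = nn.Linear(E, M)
concat_in = torch.cat([feat, proj(emb)], dim=-1)      # input to the NeRF MLP

# 2. Hypernetwork: predict a NeRF layer's weights from the text embedding.
hyper = nn.Linear(E, M * M)
W = hyper(emb).view(M, M)
hyper_out = feat @ W.T                                # feature through predicted layer

# 3. Attention: the point feature attends over projected text tokens.
tokens = torch.randn(1, 8, M)                         # 8 projected text tokens
attn = nn.MultiheadAttention(M, num_heads=4, batch_first=True)
attn_out, _ = attn(feat.unsqueeze(1), tokens, tokens) # query = point feature
```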

Pros and cons

  • Pros: around 1 second per new prompt on a single GPU; generalization across different prompts and settings; the ability to interpolate.
  • Limitations: lack of diversity; problems with long text prompts; and, since there is no mesh optimization as in the Magic3D pipeline, some loss in quality.

Results

Fig 9. Comparison to per-prompt (one model per prompt) training
Fig 10. Comparison to per-prompt (one model per prompt) training
Fig 11. Some other results


Report Page