One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

https://t.me/reading_ai, @AfeliaN

🗂️ Project Page

📄 Paper

📎 GitHub (coming soon)

Main idea

Fig 1. Main idea

Motivation: existing methods suffer from several main problems:

  • time-consuming (a NeRF has to be optimized for each scene)
  • memory-intensive (they mostly work only with low-resolution images)
  • 3D-inconsistent results
  • poor geometry

Solution: One-2-3-45, a model that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. The authors use the Zero123 model to generate multi-view predictions from the single input image, so that multi-view 3D reconstruction techniques can be leveraged to obtain a 3D mesh. To improve the geometry, they rely on an SDF-based neural surface representation.
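For orientation, here is a minimal sketch of the overall feed-forward flow. All helper names are hypothetical placeholders standing in for the two pretrained components, not the authors' API:

```python
import numpy as np

def predict_posed_views(input_rgb: np.ndarray, n_views: int = 8):
    """Stand-in for the Zero123 stage: multi-view RGB predictions + camera poses."""
    h, w, _ = input_rgb.shape
    return [(np.zeros((h, w, 3)), np.eye(4)) for _ in range(n_views)]

def reconstruct_textured_mesh(posed_views):
    """Stand-in for the SparseNeuS-style stage: cost volume -> SDF + color -> mesh."""
    return {"vertices": np.zeros((0, 3)), "faces": np.zeros((0, 3), dtype=int)}

def image_to_mesh(input_rgb: np.ndarray):
    # Single feed-forward pass: no per-shape NeRF optimization.
    posed_views = predict_posed_views(input_rgb)      # 2D diffusion predictions
    return reconstruct_textured_mesh(posed_views)     # generalizable 3D reconstruction

mesh = image_to_mesh(np.zeros((256, 256, 3), dtype=np.float32))
print(len(mesh["vertices"]), "vertices")
```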

Pipeline

Fig 2. Pipeline

This work builds on two main techniques:

  • SparseNeuS - a neural-rendering-based method for surface reconstruction from sparse posed images
  • Zero123 - given a single RGB image of an object and a relative camera transformation, Zero123 controls a diffusion model to synthesize a new image under the transformed camera view (see the sketch below).
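The relative camera transformation that conditions Zero123 is usually expressed as spherical deltas between the input and target viewpoints. A minimal sketch of that parameterization (illustrative; the exact embedding in the released Zero123 code may differ):

```python
import numpy as np

def relative_camera_condition(src, tgt):
    """Relative viewpoint as (d_elevation, sin d_azimuth, cos d_azimuth, d_radius).

    `src` and `tgt` are (elevation_rad, azimuth_rad, radius) tuples describing
    cameras on a sphere around the object; this mirrors Zero123-style view
    conditioning in spirit, but the exact feature layout is an assumption.
    """
    d_theta = tgt[0] - src[0]          # elevation difference
    d_phi = tgt[1] - src[1]            # azimuth difference (wrapped via sin/cos)
    d_r = tgt[2] - src[2]              # camera-distance difference
    return np.array([d_theta, np.sin(d_phi), np.cos(d_phi), d_r])

# Example: ask for a view rotated 30 deg in azimuth and 10 deg up in elevation.
cond = relative_camera_condition(
    src=(np.deg2rad(0.0), np.deg2rad(0.0), 1.5),
    tgt=(np.deg2rad(10.0), np.deg2rad(30.0), 1.5),
)
print(cond)
```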

Zero123 is a very promising way to turn a single image into a multi-view dataset for 3D shape reconstruction. However, the authors show that naively reconstructing from its outputs is not satisfactory, primarily because Zero123's predictions are not multi-view consistent.

Fig 3. Zero123 prediction

To deal with this problem, instead of the usual optimization-based approaches the authors base the reconstruction module on SparseNeuS, a generalizable SDF reconstruction method. The reconstruction module takes m posed source images as input, builds a cost volume, and learns the SDF and color fields.
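Because the module predicts an SDF rather than a raw density, colors along a ray are composited with weights derived from the SDF. A simplified NeuS-style sketch of that step, assuming the logistic-CDF formulation (SparseNeuS additionally conditions everything on cost-volume features):

```python
import numpy as np

def sdf_to_alpha(sdf_samples: np.ndarray, inv_s: float = 64.0) -> np.ndarray:
    """Convert SDF values sampled along a ray into alpha values, NeuS-style.

    Uses the logistic CDF Phi_s(x) = sigmoid(inv_s * x); the alpha between two
    consecutive samples is the normalized drop of Phi_s across the SDF, which
    peaks near the zero level set (the surface).
    """
    cdf = 1.0 / (1.0 + np.exp(-inv_s * sdf_samples))
    return np.clip((cdf[:-1] - cdf[1:]) / (cdf[:-1] + 1e-6), 0.0, 1.0)

def render_ray(rgb_samples: np.ndarray, sdf_samples: np.ndarray) -> np.ndarray:
    """Alpha-composite per-sample colors into a single pixel color."""
    alpha = sdf_to_alpha(sdf_samples)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = transmittance * alpha
    return (weights[:, None] * rgb_samples[:-1]).sum(axis=0)

# Toy ray: the SDF goes from positive (outside) to negative (inside) mid-ray.
sdf = np.linspace(0.5, -0.5, 64)
rgb = np.ones((64, 3)) * 0.8
print(render_ray(rgb, sdf))
```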

In more detail:

  1. Render n ground-truth RGB (and depth) images of the training shape
  2. For each of the n views, use Zero123 to predict 4 nearby views
  3. During training, feed all 4×n predictions with ground-truth poses into the reconstruction module and randomly choose one of the n ground-truth RGB images as the target view
  4. Supervise training with both depth and RGB losses (a toy sketch of this step follows the list)
  5. Additionally, the elevation of the input view is estimated so that the camera poses of all views can be recovered
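A toy numpy sketch of the supervision step described above, with hypothetical tensor layouts and an externally supplied renderer standing in for the reconstruction module:

```python
import numpy as np

rng = np.random.default_rng(0)

def supervision_loss(render_target_view, source_views, gt_views, lambda_depth=1.0):
    """RGB + depth supervision for one iteration (illustrative, not the authors' code).

    source_views: the 4*n Zero123 predictions with their ground-truth poses.
    gt_views:     the n ground-truth renderings, each a dict with 'rgb', 'depth', 'pose'.
    render_target_view: callable (source_views, target_pose) -> {'rgb', 'depth'},
                        standing in for the reconstruction module's renderer.
    """
    # Randomly choose one of the n ground-truth views as the target view.
    target = gt_views[rng.integers(len(gt_views))]
    rendered = render_target_view(source_views, target["pose"])

    rgb_loss = np.abs(rendered["rgb"] - target["rgb"]).mean()
    depth_loss = np.abs(rendered["depth"] - target["depth"]).mean()
    return float(rgb_loss + lambda_depth * depth_loss)

# Toy check with a dummy renderer and random data (n = 2, so 4*n = 8 sources).
dummy = lambda srcs, pose: {"rgb": np.zeros((16, 16, 3)), "depth": np.zeros((16, 16))}
mk = lambda: {"rgb": rng.random((16, 16, 3)), "depth": rng.random((16, 16)), "pose": np.eye(4)}
print(supervision_loss(dummy, [mk() for _ in range(8)], [mk() for _ in range(2)]))
```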

Implementation details

Dataset: Objaverse-LVIS

Models:

  • Zero123 → generate multi-view images
  • SparseNeuS → SDF prediction

Compared with:

Metrics:

  • 3D reconstruction: Chamfer Distance (CD) and volumetric IoU (a small sketch of both follows)
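For reference, a minimal numpy sketch of both metrics on point-cloud / occupancy-grid proxies (the paper's exact evaluation protocol, e.g. alignment and sampling density, may differ):

```python
import numpy as np

def chamfer_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two point clouds (N x 3 and M x 3)."""
    # Pairwise squared distances, then nearest-neighbor terms in both directions.
    d2 = ((points_a[:, None, :] - points_b[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def volume_iou(occ_a: np.ndarray, occ_b: np.ndarray) -> float:
    """Volumetric IoU between two boolean occupancy grids of the same shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return float(inter / max(union, 1))

# Toy examples: two noisy samplings of a unit sphere, and two shifted boxes.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print(chamfer_distance(pts, pts + 0.01 * rng.normal(size=pts.shape)))

occ = np.zeros((32, 32, 32), dtype=bool); occ[8:24, 8:24, 8:24] = True
print(volume_iou(occ, np.roll(occ, 2, axis=0)))
```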

Pros and cons

Results

Images Comparison

Fig 4. Comparison with different models
Fig 5. More results
Fig 6. Comparison with Shap-E
Fig 7. Text-to-3D comparison

Metrics

Fig 8. Metrics and time cost

Some results

Fig 9. Some results





