One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization
https://t.me/reading_ai, @AfeliaN
Project Page
Paper
GitHub (coming soon)
Main idea

Motivation: existing methods suffer from several main problems:
- time-consuming (a NeRF must be optimized per scene)
- memory-intensive (they mostly work only for low-resolution images)
- 3D-inconsistent results
- poor geometry
Solution: One-2-3-45, a model that takes a single image of any object as input and generates a full 360-degree textured 3D mesh in a single feed-forward pass. The authors use the Zero123 model to generate multi-view predictions from the single input image, so that multi-view 3D reconstruction techniques can be leveraged to obtain the 3D mesh. To improve the geometry, they use an SDF-based neural surface reconstruction model.
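The overall flow (single image, then multi-view predictions, then mesh) can be sketched as a feed-forward skeleton. The function names and stubbed bodies below are illustrative stand-ins, not the paper's actual interfaces:

```python
import numpy as np

def zero123_predict(image, rel_poses):
    # Stub standing in for Zero123: one synthesized H x W x 3 view per
    # requested relative camera pose.
    return [np.zeros_like(image) for _ in rel_poses]

def reconstruct_mesh(views, poses):
    # Stub standing in for the SparseNeuS-style reconstruction module,
    # which would return mesh (vertices, faces) extracted from the SDF.
    return np.zeros((0, 3)), np.zeros((0, 3), dtype=int)

def one_2_3_45(image, rel_poses):
    """Single image -> multi-view predictions -> textured mesh, one pass."""
    views = zero123_predict(image, rel_poses)
    return reconstruct_mesh(views, rel_poses)
```

The key design point is that both stages are feed-forward: no per-shape NeRF optimization loop appears anywhere in the pipeline.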
Pipeline

This work builds on two main techniques:
- SparseNeuS - a neural-rendering-based method for surface reconstruction from sparse input images
- Zero123 - given a single RGB image of an object and a relative camera transformation, Zero123 controls a diffusion model to synthesize a new image under the transformed camera view.
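The "relative camera transformation" that conditions Zero123 is specified in spherical coordinates (change in elevation, azimuth, and radius). A minimal sketch of computing it, with illustrative names and angles in degrees:

```python
def relative_pose(elev_in, azim_in, r_in, elev_out, azim_out, r_out):
    """Relative spherical transform between an input view and a target view.

    Returns (d_elevation, d_azimuth, d_radius); azimuth difference is
    wrapped to [0, 360). Names and conventions here are illustrative.
    """
    d_elev = elev_out - elev_in
    d_azim = (azim_out - azim_in) % 360.0  # wrap around the full circle
    d_r = r_out - r_in
    return d_elev, d_azim, d_r
```

For example, going from azimuth 350 degrees to 10 degrees is a 20-degree rotation, not a -340-degree one, which is why the wrap matters.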
Zero123 is a promising way to create a multi-view dataset from a single image for reconstructing 3D shapes. But the authors show that the resulting reconstructions are not satisfactory, primarily due to the inconsistency of Zero123's predictions across views.

To deal with this problem, instead of using the usual optimization-based approaches, the authors base their reconstruction module on SparseNeuS, a generalizable SDF reconstruction method. The reconstruction module takes m posed source images as input, builds a cost volume, and learns the SDF and color.
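An SDF represents a surface as the zero level set of a signed distance function, which can be rendered by marching along rays. A minimal sphere-tracing sketch (an illustrative toy, not the paper's volume renderer):

```python
import numpy as np

def sdf_sphere(p, center=np.zeros(3), radius=0.5):
    # Signed distance to a sphere: negative inside, positive outside.
    return np.linalg.norm(p - center) - radius

def sphere_trace(origin, direction, sdf, max_steps=100, eps=1e-4, far=10.0):
    """March along a ray, stepping by the SDF value each iteration.

    Returns the depth t of the first surface hit, or None on a miss.
    """
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:          # close enough to the zero level set
            return t
        t += d               # the SDF value is a safe step size
        if t > far:
            return None
    return None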
In more detail:
- Render n "ground-truth" RGB views with Zero123
- For each of the n views, generate 4 nearby views
- During training, the authors feed all 4×n predictions with their ground-truth poses into the reconstruction module and randomly choose one of the n "ground-truth" RGB views as the target view.
- Training is supervised with both depth and RGB
- Additionally, the camera pose of the input view is estimated so that all views share a consistent frame.
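The view layout above (n source viewpoints, 4 nearby views each) can be sketched as follows; the uniform azimuth spacing, offset angles, and base elevation are illustrative assumptions, not the paper's exact values:

```python
def build_view_set(n=8, d_azim=10.0, d_elev=10.0, elev=30.0):
    """Return 4*n (azimuth, elevation) pairs in degrees.

    For each of n source viewpoints spaced uniformly in azimuth, emit
    4 nearby views offset in azimuth or elevation, mirroring the 4*n
    predictions fed to the reconstruction module.
    """
    views = []
    for i in range(n):
        azim = 360.0 * i / n
        for da, de in [(-d_azim, 0.0), (d_azim, 0.0),
                       (0.0, -d_elev), (0.0, d_elev)]:
            views.append(((azim + da) % 360.0, elev + de))
    return views
```

The nearby views matter because the cost volume needs overlapping viewpoints to establish multi-view correspondence.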
Implementation details
Dataset: Objaverse-LVIS
Models:
- Zero123: multi-view image generation
- SparseNeuS: SDF prediction
Compared with:
- 3D reconstruction: Point-E, Shap-E, Zero123, 3DFuse, RealFusion
Metrics:
- 3D reconstruction: Chamfer Distance (CD), volumetric IoU
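Both metrics are straightforward to compute on small evaluation sets. A brute-force sketch (real benchmarks would use KD-trees and aligned, normalized shapes):

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (N,3) and B (M,3).

    Brute-force pairwise distances; fine for small point clouds.
    """
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def voxel_iou(occ_a, occ_b):
    """Volumetric IoU between two boolean occupancy grids of equal shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / union
```

Chamfer distance rewards surfaces that lie close to the ground truth pointwise, while volumetric IoU penalizes missing or hallucinated volume; reporting both gives a more complete picture of reconstruction quality.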
Pros and cons
- Pros: the SDF-based representation yields better geometry, and the method is fast since it needs no per-shape optimization.
- Limitations: a comparison with ATT3D is still needed.
Results
Images Comparison




Metrics

Some results
