NeRF-Supervised Deep Stereo

https://t.me/reading_ai, @AfeliaN

🗂️ Project Page

📄 Paper

📎 GitHub

🗓 Date: 9 Jun 2023

Main idea

  • Motivation: training deep stereo networks requires a lot of data. To address this, self-supervised methods have been proposed, but they usually struggle in ill-posed stereo settings (e.g., occlusions, non-Lambertian surfaces).
  • Solution: use NeRF to render stereo pairs, building a new training dataset, and use the rendered depth to supervise the training process.

Pipeline

Fig 1. Pipeline
1. Step one: image collection

First, the authors collected multi-view images of multiple static scenes and estimated the camera poses and intrinsics with COLMAP.
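
The paper states that COLMAP provides the poses and intrinsics; below is a minimal sketch of this step using the pycolmap bindings (the paths are placeholders, and the authors may just as well have used the COLMAP command-line tools):

```python
import pathlib
import pycolmap

# Hypothetical layout: one folder of multi-view images per static scene.
image_dir = pathlib.Path("scenes/scene_000/images")
sfm_dir = pathlib.Path("scenes/scene_000/sfm")
sfm_dir.mkdir(parents=True, exist_ok=True)
database_path = sfm_dir / "database.db"

# Standard SfM pipeline: feature extraction, exhaustive matching, incremental mapping.
pycolmap.extract_features(database_path, image_dir)
pycolmap.match_exhaustive(database_path)
maps = pycolmap.incremental_mapping(database_path, image_dir, sfm_dir)

# The reconstruction contains the per-image poses and camera intrinsics
# that are later used to fit a NeRF to the scene.
maps[0].write(sfm_dir)
```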

2. Step two: learn NeRF for each scene
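
The paper relies on an off-the-shelf NeRF implementation, so the snippet below is only a reminder of how depth (and an accumulated-opacity map, assumed here to play the role of the Ambient Occlusion confidence used in step four) is obtained with the standard volume-rendering quadrature:

```python
import torch

def render_depth_and_acc(sigmas, ts, deltas):
    """Standard NeRF quadrature along a batch of rays.

    sigmas, ts, deltas: (num_rays, num_samples) densities, sample depths and
    distances between consecutive samples. Returns the expected termination
    depth and the accumulated opacity along each ray."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                  # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)         # inclusive product
    trans = torch.cat([torch.ones_like(trans[..., :1]),         # shift to exclusive
                       trans[..., :-1]], dim=-1)
    weights = alphas * trans                                    # rendering weights
    depth = (weights * ts).sum(dim=-1)                          # expected depth
    acc = weights.sum(dim=-1)                                   # accumulated opacity
    return depth, acc
```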

3. Step three: stereo pairs rendering

  • generate a set of stereo extrinsics: the relative rotation is the identity matrix and the translation is a vector of length b (the stereo baseline) along the camera x-axis
  • render two novel views (for the ±b translations). This yields a stereo triplet in which the three images are perfectly rectified
Fig 2. Stereo triplet
  • the disparity map is derived from the rendered depth via the pinhole relation d = f·b/z (see the sketch after this list)
Fig 3. Rendered depth
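
A small NumPy sketch of this step; the function names and sign conventions are illustrative, not taken from the official code:

```python
import numpy as np

def stereo_triplet_poses(center_c2w: np.ndarray, baseline: float):
    """Build left/right camera-to-world poses from the center one: identity
    relative rotation, translation of +-baseline along the camera x-axis,
    so that the three rendered views are rectified by construction."""
    x_axis = center_c2w[:3, 0]                    # camera x-axis in world coordinates
    left, right = center_c2w.copy(), center_c2w.copy()
    left[:3, 3] -= baseline * x_axis
    right[:3, 3] += baseline * x_axis
    return left, right

def depth_to_disparity(depth: np.ndarray, focal_px: float, baseline: float) -> np.ndarray:
    """Pinhole relation d = f * b / z (focal length in pixels)."""
    return focal_px * baseline / np.clip(depth, 1e-6, None)
```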

4. Step four: NeRF-Supervised Training Regime

To train the stereo model, the authors combine several losses:

  • Triplet Photometric loss
Fig 4. Triplet photometric loss

where \hat{I}^r_c is the center image reconstructed from the right view by warping it with the predicted disparity.

  • To deal with occlusions, the authors compute the same loss with respect to the third image as well and combine the two reconstruction errors (e.g., via a per-pixel minimum), since a pixel occluded in one view is usually visible in the other. The resulting term is shown below, followed by a PyTorch sketch:
Fig 5. Triplet photometric loss
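
A minimal PyTorch sketch of the triplet photometric term, keeping only an L1 image difference for brevity (the paper's photometric error also includes a structural-similarity term, and the warping sign convention below is an assumption):

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(src, disp, direction: float):
    """Reconstruct the center image by sampling `src` at x + direction * d.
    With the usual convention the right image is sampled at x - d and the
    left image at x + d (the exact sign depends on how the triplet is rendered)."""
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=src.device, dtype=src.dtype),
        torch.arange(w, device=src.device, dtype=src.dtype),
        indexing="ij",
    )
    xs = xs[None] + direction * disp.squeeze(1)        # (b, h, w) sampling x-coordinates
    ys = ys[None].expand_as(xs)
    grid = torch.stack([2 * xs / (w - 1) - 1,          # normalize to [-1, 1] for grid_sample
                        2 * ys / (h - 1) - 1], dim=-1)
    return F.grid_sample(src, grid, align_corners=True, padding_mode="border")

def triplet_photometric_loss(i_left, i_center, i_right, disp):
    """Photometric error of the center image against both reconstructions,
    combined with a per-pixel minimum so that pixels occluded in one view
    can still be explained by the other."""
    rec_from_right = warp_with_disparity(i_right, disp, -1.0)
    rec_from_left = warp_with_disparity(i_left, disp, +1.0)
    err_r = (i_center - rec_from_right).abs().mean(dim=1)
    err_l = (i_center - rec_from_left).abs().mean(dim=1)
    return torch.minimum(err_r, err_l).mean()
```
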
  • Loss between predicted and rendered disparities
Fig 6. Disparity loss

However, since depth maps rendered by NeRF often contain artifacts, the authors additionally use the rendered Ambient Occlusion (AO) as a confidence measure and filter the supervision so that only the most reliable pixels contribute to the disparity loss.
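
A sketch of this confidence-filtered disparity term, assuming the AO map is simply thresholded into a binary validity mask (the threshold value is a placeholder, not the paper's setting):

```python
import torch

def filtered_disparity_loss(pred_disp, nerf_disp, ao, tau: float = 0.5):
    """L1 loss against the NeRF-rendered disparity, restricted to pixels whose
    rendered Ambient Occlusion (used here as a confidence proxy) exceeds tau."""
    mask = (ao > tau).float()
    return (mask * (pred_disp - nerf_disp).abs()).sum() / mask.sum().clamp(min=1.0)
```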

The final loss can be written as follows:

Fig 7. Final loss

Implementation details

Models:

Evaluation datasets: KITTI, Midd-A, Midd-21

Metrics: PSNR, SSIM, LPIPS, KID scores

Compared with: MfS, with 3 different stereo networks

Pros and cons

  • Pros: the idea of using NeRF to supervise training or to generate new data is very promising for the depth estimation problem
  • Limitations: the samples collected so far are limited to small-scale, static scenes.

Results

Fig 8. MfS comparison
Fig 9. Zero-Shot generalization Benchmark
Fig 10. Qualitative Comparison



