Diffusion with Forward Models:
Solving Stochastic Inverse Problems Without Direct Supervision

1Massachusetts Institute of Technology, 2Princeton IAS, *Equal Contributions
This diffusion model directly samples from the distribution of 3D scenes conditioned on a single image, trained only on 2D images.

Denoising diffusion models have emerged as a powerful class of generative models capable of capturing the distributions of complex, real-world signals. However, current approaches can only model distributions for which training samples are directly accessible, which is not the case in many real-world tasks. In inverse graphics, for instance, we seek to sample from a distribution over 3D scenes consistent with an image but do not have access to ground-truth 3D scenes, only 2D images.

We present a new class of denoising diffusion probabilistic models that learn to sample from distributions of signals that are never observed directly, but instead are only measured through a known differentiable forward model that generates partial observations of the unknown signal. To accomplish this, we directly integrate the forward model into the denoising process. At test time, our approach enables us to sample from the distribution over underlying signals consistent with some partial observation.
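To make the idea of placing the forward model inside the denoising step more concrete, here is a minimal toy sketch, not the authors' implementation. It assumes the noise is applied to a target observation, that a denoiser estimates the full underlying signal from the noisy target and a context observation, and that the loss compares the re-rendered estimate to the clean observation. The masking `forward_model`, the linear `toy_denoiser`, and all parameter names are hypothetical stand-ins for a differentiable renderer and a neural network.

```python
import numpy as np

def forward_model(signal, mask):
    """Toy forward model (assumption): reveal only the masked entries."""
    return signal * mask

def toy_denoiser(noisy_target, context_obs, t, weights):
    """Hypothetical linear stand-in for a neural denoiser.

    Maps the noisy target observation, the context observation, and the
    timestep to an estimate of the full underlying signal.
    """
    features = np.concatenate([noisy_target, context_obs, [t]])
    return weights @ features

def training_loss(context_obs, target_obs, target_mask, t, alpha_bar_t, weights, rng):
    """Evaluate one training loss using observations only."""
    # Standard DDPM forward noising, applied to the observation (assumption).
    noise = rng.standard_normal(target_obs.shape)
    noisy_target = (np.sqrt(alpha_bar_t) * target_obs
                    + np.sqrt(1.0 - alpha_bar_t) * noise)
    # Estimate the unobserved signal, then map it through the forward model.
    signal_est = toy_denoiser(noisy_target, context_obs, t, weights)
    target_est = forward_model(signal_est, target_mask)
    # The loss never touches the underlying signal, only its observation.
    return float(np.mean((target_est - target_obs) ** 2))
```

In this sketch the underlying signal appears only as an intermediate quantity inside the denoiser, which is why no ground-truth signal is needed to compute the loss.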

We demonstrate the efficacy of our approach on three challenging computer vision tasks. For instance, in inverse graphics, we demonstrate that our model enables us to directly sample from the distribution of 3D scenes consistent with a single 2D input image.

Inverse Graphics

We first show results on reconstructing 3D scenes from a single input image. Each result shows the input image (left), a rendering along a smooth camera path through the reconstructed 3D scene (middle), and the corresponding depth map rendered from the reconstructed 3D scene (right). Note that our method reconstructs high-quality geometry for the visible part of the scene and, because it is a generative model, plausible geometry and appearance even outside the visible 3D region. Our model can directly reason about the uncertainties in 3D space, conditioned on a single image.

We show results on two challenging datasets: RealEstate10k, which includes complex unbounded indoor and outdoor scenes, and Co3D, which includes 360-degree object-centric scenes.


Co3D Hydrants

Sample Diversity

Our model learns to reason about the uncertainty in the 3D reconstruction. We can sample from this space of uncertainty, i.e., we can sample multiple plausible scenes conditioned on the same context image.

Here, the input image is shown on the left. The middle cell shows the results of our method: partway through the video, we stop and explore multiple plausible scenes from that viewpoint.

The cell on the right visualizes the pixel-wise variance across 10 sampled scenes. Note that close to the input view there is very little uncertainty, and that the uncertainty is largest in the invisible parts of the scene. Our model even captures the monocular geometric uncertainties in the visible regions of the scene. For example, in the first example, notice that away from the input camera view, the table and couch are at different distances to the camera.
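A pixel-wise variance map of this kind can be computed as follows. This is a small sketch under assumptions: the function name `pixelwise_variance` is hypothetical, the renderings are taken to be stacked as an array of shape (num_samples, H, W, C), and averaging the variance over color channels is an illustrative choice, not necessarily what was used for the visualizations here.

```python
import numpy as np

def pixelwise_variance(renders):
    """Return an (H, W) variance map from renders of shape (N, H, W, C).

    The variance is taken across the N samples for every pixel and channel,
    then averaged over channels (an assumption made for illustration).
    """
    renders = np.asarray(renders, dtype=np.float64)
    return renders.var(axis=0).mean(axis=-1)
```

Pixels on which all samples agree get a variance of zero, so regions near the input view would appear dark in such a map.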

You can find complete videos of many different samples at the bottom of the page.


Sampling Motion Fields

We further demonstrate two more applications of our model. Here, we show results for sampling multiple plausible motion fields consistent with a single static image. Our model can reason about potential motion that is consistent with the static input image.

In this visualization, we apply the motion field continuously to the input image to create a very short video clip. Note that the motion field only describes small, local motion. Please zoom in to observe the deformations (for example, in the left-most example, notice the motion on the face in the first sample, and on the hands in the second sample).
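One simple way to turn a single motion field into a short clip is to warp the image by progressively larger fractions of the field. The sketch below is an assumption about how such a visualization could be produced, not the procedure used for the videos on this page. It assumes the field is an (H, W, 2) array of (dx, dy) pixel displacements, approximates the motion with backward warping, and uses nearest-neighbor sampling. The names `warp_nearest` and `motion_clip` are hypothetical.

```python
import numpy as np

def warp_nearest(image, flow):
    """Backward-warp an image by a flow field using nearest-neighbor lookup.

    image: (H, W) or (H, W, C); flow: (H, W, 2) holding (dx, dy) in pixels.
    Source coordinates that fall outside the image are clamped to the border.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys - flow[..., 1]), 0, h - 1).astype(int)
    return image[src_y, src_x]

def motion_clip(image, flow, num_frames=8):
    """Stack frames warped by 0%, ..., 100% of the motion field."""
    scales = np.linspace(0.0, 1.0, num_frames)
    return np.stack([warp_nearest(image, s * flow) for s in scales])
```

Because the displacements are small, bilinear sampling would give smoother frames than nearest-neighbor lookup. The simpler version is used here only to keep the sketch short.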

More Variations of 3D Reconstructions

Each grid shows a distribution of scenes generated from the same context image. The right-most cell represents the per-pixel variance across samples. Notice how the variance increases as we move farther from the context image.


      title = { Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision },
      author = { Tewari, Ayush and 
                 Yin, Tianwei and 
                 Cazenavette, George and 
                 Rezchikov, Semon and 
                 Tenenbaum, Joshua B. and 
                 Durand, Frédo and 
                 Freeman, William T. and 
                 Sitzmann, Vincent },
      year = { 2023 },
      booktitle = { arXiv },