Diffusion Models

By Ajay Bhargava

Background

The idea behind this project was to use premade diffusion models for image generation, as well as to train my own diffusion model. Some example results from this project are shown below.

Warped Image 1
A Campfire Right Side Up, Man Upside Down
Campanile Image 1
AI Predicts My Face
Campanile Image 1
Campanile Inpainted
Campanile Image 1
Number Generated By My Model

Implementing the Forward Process

The first step of this project was creating a way to add noise to any image. Given an image and a timestep in the range [0, 1000], where 0 represents the clean image and 1000 is pure noise, I used the following equation to produce the image with the corresponding amount of noise.

\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, 1) \] \[ \bar{\alpha}_t \text{ represents the noise coefficient, as chosen by DeepFloyd's creators.} \]
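As a concrete sketch, the forward process is a single weighted sum. This is a minimal NumPy version with a placeholder value for the noise coefficient, not DeepFloyd's actual schedule:

```python
import numpy as np

def forward(x0, alpha_bar_t, rng):
    """Noise a clean image x0 to timestep t:
       x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

# Dummy 8x8 "image"; alpha_bar_t = 0.5 is a placeholder schedule value.
rng = np.random.default_rng(0)
x0 = np.ones((8, 8))
x_t, eps = forward(x0, alpha_bar_t=0.5, rng=rng)
```

At t = 0 the coefficient is 1 and the image passes through untouched; as t grows toward 1000 the coefficient shrinks and the noise term dominates.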


Campanile Image 1
Campanile, t=0
Campanile Image 1
Noisy Campanile, t=250
Campanile Image 1
Noisy Campanile, t=500
Campanile Image 1
Noisy Campanile, t=750

Classical Denoising

The next step is to take in the noisy images as generated above, and attempt to 'denoise' them. The next couple of sections will investigate different ways to do so.

One method for denoising these images is simple Gaussian blur filtering. Applying a Gaussian filter smooths away some of the high-frequency noise, but it also blurs the image itself, and it gives poor results, as demonstrated below.

Campanile Image 1
Campanile, t=0
Campanile Image 1
Noisy Campanile, t=250
Campanile Image 1
Noisy Campanile, t=500
Campanile Image 1
Noisy Campanile, t=750
Campanile Image 1
Campanile, t=0
Campanile Image 1
Classically Denoised Campanile, t=250
Campanile Image 1
Classically Denoised Campanile, t=500
Campanile Image 1
Classically Denoised Campanile, t=750
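The blur above can be sketched as a separable convolution. This is a hypothetical minimal implementation; in practice a library routine (e.g. a torchvision or scipy Gaussian filter) would be used:

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    """Classical denoising: convolve with a Gaussian kernel, rows then columns."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()  # normalize so flat regions keep their brightness
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

img = np.ones((8, 8))
out = gaussian_blur(img)
```

The filter only averages neighboring pixels, so it cannot distinguish noise from real detail — which is why the results above look smeared rather than restored.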

One Step Denoising

The results of classical denoising are subpar. We can achieve better results with a UNet, a convolutional neural network. This model is trained on a vast number of image pairs in their noisy and clean forms. It can predict the Gaussian noise added to an image, and using this prediction, we can recover an estimate of the original image.

This section demonstrates one step denoising, where we pass in the noisy image as well as its timestep, and the model makes a single prediction about the noise. The examples of the noisy and the one step denoised images are below.

Campanile Image 1
Campanile, t=0
Campanile Image 1
Noisy Campanile, t=250
Campanile Image 1
Noisy Campanile, t=500
Campanile Image 1
Noisy Campanile, t=750
Campanile Image 1
Campanile, t=0
Campanile Image 1
One Step Denoised Campanile, t=250
Campanile Image 1
One Step Denoised Campanile, t=500
Campanile Image 1
One Step Denoised Campanile, t=750
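Inverting the forward equation gives the one-step clean-image estimate from the model's noise prediction. A NumPy sketch (the noise prediction itself would come from the UNet):

```python
import numpy as np

def one_step_denoise(x_t, eps_pred, alpha_bar_t):
    """Solve x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps for x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Round trip with the true noise recovers the clean image exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
eps = rng.standard_normal((8, 8))
abar = 0.3
x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps
x0_hat = one_step_denoise(x_t, eps, abar)
```

In practice the model's prediction is imperfect, so the recovered image grows blurrier as the timestep (and thus the amount of noise) increases.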

Iterative Denoising

The one-step denoising above does provide better results. We can build on this by having the model predict the noise at each timestep, instead of all at once. Denoising is applied iteratively at each timestep, from our starting timestep down to t = 0, which yields the denoised image. The output at several timesteps is shown below, along with the results of the other methods for comparison.

Campanile Image 1
Iterative Denoising at t=90
Campanile Image 1
Iterative Denoising at t=240
Campanile Image 1
Iterative Denoising at t=390
Campanile Image 1
Iterative Denoising at t=540
Campanile Image 1
Iterative Denoising at t=690
Campanile Image 1
Original Campanile
Campanile Image 1
Iterative Denoising Output
Campanile Image 1
One Step Denoising Output
Campanile Image 1
Gaussian Blur Output
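The iterative loop can be sketched as follows. This is a simplified, deterministic (DDIM-style) variant with hypothetical names; the full DDPM update also adds fresh scheduled noise at each step:

```python
import numpy as np

def iterative_denoise(x_t, timesteps, alpha_bar, predict_noise):
    """Walk from a high timestep down to t = 0, re-estimating the clean
    image at every step. For brevity, the estimate x0_hat is simply
    re-noised to the next (smaller) timestep with the predicted noise."""
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps = predict_noise(x_t, t)
        x0_hat = (x_t - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x_t = np.sqrt(alpha_bar[t_next]) * x0_hat + np.sqrt(1 - alpha_bar[t_next]) * eps
    return x_t

# With an oracle that returns the true noise, the loop recovers x0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
true_eps = rng.standard_normal((8, 8))
alpha_bar = {690: 0.2, 390: 0.5, 90: 0.9, 0: 1.0}
x_start = np.sqrt(alpha_bar[690]) * x0 + np.sqrt(1 - alpha_bar[690]) * true_eps
result = iterative_denoise(x_start, [690, 390, 90, 0], alpha_bar,
                           lambda x, t: true_eps)
```

Because each step only needs to remove a sliver of noise, the per-step predictions are easier and the final image is sharper than the one-step estimate.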

Sampling from the Diffusion Model

We used the model above to predict noise based on an image we already had, starting from a set timestep. However, we can generalize this by setting our timestep to the maximum and passing in an image of pure noise. This essentially samples random images from the UNet model. Five examples are shown below.

Campanile Image 1
Sample Generated Image
Campanile Image 1
Sample Generated Image
Campanile Image 1
Sample Generated Image
Campanile Image 1
Sample Generated Image
Campanile Image 1
Sample Generated Image

Classifier-Free Guidance

The images generated above are unique, but not very realistic. One way to generate better images is classifier-free guidance. We update our noise prediction to use a combination of the unconditional and conditional noise estimates, as demonstrated in the equation below. This yields better results than before.

\[ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) \quad \text{where} \quad \gamma \quad \text{is the strength of the CFG} \]
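The guidance update itself is one line; a sketch, where γ > 1 amplifies the conditional signal:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional noise
    estimate toward the conditional one by a factor gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

eu = np.zeros((4, 4))
ec = np.ones((4, 4))
```

At γ = 1 this reduces to the plain conditional prediction; larger values push the sample harder toward the prompt at the cost of some diversity.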

Campanile Image 1
CFG Sample Generated Image
Campanile Image 1
CFG Sample Generated Image
Campanile Image 1
CFG Sample Generated Image
Campanile Image 1
CFG Sample Generated Image
Campanile Image 1
CFG Sample Generated Image

Image to Image Translation

A similar process can be followed to produce image-to-image translations. By noising the image to different levels and then denoising back toward a clean image, we can create a range of pictures, each one more and more similar to the original. The i_start variable represents the timestep index we start from: lower values mean more noise is added at the beginning, and higher values mean less.
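The starting point for each translation can be sketched as follows (hypothetical names; `timesteps` is the decreasing schedule the iterative denoiser walks through, so a lower i_start picks a larger timestep and hence more noise):

```python
import numpy as np

def noisy_start(x0, i_start, timesteps, alpha_bar, rng):
    """Noise the input image to timesteps[i_start], then hand the result to
    the iterative denoiser. Lower i_start -> more noise -> larger edits."""
    t = timesteps[i_start]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))
timesteps = [990, 690, 390, 0]
alpha_bar = {990: 0.05, 690: 0.2, 390: 0.5, 0: 1.0}
# At the final index the image passes through unchanged (alpha_bar = 1).
x_same = noisy_start(x0, 3, timesteps, alpha_bar, rng)
```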

Campanile Image 1
i_start = 1
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
Campanile
Campanile Image 1
i_start = 1
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
Oski
Campanile Image 1
i_start = 1
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
Fish in Coral

Editing Hand-Drawn Images

This method is particularly interesting with hand-drawn images: it can turn anything you draw into something more realistic. I chose one cartoon as well as two random doodles I made and ran them through this process; the results are shown below.

Campanile Image 1
i_start = 1
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
Spongebob
Campanile Image 1
i_start = 1
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
My Drawing
Campanile Image 1
i_start = 1
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
My Drawing

Inpainting

We can follow a similar method to accomplish inpainting. This takes in an image and a mask, and regenerates only the part inside the mask while leaving the rest unchanged. At each step, we force the pixels outside the mask (the black region) to match the original image, with the correct amount of noise for that timestep added, and keep the model's output only inside the mask area. Results for several images and their corresponding masks are shown below.

Campanile Image 1
Campanile
Campanile Image 1
Mask
Campanile Image 1
Portion Replaced
Campanile Image 1
Campanile Inpainted
Campanile Image 1
Campanile
Campanile Image 1
Mask
Campanile Image 1
Portion Replaced
Campanile Image 1
Campanile Inpainted
Campanile Image 1
Me
Campanile Image 1
Mask
Campanile Image 1
Portion Replaced
Campanile Image 1
Not Me
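The per-step mask forcing can be sketched as follows (mask = 1 inside the region to regenerate, 0 outside):

```python
import numpy as np

def force_mask(x_t, x_orig, mask, alpha_bar_t, rng):
    """Keep the model's output inside the mask; outside it, overwrite with
    the original image noised to the current timestep."""
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(alpha_bar_t) * x_orig + np.sqrt(1 - alpha_bar_t) * eps
    return mask * x_t + (1 - mask) * x_orig_t

rng = np.random.default_rng(0)
x_orig = np.ones((4, 4))
x_t = np.full((4, 4), 5.0)
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0
# alpha_bar_t = 1 (t = 0): outside the mask is exactly the original image.
out = force_mask(x_t, x_orig, mask, 1.0, rng)
```

Noising the original to the current timestep (rather than pasting it in clean) keeps the forced pixels statistically consistent with the rest of the image at every step.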

Text Conditional Image to Image Translation

Currently, the model predicts noise based on the prompt "a high quality image". If we change the prompt to something more specific, the generated images move closer to that prompt the more noise we add. Some examples are shown below, with the original image on the right and the given prompt labeling the leftmost image (where i_start = 1).

Campanile Image 1
Rocket Ship
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
Campanile
Campanile Image 1
Cartoon Flying Car
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
Flying Car
Campanile Image 1
Guy wearing glasses
Campanile Image 1
i_start = 3
Campanile Image 1
i_start = 5
Campanile Image 1
i_start = 7
Campanile Image 1
i_start = 10
Campanile Image 1
i_start = 20
Campanile Image 1
Sona

Visual Anagrams

Another unique application of this process is creating a visual anagram: an image that appears to be two different things depending on which way it is flipped. To create one, we average the noise predicted for the first prompt with the noise predicted for the second prompt on the vertically flipped image, flipped back. The results are shown below; you can hover over an image to flip it.

Warped Image 1
Oil Painting of People Around a Campfire
Warped Image 1
Oil Painting of a Man
Warped Image 1
Oil Painting of a Tiger
Warped Image 1
Oil Painting of Sumo Wrestler

Warped Image 1
Oil Painting of Obama
Warped Image 1
Oil Painting of a Walrus
Warped Image 1
Oil Painting of a Surfer
Warped Image 1
Oil Painting of Fruits
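The anagram noise estimate can be sketched as follows, with `predict` standing in for the UNet's noise prediction for a given prompt:

```python
import numpy as np

def anagram_noise(x_t, predict, prompt_a, prompt_b):
    """Average the noise for prompt_a on the image with the noise for
    prompt_b computed on the flipped image, flipped back."""
    eps_a = predict(x_t, prompt_a)
    eps_b = np.flipud(predict(np.flipud(x_t), prompt_b))
    return (eps_a + eps_b) / 2.0

x = np.arange(16.0).reshape(4, 4)
# With an identity "predictor", both terms reduce to x and the average is x.
out = anagram_noise(x, lambda img, prompt: img, "campfire", "old man")
```

Flipping before prediction and unflipping after means each prompt steers its own orientation of the image, and the average satisfies both at once.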

Hybrid Images

Hybrid images appear to be one thing when up close, but another from afar. Here, we predict the noise for two different prompts and, at each step, run one through a high-pass filter and the other through a low-pass filter before combining them. This enables us to create the illusions seen below. Hover over the images to view them zoomed out.

Warped Image 1
Waterfall up close, Skull from afar
Warped Image 1
Puppies up close, Chicken from afar
Warped Image 1
Winter Day up close, Panda from afar
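The hybrid noise estimate can be sketched as follows (a minimal version using a Gaussian blur as the low-pass filter; the filter choice and sizes are assumptions):

```python
import numpy as np

def blur(img, sigma=2.0, radius=4):
    """Separable Gaussian low-pass filter."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2)); k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def hybrid_noise(x_t, predict, prompt_far, prompt_near):
    """Low-pass the noise for the from-afar prompt, high-pass the noise for
    the up-close prompt, and combine."""
    low = blur(predict(x_t, prompt_far))
    eps_near = predict(x_t, prompt_near)
    high = eps_near - blur(eps_near)  # high-pass = original minus low-pass
    return low + high

x = np.ones((16, 16))
out = hybrid_noise(x, lambda img, p: img, "skull", "waterfall")
```

The low frequencies (visible from afar) follow one prompt while the high frequencies (visible up close) follow the other, exactly like a classical hybrid-image construction.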

Training a Single-Step UNet

The above sections all rely on a pretrained UNet in order to perform the tasks. The following sections will now investigate how I can train my own model to achieve similar results.

I began by creating a single-step denoising UNet. It takes in a noisy image and predicts the original image, optimizing an L2 loss between the original and predicted images. I followed the architecture shown below:

Warped Image 1
UNet Architecture

I trained this model using the L2 loss and the Adam optimizer; the loss across iterations is shown below.

Warped Image 1
Loss Graph
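One training step can be sketched as follows (a NumPy stand-in for the real PyTorch loop; `model` plays the role of the UNet):

```python
import numpy as np

def l2_loss(pred, target):
    """Mean squared error between the predicted and clean images."""
    return float(np.mean((pred - target) ** 2))

def training_step(model, x0, sigma=0.5, rng=np.random.default_rng(0)):
    """Noise the clean image with sigma = 0.5, run the denoiser, and score
    the reconstruction; Adam then minimizes this loss."""
    x_noisy = x0 + sigma * rng.standard_normal(x0.shape)
    return l2_loss(model(x_noisy), x0)

x0 = np.ones((8, 8))
# A perfect denoiser (always returning the clean image) gets zero loss.
loss = training_step(lambda x: x0, x0)
```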

Here are the results after 1 epoch.

Warped Image 1
Original Image
Warped Image 1
Noisy Image
Warped Image 1
Denoised Image (After 1 epoch)

Warped Image 1
Original Image
Warped Image 1
Noisy Image
Warped Image 1
Denoised Image (After 1 epoch)

Warped Image 1
Original Image
Warped Image 1
Noisy Image
Warped Image 1
Denoised Image (After 1 epoch)

Here are the results after all 5 epochs.

Warped Image 1
Original Image
Warped Image 1
Noisy Image
Warped Image 1
Denoised Image (After 5 epochs)

Warped Image 1
Original Image
Warped Image 1
Noisy Image
Warped Image 1
Denoised Image (After 5 epochs)

Warped Image 1
Original Image
Warped Image 1
Noisy Image
Warped Image 1
Denoised Image (After 5 epochs)

This model was trained with a σ value of 0.5, meaning the random noise is multiplied by 0.5 before being added to the original image. We can vary this value and test how well our model, trained on σ = 0.5, works for other noise levels.

Warped Image 1
Noisy, σ = 0
Warped Image 1
Noisy, σ = 0.2
Warped Image 1
Noisy, σ = 0.4
Warped Image 1
Noisy, σ = 0.5
Warped Image 1
Noisy, σ = 0.6
Warped Image 1
Noisy, σ = 0.8
Warped Image 1
Noisy, σ = 1
Warped Image 1
Denoised, σ = 0
Warped Image 1
Denoised, σ = 0.2
Warped Image 1
Denoised, σ = 0.4
Warped Image 1
Denoised, σ = 0.5
Warped Image 1
Denoised, σ = 0.6
Warped Image 1
Denoised, σ = 0.8
Warped Image 1
Denoised, σ = 1

Time Conditioned UNet

As we saw from the earlier denoising results, better results come from iteratively denoising an image. The following part is based on the paper Denoising Diffusion Probabilistic Models (2020).

By adding a scalar t as a parameter to our model, we can condition it on the timestep; the way t is injected into the network is demonstrated in the image below.

Warped Image 1
UNet Architecture with t
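The conditioning can be sketched as a small fully-connected block whose output is broadcast onto a feature map (hypothetical shapes; the actual placement of the blocks follows the architecture diagram above):

```python
import numpy as np

def fc_block(t, w1, b1, w2, b2):
    """Two-layer MLP (Linear -> ReLU -> Linear) that embeds the normalized
    timestep t into a vector with one entry per feature channel."""
    h = np.maximum(0.0, w1 @ np.array([t]) + b1)
    return w2 @ h + b2

def add_condition(feature_map, t_embed):
    """Broadcast the embedding across the spatial dims of a (C, H, W) map."""
    return feature_map + t_embed[:, None, None]

rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((8, 1)), np.zeros(8)
w2, b2 = rng.standard_normal((3, 8)), np.zeros(3)
emb = fc_block(0.5, w1, b1, w2, b2)
out = add_condition(np.zeros((3, 4, 4)), emb)
```

Because t is a single scalar, the embedding MLP is tiny, yet it lets one network denoise at every noise level instead of a fixed σ.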

The training algorithm for this model is as follows:

Warped Image 1
Training Algorithm

The loss curve for the training of this model is shown below:

Warped Image 1
Loss Curve

Results for this model at 5 epochs and at 20 epochs are shown below:

Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Generated Image at 5 epochs
Warped Image 1
Generated Image at 5 epochs
Warped Image 1
Generated Image at 5 epochs
Warped Image 1
Generated Image at 5 epochs
Warped Image 1
Generated Image at 5 epochs




Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Pure Noise
Warped Image 1
Generated Image at 20 epochs
Warped Image 1
Generated Image at 20 epochs
Warped Image 1
Generated Image at 20 epochs
Warped Image 1
Generated Image at 20 epochs
Warped Image 1
Generated Image at 20 epochs

Class-Conditioned UNet

We can improve on these results by conditioning the UNet on the digit classes 0-9. This involves training with each image's digit label, passed into the model as an additional conditioning parameter.

Warped Image 1
Training Algorithm
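The class input can be sketched as a one-hot vector that is occasionally dropped so the model also learns unconditional generation (the dropout probability here is an assumed value, mirroring standard classifier-free-guidance training):

```python
import numpy as np

def class_vector(label, num_classes=10, p_uncond=0.1, rng=np.random.default_rng(0)):
    """One-hot encode the digit label; with probability p_uncond, zero the
    vector so the model also learns to generate unconditionally."""
    c = np.zeros(num_classes)
    c[label] = 1.0
    if rng.random() < p_uncond:
        c[:] = 0.0
    return c

c = class_vector(3, p_uncond=0.0)  # never dropped
```

Keeping an unconditional mode available is what makes classifier-free guidance possible at sampling time, since it needs both the conditional and unconditional noise estimates.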

The loss curve for the training of this model is shown below:

Warped Image 1
Loss Curve

Results of the model at 5 epochs are shown below.

Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9
Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9
Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9
Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9


Results of the model at 20 epochs are shown below.

Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9
Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9
Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9
Warped Image 1
Class 0
Warped Image 1
Class 1
Warped Image 1
Class 2
Warped Image 1
Class 3
Warped Image 1
Class 4
Warped Image 1
Class 5
Warped Image 1
Class 6
Warped Image 1
Class 7
Warped Image 1
Class 8
Warped Image 1
Class 9