CS280A Project 5: Fun with Diffusion Models

by Junhua (Michael) Ma

Part A: The Power of Diffusion Models

Diffusion Model Sampling

For Part A of the project, the DeepFloyd IF diffusion model is used. In this section, the model generates images for selected prompts under different settings of num_inference_steps (nis for short).

Prompt: "an oil painting of a snowy mountain village"

nis=3

nis=6

nis=10

nis=20

nis=40

Prompt: "a man wearing a hat"

nis=3

nis=6

nis=10

nis=20

nis=40

Prompt: "a rocket ship"

nis=3

nis=6

nis=10

nis=20

nis=40

From the results, num_inference_steps affects the quality of the generated image: higher values give higher-quality images but take longer to run, while lower values give poorer images that are noisier and blurrier.

Forward & Denoising

Forward Process

The forward process is a key part of diffusion: it adds random Gaussian noise to a clean image x_0 to produce a noisy image x_t for t in [0, T]. The clean image and the noise are interpolated so that the amount of noise can be controlled, with x_T being a pure-noise image.
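As a minimal sketch of the interpolation described above, the snippet below assumes the standard DDPM form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule, the function names, and the random stand-in image are illustrative assumptions, not the DeepFloyd scheduler's actual values.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Hypothetical linear beta schedule; the real model ships its own schedule.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward(x0, t, alphas_cumprod, rng):
    # Interpolate between the clean image and Gaussian noise at timestep t.
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alphas_cumprod = make_alphas_cumprod()
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for a clean image
x_250, _ = forward(x0, 250, alphas_cumprod, rng)
x_750, _ = forward(x0, 750, alphas_cumprod, rng)
```

Under this assumed schedule, x_250 still correlates noticeably with x0 while x_750 is close to pure noise, matching the progression shown for the Campanile.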

Campanile, forwarding with T = 999

t = 0 (original)

t = 250

t = 500

t = 750

Classical Denoising

Simple image denoising that uses Gaussian blur filtering to remove noise.

Campanile, gaussian blur denoising with T = 999

Noisy

t = 250

t = 500

t = 750

Denoised

One-Step Denoising

Image denoising using the stage 1 unet of the DeepFloyd diffusion model as the denoiser, with the prompt "a high quality photo". The unet outputs the estimated noise, and the denoised image is obtained by removing the estimated noise from the noisy image, with both terms scaled according to the noise level at t.
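A minimal sketch of the one-step estimate, assuming the standard DDPM forward relation x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps and solving it for x_0. The function name and the value of abar_t are illustrative assumptions; in practice eps_hat would come from the unet rather than being the true noise.

```python
import numpy as np

def estimate_clean(x_t, eps_hat, abar_t):
    # Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0.
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # stand-in clean image
eps = rng.standard_normal(x0.shape)
abar_t = 0.3                                   # illustrative cumulative alpha
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# With a perfect noise estimate the clean image is recovered exactly.
x0_hat = estimate_clean(x_t, eps, abar_t)
```

With an imperfect noise estimate, the error is amplified by 1 / sqrt(abar_t), which is one way to see why one-step results degrade at larger t.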

Campanile, one-step denoising with T = 999

Noisy

t = 250

t = 500

t = 750

Denoised

Iterative Denoising

Using the same one-step denoising setup with the DeepFloyd diffusion model, we can instead denoise the image iteratively over multiple steps. This improves the result when there is more noise at larger timesteps (as t gets closer to T).

The overall idea of iterative denoising is that, given a noisy image x_t at a starting timestep t_start in the range [0, T], we can break the range [0, t_start] into equal strides and perform a denoising step over each stride. With this setup, the image becomes a bit less noisy after each iteration until it becomes clean at x_0.
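The strided schedule and a single update can be sketched as follows. The update used here is the standard DDPM posterior mean, which is an assumption about the exact rule; the added-variance term is omitted, and the function names are hypothetical.

```python
import numpy as np

def strided_timesteps(t_start, stride=30):
    # Timesteps from t_start down to 0, e.g. 690, 660, ..., 30, 0.
    return list(range(t_start, -1, -stride))

def denoise_step(x_t, eps_hat, abar_t, abar_prev):
    # One stride from timestep t to the less noisy timestep t'.
    alpha = abar_t / abar_prev
    beta = 1.0 - alpha
    # Current estimate of the clean image from the noise estimate.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    # Blend the clean estimate with the current noisy image.
    w_clean = np.sqrt(abar_prev) * beta / (1.0 - abar_t)
    w_noisy = np.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)
    return w_clean * x0_hat + w_noisy * x_t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)
abar_t = 0.2                                    # illustrative value
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
# Stepping straight to a fully clean level (abar_prev = 1) returns x0_hat.
x_prev = denoise_step(x_t, eps, abar_t, abar_prev=1.0)
```

In a full loop, each consecutive pair (t, t') from strided_timesteps would call the unet for eps_hat and then denoise_step.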

Campanile, iterative denoising with T = 999, stride = 30, t_start = 690

t = 690

t = 540

t = 390

t = 240

t = 90

Original

Iteratively Denoised

One-Step Denoised

Gaussian Blur Denoised

Note: Unless otherwise specified, all results in later sections use T = 999 and stride = 30.

Iterative Denoising Sampling & CFG

Sampling with Iterative Denoising

Using the iterative denoising setup but starting at a large timestep close to T, where the image is close to pure noise, we can generate random images through the sampling process. The initial image at t_start in this case is pure noise drawn from a normal distribution rather than a clean image with noise added.

Samples from iterative denoising with t_start = 990

Classifier-Free Guidance (CFG)

To improve the quality of generated images at the cost of reduced diversity, classifier-free guidance (CFG) can be added to the iterative denoising setup. Instead of taking a single noise estimate from the unet, we obtain both a conditional and an unconditional noise estimate and combine them into the final noise estimate. The CFG scaling factor gamma controls the strength of the guidance, and gamma > 1 results in higher quality images. The conditional noise estimate is the unet output for the prompt "a high quality photo", and the unconditional noise estimate is the unet output for the empty (null) prompt "".
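A minimal sketch of how the two estimates can be combined, assuming the usual CFG extrapolation e = e_uncond + gamma * (e_cond - e_uncond). The function name is hypothetical and the random arrays stand in for real unet outputs.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, gamma=7.0):
    # gamma = 0 gives the unconditional estimate, gamma = 1 the conditional
    # one, and gamma > 1 extrapolates past the conditional estimate.
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((3, 8, 8))    # stand-in for the prompted unet output
eps_uncond = rng.standard_normal((3, 8, 8))  # stand-in for the null-prompt unet output
eps = cfg_noise(eps_cond, eps_uncond, gamma=7.0)
```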

Samples from iterative denoising + CFG with gamma = 7.0, t_start = 990 (with upsampling)

Note: Unless otherwise specified, the "iterative denoising setup" in the following sections refers to iterative denoising with CFG at gamma = 7.0, so CFG is always applied.

Image-to-Image Translation

SDEdit

We can follow the SDEdit algorithm to edit a given image by adding a certain amount of noise to it and then denoising it with mostly the same iterative denoising with CFG setup as in the previous section. The more noise is added, the larger the edit, since iterative denoising runs over more iterations and has more room to deviate from the original clean image. When so much noise is added that the image is close to pure noise, the editing process generates mostly random images, similar to the previous section.

Campanile, SDEdit

original

t_start = 390

t_start = 690

t_start = 780

t_start = 840

t_start = 900

t_start = 960

Editing Web & Hand-Drawn Images

Using the SDEdit setup, we can edit images by adding a certain amount of noise and then denoising them. When not too much noise is added, the denoised outcome still retains some properties of the original image while also introducing new features.

Pikachu (Web), SDEdit (with upsampling)

original

t_start = 390

t_start = 690

t_start = 780

t_start = 840

t_start = 900

t_start = 960

Colored Cat (Drawn), SDEdit

original

t_start = 390

t_start = 690

t_start = 780

t_start = 840

t_start = 900

t_start = 960

Car (Drawn), SDEdit

original

t_start = 390

t_start = 690

t_start = 780

t_start = 840

t_start = 900

t_start = 960

Inpainting

Using the SDEdit setup, we can apply a binary mask at every iteration so that noising and denoising only happen within the selected region of the image, while the rest of the image remains unchanged.
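One common way to apply such a mask, sketched here as an assumption about the per-iteration step, is to keep the denoised content inside the mask and restore the original image (noised to the current timestep) everywhere else. The function name and the random stand-in arrays are hypothetical.

```python
import numpy as np

def apply_inpaint_mask(x_t, x_orig_noised, mask):
    # mask == 1: keep the newly denoised content; mask == 0: restore the original.
    return mask * x_t + (1.0 - mask) * x_orig_noised

rng = np.random.default_rng(0)
x_orig_noised = rng.standard_normal((3, 8, 8))  # stand-in: original noised to level t
x_t = rng.standard_normal((3, 8, 8))            # stand-in: current denoised iterate
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                         # square region to regenerate
x_new = apply_inpaint_mask(x_t, x_orig_noised, mask)
```

The mask has a single channel and broadcasts over the color channels, so the same region is regenerated in every channel.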

Campanile, masked SDEdit with t_start = 390

original

Mask

To Replace

Inpainted

Palace (Object Removal), masked SDEdit with t_start = 690

original

Mask

To Replace

Inpainted

Cat (Background Swap), masked SDEdit

original

Mask

To Replace

t_start = 390

t_start = 540

t_start = 690

t_start = 840

t_start = 960

Text-Conditional Image-to-Image Translation

In the SDEdit setup so far, the generic conditional prompt "a high quality photo" leads to unpredictable edits, since it amounts to sampling random images as the starting point approaches pure noise. We can instead use a more specific conditional prompt so that the unet denoises in a more specific direction, which leads to edits with a clear focus.

Campanile, text-conditional SDEdit with prompt = "a rocket ship"

original

t_start = 390

t_start = 690

t_start = 780

t_start = 840

t_start = 900

t_start = 960

Einstein, text-conditional SDEdit with prompt = "an oil painting of an old man"

original

t_start = 390

t_start = 690

t_start = 780

t_start = 840

t_start = 900

t_start = 960

Cat, text-conditional SDEdit with prompt = "a photo of a dog"

original

t_start = 390

t_start = 690

t_start = 780

t_start = 840

t_start = 900

t_start = 960

Constrained Image Generation

Visual Anagrams

Using the iterative denoising setup, the final noise estimate is the average of two estimates: the noise estimate of the image with prompt1, computed as e1 = unet(x_t, t, prompt1), and the noise estimate of the flipped image with prompt2, computed as e2 = flip(unet(flip(x_t), t, prompt2)). The generated images look like prompt1 in the normal orientation and like prompt2 when flipped upside down.
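The averaged estimate can be sketched as below. The toy_unet function is a hypothetical stand-in that only scales its input, used so the snippet runs without the real model; the flip is assumed to act along the image height axis.

```python
import numpy as np

def flip(img):
    # Upside-down flip along the height axis of a (C, H, W) array.
    return np.flip(img, axis=-2)

def anagram_noise(unet, x_t, t, prompt1, prompt2):
    e1 = unet(x_t, t, prompt1)               # estimate in the normal orientation
    e2 = flip(unet(flip(x_t), t, prompt2))   # estimate on the flipped image, flipped back
    return (e1 + e2) / 2.0

def toy_unet(x, t, prompt):
    # Stand-in for the real denoiser: scales the input depending on the prompt.
    return x * (1.0 if prompt == "p1" else 3.0)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
e = anagram_noise(toy_unet, x_t, 500, "p1", "p2")
```

Flipping e2 back before averaging keeps both estimates aligned with x_t, so the update is applied in the image's normal orientation.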

Anagrams (images on the right are the images on the left turned upside down, and follow prompt2)

prompt1 = "an oil painting of people around a campfire"

prompt2 = "an oil painting of an old man"

prompt1 = "an oil painting of people around a campfire"

prompt2 = "an oil painting of a snowy mountain village"

prompt1 = "a lithograph of waterfalls"

prompt2 = "a lithograph of a skull"

Hybrid Images

Using the iterative denoising setup, the final noise estimate is a frequency blend of two noise estimates e1 = unet(x_t, t, prompt1) and e2 = unet(x_t, t, prompt2), specifically e = lf(e1) + hf(e2), where lf is a simple Gaussian blur low-pass filter and hf is a high-pass filter. The generated images look like prompt1 when viewed from far away and like prompt2 when viewed up close.
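A minimal sketch of the frequency blend on single-channel arrays, assuming the high-pass filter is the residual hf(e) = e - lf(e). The kernel size, sigma, and function names are illustrative assumptions, and random arrays stand in for the two unet outputs.

```python
import numpy as np

def gaussian_blur(img, sigma=2.0, radius=6):
    # Separable Gaussian blur of an (H, W) array with reflect padding.
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()
    padded = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def hybrid_noise(e1, e2):
    low = gaussian_blur(e1)          # lf(e1): coarse structure from prompt1
    high = e2 - gaussian_blur(e2)    # hf(e2): fine detail from prompt2
    return low + high

rng = np.random.default_rng(0)
e1 = rng.standard_normal((32, 32))   # stand-in for unet(x_t, t, prompt1)
e2 = rng.standard_normal((32, 32))   # stand-in for unet(x_t, t, prompt2)
e = hybrid_noise(e1, e2)
```

With this choice of hf, the two filters sum to the identity, so feeding the same estimate for both prompts returns it unchanged.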

Hybrids with prompt1 = "a lithograph of a skull" (far), prompt2 = "a lithograph of waterfalls" (near) (with upsampling)

Hybrids with prompt1 = "an oil painting of people around a campfire" (far), prompt2 = "an oil painting of a snowy mountain village" (near) (with upsampling)

Hybrids with prompt1 = "a lithograph of waterfalls" (far), prompt2 = "an oil painting of a snowy mountain village" (near) (with upsampling)

Bells & Whistles

Image Blending

Using the iterative denoising setup, the final noise estimate is a weighted sum of the noise estimates e1 = unet(x_t, t, prompt1) and e2 = unet(x_t, t, prompt2), specifically e = e1 * R + e2 * (1 - R) with ratio R. This generates images that blend the two prompts, and R is tuned for improved results.

Blends with prompt1 = "a rocket ship", prompt2 = "a pencil", R = 0.35 (with upsampling)

Blends with prompt1 = "an oil painting of an old man", prompt2 = "a lithograph of a skull", R = 0.75 (with upsampling)

Symmetric Images

Using the iterative denoising setup, at each iteration a symmetric version x_sym_t of the image is obtained by setting the right half to be the mirrored left half of x_t. The updated x_t is a weighted sum of the original x_t and x_sym_t, given by x_t = x_t * (1 - sym_R) + x_sym_t * sym_R for a chosen ratio sym_R.

From my testing, the generated images have strange, noisy backgrounds when larger sym_R values are used (including sym_R = 1, where no blending takes place and x_t = x_sym_t directly). The weighted sum, especially with smaller sym_R values, seems to improve the results.

Since the symmetry constraint is simply applied over x_t at the end of each iteration, it can be easily appended to many previous procedures, including the visual anagrams to generate symmetric anagram images.
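The symmetry step can be sketched as follows, assuming images are stored as (C, H, W) arrays so that the last axis is the width. The function name is hypothetical.

```python
import numpy as np

def symmetrize(x_t, sym_r=0.25):
    # Mirror the left half onto the right half, then blend with the original.
    w = x_t.shape[-1]
    half = w // 2
    x_sym = x_t.copy()
    x_sym[..., w - half:] = np.flip(x_t[..., :half], axis=-1)
    return x_t * (1.0 - sym_r) + x_sym * sym_r

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
x_soft = symmetrize(x_t, sym_r=0.25)   # gentle pull toward symmetry
x_hard = symmetrize(x_t, sym_r=1.0)    # fully symmetric, no blending
```

Because the function only touches x_t, it can be called at the end of any iteration regardless of how the noise estimate was produced.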

Symmetric with prompt = "a photo of a man", sym_R = 0.25

Symmetric with prompt = "a lithograph of waterfalls", sym_R = 0.25

Symmetric Anagram with sym_R = 0.15

prompt1 = "an oil painting of people around a campfire"

prompt2 = "an oil painting of an old man"

Sequential Image Generation

Using the masked iterative denoising setup, I attempted to generate images to the left and right of a given image by shifting the original image to the left or right and filling the emptied area with random noise. In each iteration, the amount of shifting is set by step_size. After the new leftmost or rightmost patch is filled in, the image is shifted again, and the process repeats until the entire original image has been shifted out to the left or right and only generated content remains.

Overall, this simple method does seem to generate some cool results, but it takes a while to run, and the initially generated images are quite unclear and noisy, so I applied a quick one-step denoising to clean up the result.
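One shift of this procedure might look like the sketch below, which is an assumption about the details rather than the exact implementation used here. It shifts the content left by step_size columns, fills the freed strip with noise, and returns a mask marking that strip for the masked denoising pass. The function name is hypothetical.

```python
import numpy as np

def shift_and_mask(img, step_size, rng):
    # Shift content left by step_size columns; the freed right strip is noise.
    shifted = np.empty_like(img)
    shifted[..., :-step_size] = img[..., step_size:]
    shifted[..., -step_size:] = rng.standard_normal(img[..., -step_size:].shape)
    mask = np.zeros_like(img)
    mask[..., -step_size:] = 1.0   # 1 marks the strip to be generated
    return shifted, mask

rng = np.random.default_rng(0)
img = rng.uniform(-1.0, 1.0, size=(3, 64, 64))   # stand-in for the original image
shifted, mask = shift_and_mask(img, step_size=4, rng=rng)
```

Generating content on the other side would mirror this by shifting right and masking the leftmost strip.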

Sequential Generation with step_size = 4

Generated Left

Original

Generated Right

Generating left image

Part B: Diffusion Models from Scratch

Training Denoising Unet

Training

Model: The unet model is implemented in PyTorch and follows a simple encoder-decoder architecture with hidden dimension D = 128.

Dataset: The MNIST dataset is used to train the denoiser. For each image X, a certain amount of random noise is added to form the noisy image X_n, and the pair (X_n, X) is used for training. X_n is the input to the unet, which outputs a clean image X_pred that is compared against the original clean image X as the label to compute the loss. The L2 loss function is used.

As shown in the chart below, different amounts of noise, specified by sigma between 0 and 1, are added to the MNIST images. sigma = 0.0 gives the original clean MNIST images, and the resulting images become increasingly noisy as sigma goes from 0 to 1.
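The noising step can be sketched as X_n = X + sigma * eps with eps drawn from a standard normal distribution; this additive form is an assumption, since the exact formula is not stated here. The function name and the zero-valued stand-in batch are hypothetical.

```python
import numpy as np

def add_noise(x, sigma, rng):
    # Additive Gaussian noise with standard deviation sigma.
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = np.zeros((64, 28, 28))   # stand-in batch with MNIST-sized images
noisy = {sigma: add_noise(x, sigma, rng) for sigma in (0.0, 0.2, 0.5, 1.0)}
```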

Training Results: For this section, we train a simple unet using only sigma = 0.5 for all MNIST images. This should result in a unet that is good at denoising images at a constant noise level. Training runs for 5 epochs with a batch size of 256 and the Adam optimizer with a learning rate of 1e-4.

The training loss plot is shown below.

Sampled denoising results after 1 epoch:

Sampled denoising results after 5 epochs:

Out-of-Distribution Testing

As the denoiser is only trained for sigma = 0.5, we can also test the performance of the unet model on other sigma values to see how well the model performs generally.

The results are shown below: for each sigma value, the top row shows the noisy images and the bottom row shows the corresponding denoised images.

From the results, we can see that the unet performs well for all sigma values less than or equal to 0.5, that is, for images no noisier than its training data. For noisier images with sigma > 0.5, the output is less clear but still surprisingly good overall.

Training Diffusion Model

Adding Time-Conditioning to Unet

The overall setup aims to implement DDPM to create a diffusion model from the unet.

Model: To better generalize the unet to different noise levels and make it work as part of an actual diffusion model, the time scalar t is injected into the network to enable time-conditioning. In addition, the unet is modified to output the predicted noise instead of the predicted clean image, which follows the DeepFloyd unet setup. The modified unet functions as noise_est = unet(x_t, t) given a noisy image x_t at noise level t. The hidden dimension used for the unet is D = 64.

Dataset: Given an MNIST image, random noise e is added to the image to create the noisy image x_t, with the amount of noise based on t. x_t and t are fed into the network to obtain the predicted noise e_pred, and the loss is computed as the L2 loss between e and e_pred.

Training Results: Training runs for 20 epochs with a batch size of 128, the Adam optimizer with an initial learning rate of 1e-3, and an exponential learning rate decay scheduler with gamma = 0.1^(1/num_epochs).
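As a quick check of what this decay factor does, the sketch below multiplies the learning rate by gamma once per epoch, which is the behavior of PyTorch's torch.optim.lr_scheduler.ExponentialLR when stepped once per epoch. Stepping once per epoch is an assumption about the training loop.

```python
num_epochs = 20
initial_lr = 1e-3
gamma = 0.1 ** (1.0 / num_epochs)

lr = initial_lr
schedule = []
for epoch in range(num_epochs):
    schedule.append(lr)   # learning rate used during this epoch
    lr *= gamma           # decay applied once at the end of the epoch
final_lr = lr             # learning rate after the last decay step
```

With this choice of gamma, the learning rate falls by exactly a factor of 10 over the full run, from 1e-3 to about 1e-4.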

The training loss plot is shown below.

Sampled results after different numbers of epochs during training (with T = 300, stride = 1, no CFG):

epoch = 1

gif

epoch = 5

gif

epoch = 10

gif

epoch = 20

gif

Adding Class-Conditioning to Unet

Model: For better results and more control for image generation, we can add class-conditioning to the unet by injecting class c from 0 to 9 (with one-hot encoding) in a way similar to adding time-conditioning. The unet now functions as noise_est = unet(x_t, t, c) given noisy image x_t, noise level t, and class c.

Dataset: Given an MNIST image and its label c, random noise e is added to the image to create the noisy image x_t, with the amount of noise based on t. x_t, t, and c are fed into the network to obtain the predicted noise e_pred, and the loss is computed as the L2 loss between e and e_pred.

Training Results: Training runs for 20 epochs with a batch size of 128, the Adam optimizer with an initial learning rate of 1e-3, and an exponential learning rate decay scheduler with gamma = 0.1^(1/num_epochs).

The training loss plot is shown below.

Sampled results after different numbers of epochs during training (with T = 300, stride = 1, and CFG with gamma = 5.0):

epoch = 1

gif

epoch = 5

gif

epoch = 10

gif

epoch = 20

gif