
Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand ✍︎ | by Srijanie Dey, PhD | Apr, 2024

This is a story I told my son about a mythical dragon that lived in a faraway land. We called it ‘The Legend of Sora’. He really enjoyed it because Sora is big and strong, and illuminated the sky. Now of course, he doesn’t understand the idea of transformers and diffusion yet (he’s only four), but he does understand the idea of a magnanimous dragon that uses the power of light and rules over DiTharos.

Image by author (The powerful Sora by my son — the color choices and the bold strokes are all his work.)

And that story very closely resembles how our world’s Sora, OpenAI’s text-to-video model, emerged in the realm of AI and took the world by storm. In principle, Sora is a diffusion transformer (DiT), an architecture developed by William Peebles and Saining Xie in 2023.

In other words, it uses the idea of diffusion for predicting the videos and the strength of transformers for next-level scaling. To understand this further, let’s try to find the answer to these two questions:

  • What does Sora do when given a prompt to work on?
  • How does it combine the diffusion-transformer ideas?

Talking about the videos made by Sora, here is my favorite one of an adorable Dalmatian in the streets of Italy. How natural is its movement!

The prompt used for the video : “The camera directly faces colorful buildings in Burano Italy. An adorable dalmation looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings.”

How did Sora do this?

Without any further ado, let’s dive into the details and look at how Sora creates these super-realistic videos based on text-prompts.

Thanks once again to Prof. Tom Yeh’s wonderful AI by Hand Series, we have this great piece on Sora for our discussion. (All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn post, which I have edited with his permission.)

So, here we go:

Our goal — Generate a video based on a text-prompt.

We are given:

  • Training video
  • Text-prompt
  • Diffusion step t = 3

For our example, can you guess what our text-prompt is going to be? You are right. It is “Sora is sky”. A diffusion step of t = 3 means we are adding noise, or diffusing the model, in three steps; for illustration, we will stick to a single step in this example.

What is diffusion?

Diffusion mainly refers to the phenomenon of scattering of particles: think of how we enjoy the soft sun rays peeking out from behind the clouds. This soft glow can be attributed to the scattering of sunlight as it passes through the cloud layer, causing the rays to spread out in different directions.

The random motion of the particles drives this diffusion. And that is exactly what happens for diffusion models used in image generation. Random noise is added to the image causing the elements in the image to deviate from the original and thus making way for creating more refined images.

As we talk about diffusion with regard to image models, the key idea to remember is ‘noise’.

The process begins here:

[1] Convert video into patches

When working with text-generation, the models break down the large corpus into small pieces called tokens and use these tokens for all the calculations. Similarly, Sora breaks down the video into smaller elements called visual patches to make the work simpler.

Since we are talking about a video, we are talking about images in multiple frames. In our example, we have four frames. Each of the four frames, or matrices, contains the pixels that create the image.

The first step here is to convert this training video into 4 spacetime patches as below:
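As a rough sketch of this step, assume each of the four frames is a 2×2 single-channel image and each spacetime patch covers exactly one frame (2×2×1). The pixel values and the helper name `to_patches` are made up for illustration and are not taken from the figure.

```python
import numpy as np

def to_patches(video):
    """Flatten each frame of a (frames, height, width) video into one column."""
    frames = video.shape[0]
    # Each frame becomes a vector of height*width pixels; columns index patches.
    return video.reshape(frames, -1).T

# Hypothetical 4-frame video, each frame 2x2 pixels (illustrative values only).
video = np.arange(16).reshape(4, 2, 2)
patches = to_patches(video)
print(patches.shape)  # 4 pixels per patch, 4 patches
```

With this layout, every column of `patches` holds the four pixels of one patch, which gives the 4×4 patch matrix used in the next step.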

[2] Reduce the dimension of these visual patches : Encoder

Next, dimension reduction. The idea of dimension reduction has existed for over a century now (Trivia : Principal Component Analysis, also known as PCA was introduced by Karl Pearson in 1901), but its significance hasn’t faded over time.

And Sora uses it too!

When we talk about neural networks, one of the fundamental ideas for dimension reduction is the encoder. An encoder, by design, transforms high-dimensional data into a lower-dimensional representation by focusing on capturing the most relevant features of the data. Win-win on both sides: it increases the efficiency and speed of the computations while the algorithm gets useful data to work on.

Sora uses the same idea for converting the high-dimensional pixels into a lower-dimensional latent space. To do so, we multiply the patches with weights and biases, followed by ReLU.


Linear transformation: The input patch vector is multiplied by the weight matrix W and then added to the bias vector b,

z = Wx+b, where W is the weight matrix, x is our patch vector and b is the bias vector.

ReLU activation function: Next, we apply the ReLU to this intermediate z.

ReLU returns the element-wise maximum of the input and zero. Mathematically, h = max{0,z}.

  • The weight matrix here is a 2×4 matrix [ [1, 0, -1, 0], [0, 1, 0, 1] ] with the bias being [0,1].
  • The patches matrix here is 4×4.

The product of the transpose of the weight matrix W and bias b with the patches followed by ReLU gives us a latent space which is only a 2×4 matrix. Thus, by using the visual encoder the dimension of the ‘model’ is reduced from 4 (2x2x1) to 2 (2×1).
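Here is a minimal sketch of the encoder step, using the weight matrix and bias given above. The patch values are hypothetical (the figure's values are not reproduced here), and I am assuming each column of the patch matrix is one patch.

```python
import numpy as np

# Weights and bias from the text: a 2x4 matrix and a bias of [0, 1].
W = np.array([[1, 0, -1, 0],
              [0, 1, 0, 1]])
b = np.array([0, 1])

# Hypothetical 4x4 patch matrix; each column is one flattened patch.
patches = np.array([[3, 2, 0, 1],
                    [1, 0, 2, 1],
                    [2, 0, 1, 3],
                    [0, 1, 1, 0]])

z = W @ patches + b[:, None]   # linear transformation z = Wx + b, per patch
latent = np.maximum(0, z)      # ReLU: h = max{0, z}
print(latent)
# [[1 2 0 0]
#  [2 2 4 2]]
```

Each patch goes from 4 numbers to 2, so the whole 4×4 patch matrix shrinks to a 2×4 latent.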

In the original DiT paper, this reduction is from 196,608 (256x256x3) to 4096 (32x32x4), which is huge. Imagine working with 196,608 pixels against working with 4096 — a 48 times reduction!
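The arithmetic behind that factor is quick to check:

```python
pixel_dim = 256 * 256 * 3    # height x width x color channels
latent_dim = 32 * 32 * 4     # latent height x width x channels
ratio = pixel_dim // latent_dim
print(pixel_dim, latent_dim, ratio)  # 196608 4096 48
```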

Right after this dimension reduction, we have one of the most significant steps in the entire process — diffusion.

[3] Diffuse the model with noise

To introduce diffusion, we add sampled noise to the obtained latent features in the previous step to find the Noised Latent. The goal here is to ask the model to detect what the noise is.

This is in essence the idea of diffusion for image generation.

By adding noise to the image, the model is asked to guess what the noise is and what it looks like. In return, the model can generate a completely new image based on what it guessed and learnt from the noisy image.

This is analogous to deleting a word from a sentence and asking a language model to guess what the deleted word was.
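In code, the noising step of this hand exercise is just an addition. The sketch below uses a hypothetical latent and draws the noise from a standard normal distribution. Real diffusion models also weight the latent and the noise according to a noise schedule, which this simplified example leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical 2x4 latent from the encoder step.
latent = np.array([[1.0, 2.0, 0.0, 0.0],
                   [2.0, 2.0, 4.0, 2.0]])

noise = rng.standard_normal(latent.shape)  # sampled noise (the ground truth)
noised_latent = latent + noise             # what the model gets to see
```

The model is later trained to recover `noise` from `noised_latent`.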

Now that the training video has been reduced and diffused with noise, the next steps are to make use of the text-prompt to get a video as advocated by the prompt. We do this by conditioning with the adaptive norm layer.

[4]-[6] Conditioning by Adaptive Norm Layer

What ‘conditioning’ essentially means is that we try to influence the behavior of the model using the additional information we have available. For example, since our prompt is ‘Sora is sky’, we would like the model to focus on elements such as sky or clouds rather than attaching importance to other concepts like a hat or a plant. Thus, an adaptive norm layer massages the data or, in better terms, dynamically scales and shifts the data in the network based on the input it receives.

What is scale and shift?

Scale occurs when we multiply. For example, we may start with a variable A. If we multiply it by 2, we get 2*A, which amplifies or scales the value of A up by a factor of 2. If we multiply it by ½, the value is scaled down by half.

Shift is denoted by addition. For example, we may be walking on the number line. We start at 1 and we are asked to shift to 5. What do we do? We can either add 4 and get 1+4=5, or we could add a hundred 0.04s to get to 5, 1+(100*0.04)=5. It all depends on whether we want to take bigger steps (4) or smaller steps (0.04) to reach our goal.

[4] Encode Conditions

To make use of the conditions, in our case the information we have for building the model, first we translate it into a form the model understands, i.e., vectors.

  • The first step in the process is to translate the prompt into a text embedding vector.
  • The next step is to translate step t = 3 into a binary vector.
  • The third step is to concatenate these vectors together.
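The three steps above can be sketched as follows. The text embedding values, the 2-bit width and the helper name `step_to_binary` are assumptions for illustration; a real system would get the embedding from a trained text encoder.

```python
import numpy as np

def step_to_binary(t, width=2):
    """Encode diffusion step t as a fixed-width binary vector (most significant bit first)."""
    return np.array([(t >> i) & 1 for i in reversed(range(width))])

# Hypothetical text embedding for the prompt "Sora is sky".
text_embedding = np.array([0, 1, 1])

t_vector = step_to_binary(3)                             # 3 in binary is 11
condition = np.concatenate([text_embedding, t_vector])   # prompt and step together
print(condition)  # [0 1 1 1 1]
```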

[5] Estimate Scale/Shift

Remember that here we use an ‘adaptive’ layer norm which implies that it adapts its values based on what the current conditions of the model are. Thus, to capture the correct essence of the data, we need to include the importance of each element in the data. And it is done by estimating the scale and shift.

For estimating these values for our model, we multiply the concatenated vector of prompt and diffusion step with the weight and add the bias to it. These weights and biases are learnable parameters which the model learns and updates.

(Remark: The third element in the resultant vector, according to me, should be 1. It could be a small error in the original post but as humans we are allowed a bit of it, aren’t we? To maintain uniformity, I continue here with the values from the original post.)

The goal here is to estimate the scale [2,-1] and the shift [-1,5] (since our model size is 2, we have two scale and two shift parameters). We keep them under ‘X’ and ‘+’ respectively.
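A small sketch of this estimation, under loud assumptions: the condition vector, the weights and the biases below are hypothetical, chosen only so that the output lands on the scale [2, -1] and shift [-1, 5] quoted above. They are not the values from the original figure.

```python
import numpy as np

# Hypothetical concatenated condition vector (text embedding + binary step).
condition = np.array([0, 1, 1, 1, 1])

# Hypothetical learnable weights (4x5) and biases (4): two rows for scale, two for shift.
W = np.array([[0, 1, 0, 1, 0],
              [0, 0, -1, 0, 0],
              [1, 0, 0, 0, -1],
              [0, 1, 1, 1, 1]])
b = np.array([0, 0, 0, 1])

out = W @ condition + b          # one linear layer on the condition vector
scale, shift = out[:2], out[2:]  # first half scales, second half shifts
print(scale, shift)  # [ 2 -1] [-1  5]
```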

[6] Apply Scale/Shift

To apply the scale and shift obtained in the previous step, we multiply the noised latent in Step 3 by [2, -1] and shift it by adding [-1,5].

The result is the ‘conditioned’ noised latent.
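Applying the scale and shift is a row-wise multiply and add. In this sketch the scale [2, -1] and shift [-1, 5] come from the text, while the noised latent values are hypothetical.

```python
import numpy as np

scale = np.array([2, -1])
shift = np.array([-1, 5])

# Hypothetical 2x4 noised latent; each row is one latent feature.
noised_latent = np.array([[1, 2, 0, 1],
                          [3, 0, 1, 2]])

# Broadcast so each feature row gets its own scale and shift.
conditioned = noised_latent * scale[:, None] + shift[:, None]
print(conditioned)
# [[ 1  3 -1  1]
#  [ 2  5  4  3]]
```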

[7]-[9] Transformer

The last three steps consist of adding the transformer element to the above diffusion and conditioning steps. These steps help us find the noise as predicted by the model.

[7] Self-Attention

This is the critical idea behind transformers that makes them so phenomenal!

What is self-attention?

It is a mechanism by which each word in a sentence analyzes every other word and measures how important they are to each other, making sense of the context and relationships in the text.

To enable self-attention, the conditioned noise latent is fed into the Query-Key function to obtain a self-attention matrix. The QK-values are omitted here for simplicity.
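Since the QK-values are omitted in the exercise, here is one possible sketch of how such a matrix could be computed with standard scaled dot-product attention. The identity query and key weights, the input values and the function name `self_attention_matrix` are all assumptions for illustration.

```python
import numpy as np

def self_attention_matrix(x, w_q, w_k):
    """Return a (patches x patches) attention matrix whose columns each sum to 1."""
    q = w_q @ x                                # queries, one column per patch
    k = w_k @ x                                # keys, one column per patch
    scores = k.T @ q / np.sqrt(q.shape[0])     # scores[i, j]: patch j attending to patch i
    scores = scores - scores.max(axis=0, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=0, keepdims=True)  # softmax over each column

# Hypothetical conditioned noised latent (2 features x 4 patches) and weights.
x = np.array([[1.0, 3.0, -1.0, 1.0],
              [2.0, 5.0, 4.0, 3.0]])
attention = self_attention_matrix(x, np.eye(2), np.eye(2))
print(attention.shape)  # (4, 4)
```

Normalizing over columns is a layout choice made here so that the matrix can multiply the latent from the right.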

[8] Attention Pooling

Next, we multiply the conditioned noised latent with the self-attention matrix to obtain the attention weighted features.
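As a sketch, with a hypothetical latent and a hand-picked attention matrix whose columns each sum to 1, the pooling is a single matrix product:

```python
import numpy as np

# Hypothetical conditioned noised latent: 2 features x 4 patches.
conditioned = np.array([[1.0, 3.0, -1.0, 1.0],
                        [2.0, 5.0, 4.0, 3.0]])

# Hypothetical 4x4 attention matrix; column j says how to mix patches for output j.
attention = np.array([[0.5, 0.0, 0.0, 0.0],
                      [0.5, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.5],
                      [0.0, 0.0, 0.0, 0.5]])

features = conditioned @ attention  # attention-weighted features
print(features)
# [[ 2.   3.  -1.   0. ]
#  [ 3.5  5.   4.   3.5]]
```

For example, the first output patch is the average of the first two input patches, because the first column puts weight 0.5 on each of them.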

[9] Point-wise Feed Forward Network

Once again returning to the basics, we multiply the attention-weighted features with weights and biases to obtain the predicted noise.
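A minimal sketch of this layer follows, with hypothetical weights, biases and input features. ‘Point-wise’ here means the same weights are applied to every patch (column) independently.

```python
import numpy as np

# Hypothetical attention-weighted features: 2 features x 4 patches.
features = np.array([[2.0, 3.0, -1.0, 0.0],
                     [3.0, 5.0, 4.0, 3.0]])

# Hypothetical learnable weights (2x2) and biases (2).
w = np.array([[1.0, -1.0],
              [0.0, 1.0]])
b = np.array([0.0, -1.0])

predicted_noise = w @ features + b[:, None]  # same layer applied to each patch
print(predicted_noise)
# [[-1. -2. -5. -3.]
#  [ 2.  4.  3.  2.]]
```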


The last bit now is to train the model using the Mean Squared Error between the predicted noise and the sampled noise (ground truth).

[10] Calculate the MSE loss gradients and update learnable parameters

Using the MSE loss gradients, we use backpropagation to update all the parameters that are learnable (e.g., the weights and biases in the adaptive norm layer).
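To make the loss concrete, here is a small sketch that computes the MSE and its gradient with respect to the predicted noise. The input values and the function name `mse_loss_and_grad` are hypothetical; backpropagation would then carry this gradient further back to the learnable weights and biases.

```python
import numpy as np

def mse_loss_and_grad(predicted, target):
    """Mean squared error and its gradient with respect to `predicted`."""
    diff = predicted - target
    loss = np.mean(diff ** 2)
    grad = 2.0 * diff / diff.size  # derivative of the mean of squared differences
    return loss, grad

# Hypothetical predicted noise and sampled noise (ground truth).
predicted = np.array([[1.0, 0.0], [2.0, -1.0]])
sampled = np.array([[0.0, 0.0], [2.0, 1.0]])

loss, grad = mse_loss_and_grad(predicted, sampled)
print(loss)  # 1.25
```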

The encoder and decoder parameters are frozen and not learnable.

(Remark: The second element in the second row should be -1, a tiny error which makes things better).

[11]-[13] Generate New Samples

[11] Denoise

Now that we are ready to generate new videos (yay!), we first need to remove the noise we had introduced. To do so, we subtract the predicted noise from the noised latent to obtain the noise-free latent.

Mind you, this is not the same as our original latent. The reason is that we went through multiple conditioning and attention steps in between, which brought the context of our problem into the model and gave it a better feel for what its target should be while generating the video.
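The denoising step of the hand exercise is a plain subtraction. The values below are hypothetical.

```python
import numpy as np

# Hypothetical noised latent and the noise predicted by the transformer.
noised_latent = np.array([[1.0, 3.0], [2.0, 5.0]])
predicted_noise = np.array([[0.5, 1.0], [-1.0, 2.0]])

noise_free_latent = noised_latent - predicted_noise
print(noise_free_latent)
# [[0.5 2. ]
#  [3.  3. ]]
```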

[12] Convert the latent space back to the pixels : Decoder

Just like we did for the encoder, we multiply the latent space patches with weights and biases, followed by ReLU. We can observe here that after the work of the decoder, the model is back to the original dimension of 4, which was lowered to 2 when we used the encoder.
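A sketch of the decoder mirrors the encoder, only with a 4×2 weight matrix so that each patch goes from 2 numbers back to 4. The weights, biases and latent values here are hypothetical.

```python
import numpy as np

# Hypothetical decoder weights (4x2) and biases (4).
W = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [1, -1]])
b = np.array([0, 0, 0, 1])

# Hypothetical noise-free latent: 2 features x 4 patches.
latent = np.array([[1, 2, 0, 0],
                   [2, 2, 4, 2]])

pixels = np.maximum(0, W @ latent + b[:, None])  # linear layer followed by ReLU
print(pixels)
# [[1 2 0 0]
#  [2 2 4 2]
#  [3 4 4 2]
#  [0 1 0 0]]
```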

[13] Time for the video!

The last step is to arrange the result from the above matrix into a sequence of frames which finally gives us our new video. Hooray!
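As a sketch of this rearrangement, assume again that each column of the decoded matrix is one 2×2×1 patch covering a whole frame. The values and the helper name `to_frames` are made up for illustration.

```python
import numpy as np

def to_frames(patches, height=2, width=2):
    """Turn a (pixels x patches) matrix back into (frames, height, width)."""
    # Each column is one flattened frame, so transpose first and then unflatten.
    return patches.T.reshape(-1, height, width)

# Hypothetical decoded 4x4 matrix; column j holds the pixels of frame j.
patches = np.arange(16).reshape(4, 4).T
video = to_frames(patches)
print(video.shape)  # (4, 2, 2): four frames of 2x2 pixels
```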

And with that we come to the end of this supremely powerful technique. Congratulations, you have created a Sora video!

To summarize all that was said and done above, here are the 5 key points:

  1. Converting the videos into visual patches and then reducing their dimension is essential. A visual encoder is our friend here.
  2. As the name suggests, diffusion is the name of the game in this method. Adding noise to the video and then working with it at each of the subsequent steps (in different ways) is what this technique relies on.
  3. Next up is the transformer architecture that enhances the abilities of the diffusion process along with amplifying the scale of the model.
  4. Once the model is trained and ready to generate, the two D’s, the denoiser and the decoder, come in handy: one removes the noise and the other projects the low-dimensional space back to its original dimension.
  5. Finally, the resultant pixels from the decoder are rearranged to generate the desired video.

(Once you are done with the article, I suggest you read the story at the beginning once more. Can you spot the similarities between Sora of DiTharos and Sora of our world?)

Given the kind of videos Sora has been able to produce, it is worth saying that the Diffusion-Transformer duo is lethal. Along with it, the idea of visual patches opens up an avenue for tinkering with a range of image resolutions, aspect ratios and durations, which allows for utmost experimentation.

Overall, it would not be wrong to say that this idea is seminal and without a doubt is here to stay. According to this New York Times article, Sora was named after the Japanese word for sky to evoke the idea of limitless potential. And having witnessed its initial promise, it is true that Sora has definitely set a new frontier in AI. Now it remains to be seen how well it stands the test of safety and time.

As the legend of DiTharos goes — “Sora lives on, honing its skills and getting stronger with each passing day, ready to fly when the hour is golden!”

P.S. If you would like to work through this exercise on your own, here is a blank template for you to use.

Blank Template for hand-exercise

Now go have some fun with Sora in the land of ‘DiTharos’!
