In February, OpenAI stunned the world with its latest AI model, Sora. The Sam Altman-led company’s offering can take natural-language prompts and generate minute-long videos in high definition. The model, which arrived after Runway’s Gen-2 and Google’s Lumiere, showcased breathtaking video-generation capabilities that could potentially upend filmmaking in the future.
At present, two kinds of models are fuelling AI innovation – transformers and diffusion models. These architectures have redefined the landscape of machine learning, a subset of AI. Transformer-based models have radically changed how machine learning systems engage with text, both in classifying it and in generating it. Diffusion models, on the other hand, have become the preferred choice for AI that generates images.
It needs to be noted that diffusion models take their name from the physical process of diffusion, which is essentially the spreading of particles from a denser region to a less dense one. Sora is not a large language model (LLM); it is a diffusion transformer model. In this article, we will look at what a diffusion transformer model is, and how it differs from other AI models.
What is a Diffusion transformer?
Diffusion transformer, also written as DiT, is essentially a class of diffusion models built on the transformer architecture. DiT was developed in 2023 by William Peebles, then at UC Berkeley and now a research scientist at OpenAI, and Saining Xie at New York University. It aims to improve the performance of diffusion models by replacing the commonly used U-Net backbone (an architecture employed in diffusion models for iterative image denoising) with a transformer.
Let’s simplify this – imagine you have a big jigsaw puzzle to solve, but you don’t know what the whole picture looks like. So you try to figure it out one piece at a time. DiT is like a special way of solving this puzzle. Usually, a U-Net is used to solve it, but DiT uses something called a transformer instead. One can think of the U-Net as a way to organise and understand the puzzle pieces; however, it may not be the best tool all the time. In simple words, DiT is like a new and improved tool for solving big puzzles, for instance, understanding complicated pictures or data.
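The core swap described above – a transformer working on patch "puzzle pieces" instead of a U-Net – can be sketched in a few lines. This is a toy numpy illustration of the idea (patchify an image into tokens, then let every token attend to every other), not OpenAI’s or the DiT authors’ actual implementation; all function names here are made up for the example.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) image into flattened patch tokens -- the
    'puzzle pieces' a transformer backbone operates on (toy sketch)."""
    H, W = image.shape
    tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            tokens.append(image[i:i + patch, j:j + patch].ravel())
    return np.stack(tokens)  # shape: (num_tokens, patch * patch)

def self_attention(tokens):
    """Single-head self-attention with identity projections (toy):
    each token becomes a weighted mix of all tokens."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

image = np.arange(16.0).reshape(4, 4)
tokens = patchify(image, 2)        # 4 tokens, each a flattened 2x2 patch
mixed = self_attention(tokens)     # every patch 'looks at' every other patch
print(tokens.shape, mixed.shape)   # (4, 4) (4, 4)
```

The key contrast with a U-Net is that attention is global from the start: every patch can exchange information with every other patch in one step, rather than through a hierarchy of convolutions.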
When it comes to Sora, DiT combines the concept of diffusion for predicting videos with the strength of transformers for next-level scaling. This can be broken down into two questions – what happens inside Sora after you give it a prompt, and how does it employ the diffusion transformer?
How does it all translate into videos?
Based on a LinkedIn post by Professor Tom Yeh of the University of Colorado Boulder, here we attempt to simplify the journey from prompt to video. Let’s imagine you enter the prompt ‘Sora is sky’. Sora splits a related video (from its dataset) into small parts called patches, similar to breaking it down into smaller puzzle pieces. Each patch is then turned into a simpler version – like a summary – which helps the model understand the video better.
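The patching and summarising steps can be sketched as follows. This is a toy numpy version under stated assumptions: `video_to_patches` stands in for Sora’s spacetime patching, and `summarise` uses a fixed random projection as a stand-in for the learned encoder that compresses each patch. The function names and sizes are invented for illustration.

```python
import numpy as np

def video_to_patches(video, t_size, p_size):
    """Split a (T, H, W) video into flattened spacetime patches --
    the 'puzzle pieces' described above (toy sketch)."""
    T, H, W = video.shape
    patches = []
    for t in range(0, T, t_size):
        for i in range(0, H, p_size):
            for j in range(0, W, p_size):
                patches.append(
                    video[t:t + t_size, i:i + p_size, j:j + p_size].ravel()
                )
    return np.stack(patches)

def summarise(patches, dim):
    """'Summarise' each patch into a shorter vector via a fixed random
    projection -- a stand-in for a learned encoder, not the real thing."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((patches.shape[1], dim))
    return patches @ proj

video = np.zeros((4, 8, 8))               # 4 frames of 8x8 pixels
patches = video_to_patches(video, 2, 4)   # 2 x 2 x 2 = 8 spacetime patches
latents = summarise(patches, 16)          # each 32-value patch -> 16 numbers
print(patches.shape, latents.shape)
```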
In the next step, some random elements (noise) are added to the summarised parts. Then comes the conditioning stage, where the prompt ‘Sora is sky’ is turned into numbers and mixed in. This is what lets the model adjust the video based on the prompt. Next, the model uses an attention function to focus on different parts of the video and figure out what’s important.
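The noise-adding and conditioning steps above can be sketched concretely. The noise step below is the standard forward-diffusion blend (signal scaled by √ᾱ, noise by √(1−ᾱ)); the prompt "embedding" is just a hash-seeded random vector standing in for a learned text encoder, and the simple addition at the end is a deliberately crude stand-in for real conditioning. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(latent, alpha_bar):
    """Forward diffusion: blend the clean latent with Gaussian noise.
    alpha_bar near 1 keeps mostly signal; near 0 it is mostly noise."""
    noise = rng.standard_normal(latent.shape)
    noisy = np.sqrt(alpha_bar) * latent + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise

def embed_prompt(prompt, dim=8):
    """Turn the prompt into numbers -- here a hash-seeded random vector,
    a stand-in for a learned text encoder (not a real one)."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

latent = np.ones(8)                        # a 'summarised' video patch
noisy, noise = add_noise(latent, alpha_bar=0.9)
cond = embed_prompt("Sora is sky")
conditioned = noisy + cond                 # crude conditioning: mix prompt in
print(conditioned.shape)
```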
Later, in the attention-pooling stage, the model focuses on the important parts of the video based on the prompt and the added noise. Using all this information, the model tries to guess what the noise looks like in different parts of the video. It pays attention to all the key details, combines them, and makes its prediction. If the guess isn’t perfect, Sora learns from its mistakes and tries to do better. Finally, in the last stage, Sora reveals the finished video without all the extra noise, making it look smooth and clear.
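The final "remove the extra noise" step has a neat closed form worth seeing. In a DDPM-style setup, if the model’s noise guess were perfect, a single rearrangement of the forward-diffusion equation recovers the clean latent exactly. This toy numpy sketch (names invented for illustration) shows that identity; the real model, of course, only approximates the noise and denoises over many steps.

```python
import numpy as np

def denoise_step(x_t, predicted_noise, alpha_bar):
    """Estimate the clean latent from a noisy one, given the model's
    noise guess -- the forward blend solved for x0 (toy DDPM-style step)."""
    return (x_t - np.sqrt(1 - alpha_bar) * predicted_noise) / np.sqrt(alpha_bar)

rng = np.random.default_rng(1)
x0 = np.ones(4)                       # the 'true' clean latent
noise = rng.standard_normal(4)
alpha_bar = 0.8
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * noise  # noisy latent

# With a perfect noise prediction, one step recovers the clean latent.
x0_hat = denoise_step(x_t, noise, alpha_bar)
print(np.allclose(x0_hat, x0))        # True
```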
In simple words, DiT helps Sora understand text prompts and make cool videos by breaking them down into smaller parts, adding a bit of randomness, and then cleaning things up based on the text.
Advantages of DiT
DiT deploys transformers in a latent diffusion process, where noise is gradually transformed into the target image by reversing the diffusion process under the guidance of a transformer network. A key aspect of DiT is the concept of diffusion timesteps. To simplify: DiT is a tool that helps you make pictures. It works by using transformers to change a simple picture, bit by bit, into something you want – think of it as cleaning up a blurry image step by step. The diffusion timesteps act like checkpoints. At each checkpoint, DiT looks at what the picture looks like and decides how to make it better. In simple words, it is like the different stages of cooking – you add different spices at different times.
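The "checkpoints" can be made concrete with a noise schedule. Below is a standard linear beta schedule sketched in numpy (the specific numbers are illustrative, not the ones any particular model uses): at each timestep a little more noise is assumed, so the cumulative signal fraction ᾱ shrinks from nearly 1 towards 0.

```python
import numpy as np

# Linear beta schedule: beta_t is the noise added at timestep t, and
# alpha_bar_t = prod(1 - beta_i) is how much original signal survives.
T = 10
betas = np.linspace(1e-4, 0.2, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# Each timestep is a 'checkpoint' with progressively less signal left.
for t in [0, T // 2, T - 1]:
    print(f"step {t}: signal fraction = {alpha_bar[t]:.3f}")
```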
When it comes to scalability, DiT can handle larger input data without sacrificing performance, which requires efficient resource usage while maintaining sample quality. In natural-language tasks, for example, input size can vary widely, and a scalable DiT should handle this variation without performance loss. As data volume grows, DiT’s ability to scale will be key.
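One reason the transformer backbone scales gracefully: the same weights apply to any number of patch tokens, so larger or longer inputs simply mean more tokens. A back-of-the-envelope sketch (patch sizes here are assumptions for illustration, not Sora’s actual values):

```python
def num_tokens(frames, height, width, t_patch=2, p_patch=16):
    """How many spacetime tokens a transformer backbone would see for a
    given video size -- illustrative patch sizes, not real model values."""
    return (frames // t_patch) * (height // p_patch) * (width // p_patch)

for res in [(16, 256, 256), (32, 512, 512)]:
    print(res, "->", num_tokens(*res), "tokens")
```

Doubling resolution and length multiplies the token count, but the model itself needs no architectural change – only more compute.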