
What are Text-to-Video Models in AI and How They are Changing the World

Within just two months of the first text-to-video tool being launched to the public, high-quality ads and short films are already being made with AI. Satyen K. Bordoloi scopes the field and argues that with AI filmmaking, we are at the next Lumière brothers’ moment.

On March 22, 1895, the Lumière Brothers showed the world the first ‘moving images’. They travelled the world showcasing their invention. A few years later, short films began to be made and shown using the technology, and about two decades later, feature-length films entered the zeitgeist.

Could the Lumière brothers ever have imagined that in the future you wouldn’t even need a camera to make ‘moving images’? (Image Credit: Wikipedia)

On February 15 this year, OpenAI teased their text-to-video AI model Sora. On June 10, Kuaishou Technology announced Kling in China. Two days later, on June 12, Luma AI became the first text-to-video (TTV) tool to open to everyone worldwide. A month and a half have passed since, and already several ads, short films, and trailers for films made entirely using AI have been released to the world, with the first feature film made entirely using AI expected before the year is over.

The Lumière brothers’ invention took almost two decades to percolate. But an equally groundbreaking leap in filmmaking – text-to-video – will do so in under a year. The juggernaut of mass-produced logic (artificial intelligence) creating specific art – AI doomsdayism notwithstanding – rolls on.

Director, screenwriter, and editor Sumit Purohit is among the first in India to use AI’s text-to-video capabilities to their full extent. This trailer was created by him almost entirely using different AI tools.

THE CASE FOR TEXT-TO-VIDEO:

Leonardo da Vinci used paint. Emily Dickinson and Walt Whitman chiselled words into poetry, while Akira Kurosawa painted his poetry onto celluloid. Anyone can fiddle with any art form, but true mastery requires years of specific learning and, as every artist would attest, decades of dedicated practice. What if you are not a ‘specificist’ but more of a generalist? You have bits of Da Vinci, Dickinson, Whitman, and Kurosawa in you, and though you know the technicalities, you can’t bring yourself to create art, perhaps because you have mastered none of them or are an introvert. Can a generalist ever become an artist, especially in cinema, which demands specific skill sets? Turns out that with AI, you can give even a Kurosawa or a Spielberg a run for their craft.

You can paint pictures with your words, not just in your mind, but on a screen. This is the realm of text-to-video, a technological marvel that is – as you saw in the examples above – terraforming the very way our ancient planet creates and consumes visual content – from social media reels to cinema. For individuals, the generalists, it is like having a personal director in their pockets, capable of turning their wildest imaginations into two-dimensional reality on any screen with just a few keystrokes.

THE TECHNOLOGY BEHIND TTV:

Text-to-video, as the name suggests, is the process of converting written text into video. You input text, algorithms decipher the meaning of the words you typed, then generate corresponding images (some tools also accept an image directly as input) and seamlessly stitch them together, with audio if the AI software has the capability, to create moving images, aka videos.

The backbone of this technology lies in complex models like transformers and diffusion, which enable machines to understand and generate human-like text and images. Transformers, renowned for their prowess in natural language processing, break down text into meaningful units and capture relationships between words. Diffusion models, on the other hand, excel at generating images by gradually adding details to a noisy starting point. By combining these powerful tools, text-to-video systems can bring written descriptions to life with astonishing accuracy and creativity.
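That “gradually adding details to a noisy starting point” can be caricatured in a few lines of Python. This is a toy sketch, not a real model: `fake_denoiser` is a hypothetical stand-in for the large text-conditioned neural network a real system would train, and it simply nudges a noisy array a little closer to a target pattern on each pass, mimicking how diffusion sampling refines pure noise into an image step by step.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_denoiser(x, target):
    # Stand-in for a learned network: in a real diffusion model a neural
    # net, conditioned on the text prompt, predicts a slightly cleaner
    # sample. Here we simply move 10% of the way toward the target.
    return x + 0.1 * (target - x)

target = np.linspace(0.0, 1.0, 8)   # the "image" the prompt describes
x = rng.standard_normal(8)          # start from pure Gaussian noise

for step in range(50):              # iterative refinement, step by step
    x = fake_denoiser(x, target)

# After refinement, x sits very close to the target pattern
print(np.max(np.abs(x - target)))
```

In a real text-to-video system, this loop runs over millions of pixels per frame, and it is the trained network, not a fixed target, that decides what detail each step should add.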

Luma AI became the first text-to-video platform released to the whole world to try.

THE BEST PLAYERS IN THE MARKET:

At this moment, the best AI models you can already use are Luma, Runway ML’s Gen 3 Alpha, and Kling. OpenAI was the first to announce, with Sora, but has not yet launched its product for the mass market. Meanwhile, Google’s Veo is in the offing and Pika is improving, among a host of other companies cropping up across the world in this new field.

I have found dozens of other ‘AI companies’ (there are probably hundreds) who claim to be text-to-video generators. But all they do is wrap the APIs of other text-to-video models, and they are mostly subpar. They are, in a way, fleecing paying customers because, hey – in the land of the blind, the one-eyed man is king. Some even allow deepfakes to be made, which is dangerous as it could lead to nonconsensual pornography and other harmful material.

However, the three that I mentioned – Luma, Runway Gen 3 Alpha, and Kling – are the cream right now. Their platforms offer you the ability not just to experiment with this technology, but to create videos with varying degrees of control and customization. What you’ll feel when you see your own words, or an image you have shot yourself or scraped from somewhere, turn into actual high-quality, at times cinematic video is nothing short of the unbridled awe you felt when you watched a magic show for the first time.

This 60-second video of a woman walking on the streets of Tokyo, created entirely using AI, became our generation’s equivalent of the Lumière Brothers’ 50-second ‘Train arriving at the station’ moment.

USE CASES OF TTV:

To put it simply, anywhere there is a need for a video, you have a use case for text-to-video. From short social media content like reels to its early use in SFX, VFX, and establishing shots in films, the use case of TTV is as wide as a creator’s imagination. But that is low-hanging fruit; some truly creative uses can be thought of as well. In education, for example, a teacher can create a quick and cheap TTV video to illustrate a scientific principle. You can use TTV for film restoration: a missing two minutes in an old two-hour film can be recreated using AI, the film’s surviving frames, and your imagination. If you have the shooting script of a lost film full of camera instructions and a few people who had seen the actual film before it got lost in time – the director would be best – you could recreate an entire film using TTV.

HOW TTV WILL CHANGE THE WORLD:

The most obvious way TTV will do so is by democratizing filmmaking. Call it the advent of no-camera filmmaking that I have been harping about in earlier Sify articles. Today, to make a film, you at least need a camera or a laptop. You can have friends and family act, shoot with a cheap camera or mobile, edit on a laptop, and thus make an extremely low-budget film. However, you cannot make a film laden with special effects with just these. With TTV, though, even the camera has been eliminated: without one, you can make a film that almost reaches the quality of popular Hollywood action films. You can turn your idea into a film with a few hundred dollars and an endless supply of your creative, malleable imagination.

Can you believe that this video was created entirely using AI by Sumit Purohit?

HOLLYWOOD, BOLLYWOOD CRUMBLE:

Filmmaking is going to transform the way the music industry did at the turn of the millennium. With the rise of the internet, the power of record labels to dictate what is ‘good music’ and curate your choices went away, and a ‘Long Tail’ emerged where people listened to what they liked, and even small creators could make it big. Of course, that also led to utter drivel being made in the name of music, and the general quality of both lyrics and songs has gone down, as all you need is a phone to make a song today. Yet it gave everyone the tools to make what they wanted, to experiment, and from that have also emerged some great musicians who would not have been around in the old world order.

This is exactly what will happen with cinema with the advent of TTV. Anyone with a phone will be able to make a feature film, and the world will be flooded with films that look good but may not be good. Filmmaking will become like blogging or songwriting, something anyone can do. Tools will emerge that use AI to automate the whole process, so that with almost zero technical skills, you and I could make our own feature film.

Kling from Kuaishou Technology is perhaps the best text-to-video model released to the public out there. While others can create only short videos, Kling can make great-quality videos up to 2 minutes long.

This will change the old film industries drastically. You may think tech companies acting as film production houses, like Netflix or Amazon Prime, are the peak of change in cinema. Wait till you see what happens when cinema is injected with the steroids of AI. Old hierarchies of control will melt away like chains made of wax in the heat of this new technology.

In Indian cinema, particularly Bollywood, the often-talentless stars who command obscene sums at the cost of the thousands of technicians and actors who work on a film will be a thing of the past. Just as Web 2.0 gave rise to the influencer and the YouTube star, Web 3.0 powered by AI, especially TTV, will give rise to the next superstar of cinema. And for all you know, like in Andrew Niccol’s prophetic film S1m0ne, this superstar could be entirely imaginary and non-existent.

In the 1890s, when the Lumière Brothers travelled the world demonstrating their camera and the magic it could do, they couldn’t have fathomed that in just over a century and a quarter, their very magic would stand on the brink of being upended by a far greater magic. The world was made rich by the Lumière Brothers. It will be made richer still with text-to-video.
