OpenAI, the developer of ChatGPT, has introduced SORA, an AI technology that turns text into video. The technology is currently available only to a limited number of people, but it marks a step beyond AI's established ability to create images.
Advances in Generative AI
Generative AI is a branch of AI that uses technologies such as deep learning to create new content. It covers the generation of text, images, video, audio, computer code, synthetic data, workflows, and models of physical objects, spanning computer-processable domains such as language and vision.
It’s no exaggeration to say that the phenomenal growth of generative AI began with the advent of the Transformer, a generative model introduced by Google in 2017. The Transformer is a neural network with an encoder-decoder structure that analyzes the meaning, position, and relationships of the words in a text, such as a sentence or paragraph, and learns context and meaning from them using the attention (or self-attention) mechanism. This Transformer model is the backbone of ChatGPT.
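The self-attention step described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not ETRI's or OpenAI's implementation; the toy input sizes are arbitrary assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise word-to-word relevance
    # row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                   # context-weighted mixture of values

# toy example: a "sentence" of 3 tokens with embedding dimension 4
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (3, 4): one context-aware vector per token
```

In a full Transformer, Q, K, and V are learned linear projections of the input, and many such attention heads run in parallel inside each encoder and decoder layer.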
In October 2020, the Vision Transformer extended the approach to the visual domain. Since then, a number of multimodal pre-trained models (vision-language pre-trained models) trained on text-image pairs have been announced, a departure from text-only pre-training. In these models, language is represented by conventional embeddings, while images are embedded by dividing them into patches. Contrastive learning-based models that learn the correlation between images and text in advance are also being investigated.
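The patch-based image embedding mentioned above can be sketched as follows. This is a hedged illustration of the Vision Transformer's first step only (splitting an image into non-overlapping patches and flattening them); the 224×224 image size and 16-pixel patch size are common ViT defaults assumed here, and a real model would then project each patch through a learned linear layer.

```python
import numpy as np

def image_to_patches(img, patch_size):
    """Split an H x W x C image into flattened non-overlapping patches (ViT-style)."""
    H, W, C = img.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image dims must be divisible by patch size"
    # reshape into a grid of patches, then flatten each patch into one vector
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)  # (num_patches, patch_dim)

img = np.zeros((224, 224, 3))          # assumed input size, as in the original ViT
patches = image_to_patches(img, 16)
print(patches.shape)  # (196, 768): 14*14 patches, each 16*16*3 values
```

Each flattened patch is then treated like a word embedding, so the same Transformer machinery used for text can process images.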
Technology Trends in Generative AI
Generative AI is currently being applied not only to images, but also to video, 3D models, and audio. DALL-E 3, Midjourney, Imagen, Parti, CLIP, and others can generate images based on text input. Adobe Firefly’s generative AI tool, Generative Fill, can also be used to retouch photos. You select an area to be retouched, enter text, and it generates an image to fill the area.
In language, research continues on large language models such as ChatGPT, Gemini, and LLaMA, which are based on the Transformer architecture. In music generation, Google has developed MusicLM and MusicVAE; other solutions include MuseGAN, FlowComposer, DeepBach, and DeepJazz.
A 4-second video created with Runway's Gen-2 by entering the prompt “cute puppy playing computer in the desert”/Courtesy of ETRI.
There is also a lot of movement in the video space. Runway’s Gen-2, StabilityAI’s Stable Video Diffusion, Google’s Lumiere, and OpenAI’s SORA are technologies that can turn text and images into video. Meta also presented Emu, an AI that can edit images and create videos.
ETRI is also participating in the generative AI market. It presented three KOALA models, which generate images five times faster than OpenAI’s DALL-E 3, and two models of Ko-LLaVa, an interactive visual language model that can answer questions while viewing images and videos. Future applications include games, movies, music, and design.
Source: ETRI Webzine
Copyright © Korea IT Times. Unauthorized reproduction and redistribution prohibited.