AI continues to generate plenty of light and heat. The best models in text and images—now commanding subscriptions and being woven into consumer products—are competing for inches. OpenAI, Google, and Anthropic are all, more or less, neck and neck.
It’s no surprise then that AI researchers are looking to push generative models into new territory. As AI requires prodigious amounts of data, one way to forecast where things are going next is to look at what data is widely available online, but still largely untapped.
Video, of which there is plenty, is an obvious next step. Indeed, last month, OpenAI previewed a new text-to-video AI called Sora that stunned onlookers.
But what about video…games?
Ask and Receive
It turns out there are quite a few gamer videos online. Google DeepMind says it trained a new AI, Genie, on 30,000 hours of curated video footage showing gamers playing simple platformers—think early Nintendo games—and now it can create examples of its own.
Genie turns a simple image, photo, or sketch into an interactive video game.
Given a prompt, say a drawing of a character and its surroundings, the AI can then take input from a player to move a character through its world. In a blog post, DeepMind showed Genie’s creations navigating 2D landscapes, walking around or jumping between platforms. Like a snake eating its tail, some of these worlds were even sourced from AI-generated images.
In contrast to traditional video games, Genie generates these interactive worlds frame by frame. Given a prompt and command to move, it predicts the most likely next frames and creates them on the fly. It even learned to include a sense of parallax, a common feature in platformers where the foreground moves faster than the background.
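To make the frame-by-frame idea concrete, here is a minimal toy sketch in Python. It is not DeepMind's model: the `predict_next_frame` function is a hand-written stand-in for a learned predictor, and the `Frame` type, the layer speeds, and the action names are all assumptions made for illustration. It shows only the loop structure, in which each new frame is produced from the previous frame plus one player command, and how a faster-moving foreground layer gives a parallax effect.

```python
# Toy illustration only: a hand-written stand-in for a learned next-frame model.
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    foreground_offset: int  # how far the near layer has scrolled, in pixels
    background_offset: int  # how far the far layer has scrolled, in pixels


# Hypothetical scroll speeds (pixels per step); the near layer moves faster.
FOREGROUND_SPEED = 4
BACKGROUND_SPEED = 1


def predict_next_frame(frame: Frame, action: str) -> Frame:
    """Stand-in for the model: build the next frame from the current frame and one action."""
    # Actions other than "left" and "right" (such as "jump") do not scroll the layers here.
    direction = {"left": -1, "right": 1}.get(action, 0)
    return Frame(
        frame.foreground_offset + direction * FOREGROUND_SPEED,
        frame.background_offset + direction * BACKGROUND_SPEED,
    )


def rollout(start: Frame, actions: list[str]) -> list[Frame]:
    """Generate frames one at a time, feeding each output back in as the next input."""
    frames = [start]
    for action in actions:
        frames.append(predict_next_frame(frames[-1], action))
    return frames


frames = rollout(Frame(0, 0), ["right", "right", "left", "jump"])
for frame in frames:
    print(frame)
```

In a real system the stand-in function would be a neural network predicting pixels, but the feedback loop, where each output becomes the next input, is the part that distinguishes this approach from a conventional game engine.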
Notably, the AI’s training didn’t include labels. Rather, Genie learned to correlate input commands (go left, go right, or jump) with in-game movements simply by observing examples in its training data. That is, when a character in a video moved left, there was no label linking the command to the motion. Genie figured that part out by itself. That means, potentially, future versions could be trained on as much applicable video as there is online.
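A rough sense of how actions might be recovered without labels can be given with a toy Python sketch. It is an assumption-laden illustration, not the method from the Genie paper: it uses simple leader clustering on a character's frame-to-frame displacement, and the `infer_latent_actions` function, the `tolerance` parameter, and the sample positions are all hypothetical. The point is only that recurring kinds of motion can be grouped into a small set of discrete codes without anyone saying which button was pressed.

```python
# Toy illustration only: leader clustering on frame-to-frame motion,
# standing in for a learned latent-action model. No action labels are used.


def infer_latent_actions(positions, tolerance=0.5):
    """Assign each transition between consecutive (x, y) positions a discrete code.

    A new code is created whenever a transition's displacement is not within
    `tolerance` of the displacement that defines an existing code.
    """
    codebook = []  # representative displacement for each discovered code
    codes = []     # code index assigned to each transition
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        dx, dy = x1 - x0, y1 - y0
        for index, (cx, cy) in enumerate(codebook):
            if abs(dx - cx) <= tolerance and abs(dy - cy) <= tolerance:
                codes.append(index)
                break
        else:
            # No existing code matched, so this motion becomes a new code.
            codebook.append((dx, dy))
            codes.append(len(codebook) - 1)
    return codes, codebook


# Hypothetical character positions extracted from consecutive video frames.
positions = [(0, 0), (1, 0), (2, 0), (1, 0), (1, 2), (2, 2)]
codes, codebook = infer_latent_actions(positions)
print(codes)
print(codebook)
```

Here the discovered codes happen to line up with moving right, moving left, and jumping, even though those names never appear in the data. A player's controls could then be mapped onto the discovered codes.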
The AI is an impressive proof of concept, but it’s still very early in development, and DeepMind isn’t planning to make the model public yet.
The games themselves are pixelated worlds streaming by at a plodding one frame per second. By comparison, contemporary video games can hit 60 or 120 frames per second. Like other generative algorithms, Genie can also produce strange or inconsistent visual artifacts, and it’s prone to hallucinating “unrealistic futures,” the team wrote in their paper describing the AI.
That said, there are a few reasons to believe Genie will improve from here.
Whipping Up Worlds
Because the AI can learn from unlabeled online videos and is still a modest size (just 11 billion parameters), there’s ample opportunity to scale up. Bigger models trained on more information tend to improve dramatically. And with a growing industry focused on inference, the process by which a trained AI performs tasks like generating images or text, it’s likely to get faster.
DeepMind says Genie could help people, like professional developers, make video games. But like OpenAI—which believes Sora is about more than videos—the team is thinking bigger. The approach could go well beyond video games.
One example: AI that can control robots. The team trained a separate model on video of robotic arms completing various tasks. The model learned to manipulate the robots and handle a variety of objects.
DeepMind also said Genie-generated video game environments could be used to train AI agents. It’s not a new strategy. In a 2021 paper, another DeepMind team outlined a video game called XLand that was populated by AI agents and an AI overlord generating tasks and games to challenge them. The idea that the next big step in AI will require algorithms that can train one another or generate synthetic training data is gaining traction.
All this is the latest salvo in an intense competition between OpenAI and Google to show progress in AI. While others in the field, like Anthropic, are advancing multimodal models akin to GPT-4, Google and OpenAI also seem focused on algorithms that simulate the world. Such algorithms may be better at planning and interaction, two skills that will be crucial for the AI agents the two organizations seem intent on producing.
“Genie can be prompted with images it has never seen before, such as real world photographs or sketches, enabling people to interact with their imagined virtual worlds—essentially acting as a foundation world model,” the researchers wrote in the Genie blog post. “We focus on videos of 2D platformer games and robotics but our method is general and should work for any type of domain, and is scalable to ever larger internet datasets.”
Similarly, when OpenAI previewed Sora last month, researchers suggested it might herald something more foundational: a world simulator. That is, both teams seem to view the enormous cache of online video as a way to train AI to generate its own video, yes, but also to more effectively understand and operate out in the world, online or off.
Whether this pays dividends, or is sustainable long term, is an open question. The human brain operates on a light bulb’s worth of power; generative AI uses up whole data centers. But it’s best not to underestimate the forces at play right now—in terms of talent, tech, brains, and cash—aiming to not only improve AI but make it more efficient.
We’ve seen impressive progress in text, images, audio, and all three together. Videos are the next ingredient being thrown in the pot, and they may make for an even more potent brew.
Image Credit: Google DeepMind