AI video generators like OpenAI’s Sora, Luma AI’s Dream Machine, and Runway Gen-3 Alpha have been stealing the headlines lately, but a new Google DeepMind tool could fix the one weakness they all share – a lack of accompanying audio.
A Google DeepMind post has revealed a new video-to-audio (or ‘V2A’) tool that uses a combination of pixels and text prompts to automatically generate soundtracks and soundscapes for AI-generated videos. In short, it’s another big step toward the creation of fully automated movie scenes.
As you can see in the videos below, this V2A tech can combine with AI video generators (including Google’s Veo) to create an atmospheric score, timely sound effects, or even dialogue that Google DeepMind says “matches the characters and tone of a video”.
Creators aren’t stuck with one audio option either – DeepMind’s new V2A tool can apparently generate an “unlimited number of soundtracks for any video input”, and you can nudge it towards your desired outcome with a few simple text prompts.
Google says its tool stands out from rival tech thanks to its ability to generate audio based purely on pixels – giving it a guiding text prompt is apparently optional. But DeepMind is also very aware of the major potential for misuse and deepfakes, which is why this V2A tool is being ringfenced as a research project – for now.
DeepMind says that “before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing”. It will certainly need to be rigorous, because the ten short video examples show that the tech has explosive potential, for both good and bad.
The potential for amateur filmmaking and animation is huge, as shown by the ‘horror’ clip below and one for a cartoon baby dinosaur. A Blade Runner-esque scene (below) showing cars skidding through a city with an electronic music soundtrack also shows how it could drastically reduce budgets for sci-fi movies.
Concerned creators will at least take some comfort from the obvious dialogue limitations shown in the ‘Claymation family’ video. But if the last year has taught us anything, it’s that DeepMind’s V2A tech is likely to improve drastically from here.
Where we’re going, we won’t need voice actors
The combination of AI-generated videos with AI-created soundtracks and sound effects is a game-changer on many levels – and adds another dimension to an arms race that was already white hot.
OpenAI has already said that it has plans to add audio to its Sora video generator, which is due to launch later this year. But DeepMind’s new V2A tool shows that the tech is already at an advanced stage and can create audio based purely on video, rather than needing endless prompting.
DeepMind’s tool works using a diffusion model that combines information taken from the video’s pixels and the user’s text prompts, then outputs compressed audio that is decoded into an audio waveform. It was apparently trained on a combination of video, audio, and AI-generated annotations.
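For readers who want a feel for that data flow, here is a deliberately tiny Python sketch of the stages described above: encode the video and the optional prompt, refine random noise step by step toward something that fits them, then decode the result into a waveform. It is not DeepMind’s code or anything close to it. Every function name and number is hypothetical, and the learned neural network a real diffusion model would use is swapped for a simple fixed-size step toward a target.

```python
import math
import random


def encode_video(frames):
    """Hypothetical stand-in for a video encoder: mean pixel value per frame."""
    return [sum(frame) / len(frame) for frame in frames]


def encode_prompt(prompt):
    """Hypothetical stand-in for a text encoder: one number derived from the prompt."""
    if not prompt:
        return 0.0  # the prompt is optional, so an empty one adds nothing
    return (sum(ord(ch) for ch in prompt) % 100) / 100.0


def denoise(frames, prompt="", steps=20, seed=0):
    """Start from random noise and refine it toward the conditioning signal.

    Assumption: a real diffusion model predicts the noise to remove with a
    trained network. Here each step just moves halfway toward a target built
    from the video and prompt features.
    """
    rng = random.Random(seed)
    text_feature = encode_prompt(prompt)
    target = [v + text_feature for v in encode_video(frames)]
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise to begin with
    for _ in range(steps):
        latent = [z + 0.5 * (t - z) for z, t in zip(latent, target)]
    return latent  # plays the role of the "compressed audio"


def decode(latent, samples_per_step=4):
    """Toy decoder: expand each latent value into waveform samples in [-1, 1]."""
    wave = []
    for z in latent:
        amplitude = math.tanh(z)
        for k in range(samples_per_step):
            wave.append(amplitude * math.sin(2 * math.pi * k / samples_per_step))
    return wave


# Two tiny "frames" of three pixels each, with and without a guiding prompt.
frames = [[0.1, 0.2, 0.3], [0.5, 0.5, 0.5]]
pixels_only = decode(denoise(frames))
with_prompt = decode(denoise(frames, prompt="rain"))
```

Changing the seed or the prompt changes the starting noise or the target, which is a rough analogue of how the same clip can yield many different soundtracks.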
Exactly what content this V2A tool was trained on isn’t clear, but Google has a potentially huge advantage in owning the world’s biggest video-sharing platform, YouTube. Neither YouTube nor its terms of service are completely clear on how its videos might be used to train AI, but YouTube’s CEO Neal Mohan recently told Bloomberg that some creators have contracts that allow their content to be used for training AI models.
The tech still has some limitations with dialogue, and it’s still a long way from producing a Hollywood-ready finished article. But it’s already a potentially powerful tool for storyboarding and amateur filmmakers, and hot competition with the likes of OpenAI means it’s likely to improve rapidly from here.