DeepMind’s AI creates soundtracks for videos
What’s the story
Google’s DeepMind has introduced a new artificial intelligence (AI) tool capable of generating soundtracks for videos.
The tool utilizes both video content and text prompts to create audio.
This allows users to produce scenes with a dramatic score, realistic sound effects, or dialog that matches the characters and tone of a video.
DeepMind’s website showcases examples of the AI tool’s capabilities.
How does it work?
The AI tool can generate audio based on specific text prompts.
For example, Google used the prompt “cars skidding, car engine throttling, angelic electronic music” to create a soundtrack for a video showing a car driving through a cyberpunk-style cityscape.
Another example involved creating an underwater soundscape using the prompt “jellyfish pulsating under water, marine life, ocean.”
Text prompts are optional, however; the tool does not require them.
The tool offers unlimited audio options
Users of DeepMind’s new AI tool are not required to precisely align the generated audio with corresponding scenes in the video.
The tool can produce an unlimited number of soundtracks for videos, providing users with endless audio options.
This feature sets it apart from other similar tools such as ElevenLabs’s sound effects generator, which also uses text prompts to generate audio.
It could simplify audio-video pairing
The AI tool was trained on audio, video, and annotations containing detailed descriptions of sound as well as transcripts of spoken dialog.
This training enables the video-to-audio generator to match audio events with visual scenes accurately.
It could simplify the process of pairing audio with AI-generated video from tools like DeepMind’s Veo and OpenAI’s Sora.
However, the tool has some limitations that DeepMind is working to address.
It is undergoing improvements and testing
One limitation of DeepMind’s new AI tool is that it cannot yet reliably synchronize lip movement with dialog, which the company is working to improve.
The quality of the video-to-audio system is also dependent on video quality; grainy or distorted videos can result in a noticeable drop in audio quality.
The tool is not yet available for general use as it still needs to undergo rigorous safety assessments and testing.