Microsoft recently unveiled its cutting-edge text-to-speech AI language model VALL-E, which it claims can mimic any voice — including its emotional tone, vocal timbre and even the background noise — from just a three-second sample of that voice.
The researchers believe VALL-E could work as a high-quality text-to-speech synthesizer, as well as a speech editor that could doctor audio recordings to include phrases the speaker never actually said. The developers also say that, coupled with generative AI models like OpenAI’s GPT-3, VALL-E could be used to create original audio content.
The development has some experts sounding alarm bells over the technology’s potential for misuse: with VALL-E and other generative AI programs, malicious actors could mass-produce audio-based disinformation at unprecedented scale, sources say.
How does VALL-E work?
Unlike previous speech synthesizers, most of which work by modulating waveforms to sound similar to human speech, VALL-E analyzes a short voice sample and then generates the most likely rendering of what that voice would sound like saying new text, drawing on its thousands of hours of training data, according to Microsoft’s paper.
To provide enough data to match almost any voice sample imaginable, VALL-E was trained on a whopping 60,000 hours of speech from over 7,000 unique speakers, using Meta’s LibriLight audio library — in comparison, current text-to-speech systems average less than 600 hours of training data, the authors wrote.
The result, according to researchers, is a model that outperforms current state-of-the-art text-to-speech generators in terms of “speech naturalness and speaker similarity.”
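At a high level, the paper frames speech synthesis as a language-modeling problem over discrete audio tokens: a neural codec compresses audio into token sequences, a language model predicts the tokens a target voice would produce for new text, and the codec decodes them back into sound. The sketch below illustrates that data flow only; every class and function in it is a hypothetical placeholder for illustration, not Microsoft’s code or any released API.

```python
# Conceptual sketch of a codec-language-model TTS pipeline.
# All names here are invented stand-ins, not a real library.

from dataclasses import dataclass
from typing import List
import random


@dataclass
class NeuralCodec:
    """Stand-in for an audio codec that maps waveforms to discrete tokens."""
    codebook_size: int = 1024

    def encode(self, waveform: List[float]) -> List[int]:
        # Real codecs emit a few dozen token frames per second of audio;
        # here we just hash every 160th sample into a token id.
        return [hash(round(s, 3)) % self.codebook_size for s in waveform[::160]]

    def decode(self, tokens: List[int]) -> List[float]:
        # A real decoder would reconstruct audio; we return dummy samples.
        return [t / self.codebook_size for t in tokens]


def phonemize(text: str) -> List[str]:
    # Placeholder grapheme-to-phoneme step.
    return list(text.lower())


def acoustic_language_model(phonemes: List[str], prompt_tokens: List[int],
                            codebook_size: int) -> List[int]:
    # Stand-in for the trained model: conditioned on the target text and the
    # three-second prompt's tokens, it emits new acoustic tokens that are
    # meant to carry the prompt speaker's timbre and tone.
    rng = random.Random(len(prompt_tokens))
    return [rng.randrange(codebook_size) for _ in range(len(phonemes) * 4)]


codec = NeuralCodec()
enrollment_clip = [0.0] * 48_000          # roughly 3 seconds of 16 kHz audio (dummy)
prompt_tokens = codec.encode(enrollment_clip)
tokens = acoustic_language_model(phonemize("hello world"), prompt_tokens,
                                 codec.codebook_size)
synthesized = codec.decode(tokens)        # waveform "in" the prompt's voice
print(f"{len(prompt_tokens)} prompt tokens -> {len(synthesized)} output samples")
```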
Samples of the model’s capabilities are available online. While some of the generated clips sounded obviously fake, others approached and even achieved natural-sounding speech. As AI continues to develop at a breakneck pace, some experts believe VALL-E could soon provide near-perfect imitations of anybody’s voice.
VALL-E’s unveiling preceded reports that Microsoft plans to invest $10 billion in OpenAI, the Elon Musk-cofounded startup behind GPT-3 (one of the most powerful language models available) and its mega-viral chatbot application, ChatGPT. It’s unclear whether VALL-E’s development influenced the reported investment.
Ease of use
Brett Caraway, an associate professor of media economics at the University of Toronto, said voice mimicking synthesizers already exist — but they require a great deal of clean audio data to pull off convincing speech.
With technology like VALL-E, however, anyone could achieve the same results with a couple of seconds of audio.
“VALL-E lowers the threshold or barrier to replicating somebody else’s voice,” Caraway said. “So, in making it easier to do, it creates a risk of proliferation of content because more people will be able to do it more quickly with less resources.”
“It’s going to create a real crisis in managing disinformation campaigns. It’s going to be harder to spot and it’s going to be overwhelming in terms of the volume of disinformation potentially,” he said.
Loss of trust
Bad actors could pair an altered voice with manufactured video to make anyone appear to say anything, Caraway continued. Spam and scam callers could phone people pretending to be someone they’re not. Fraudsters could use it to bypass voice identification systems — and that’s just the tip of the iceberg. Eventually, Caraway is concerned “it could erode people’s trust across the board.”
Abhishek Gupta, founder and principal researcher at the Montreal AI Ethics Institute, agreed. Over email, he wrote: “There is the potential for erosion of our belief in provided testimony, evidence, and other attestations, since there is always the claim that someone can make that their voice’s likeness was replicated and they didn’t say any of the things that are being attributed to them.
“This further diminishes the health of the information ecosystem and makes trust a very tenuous commodity in society.”
Gupta also noted that artists who rely on their voices to make a living could be affected, since it’s now possible to steal someone’s voice for projects that would previously have required paying them.
How can we prevent harm?
Gupta believes it’s time to assemble a “multidisciplinary set of stakeholders who carry domain expertise across AI, policy-making and futures thinking” to proactively prepare for future challenges instead of simply reacting to every new advancement.
“Leaning in on existing research in the areas of accountability, transparency, fairness, privacy, security, etc. as it relates to AI can help alleviate the severity of the challenges that one might encounter in the space,” he continued.
Microsoft’s researchers acknowledged VALL-E’s potential for harm in their conclusion, saying its abilities “may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E.”
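The paper does not describe how such a detector would be built. As one generic illustration of the idea, a binary classifier could be trained on spectrogram features of real versus synthesized clips; the minimal sketch below assumes PyTorch is installed and stubs the data with random tensors, and is not Microsoft’s detection model.

```python
# Illustrative only: a tiny classifier over log-mel spectrograms that labels
# a clip as real (0) or synthesized (1). Dataset loading is stubbed out.

import torch
import torch.nn as nn


class SyntheticSpeechDetector(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, 2),   # logits: [real, synthesized]
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time)
        return self.net(mel)


model = SyntheticSpeechDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch standing in for real/generated spectrograms and their labels.
mels = torch.randn(8, 1, 80, 200)
labels = torch.randint(0, 2, (8,))

logits = model(mels)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(f"one training step done, loss = {loss.item():.3f}")
```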
While he agreed it could help, Caraway is skeptical that relying solely on AI-detection software is enough: as detection models advance, so too will techniques for evading them. Instead, he believes media-literacy education is the best solution — teaching kids from a young age how to find trustworthy information online.
“One of the things that I have been a proponent of is trying to institute media literacy and information literacy starting in preschool,” he said.
“I also think a key component here is recommitting to good journalism … not just in expression, but in terms of investment into quality journalism. We need it now more than ever.”