Earlier this month, Meta announced an audio-based generative AI system called Voicebox.
In a blog post published last week, Facebook parent company Meta announced Voicebox. According to Meta, Voicebox is “the most versatile AI for speech generation.” The software was built with an eye towards assisting with audio editing, sampling, and styling—and accessibility.
What ChatGPT is to text and DALL-E is to art, Voicebox is to audio.
“Voicebox can produce high quality audio clips and edit pre-recorded audio—like removing car horns or a dog barking—all while preserving the content and style of the audio,” Meta wrote in describing the new software. “The model is also multilingual and can produce speech in six languages.” The model was trained using over 50,000 hours of unfiltered audio in English, French, Spanish, German, Polish, and Portuguese.
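According to Meta, speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with a 1% error rate degradation.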
In the post, Meta says Voicebox can help accomplish a variety of tasks, but highlights four major use cases: contextual text-to-speech synthesis, speech editing and noise reduction, cross-lingual style transfer, and diverse speech sampling. Given Meta’s statement in the post that Voicebox represents “an important step forward in our generative AI research,” one can’t help but be tantalized by the accessibility-related implications of Voicebox. Meta admits as much at the beginning of its announcement, writing in part that Voicebox could “allow visually impaired people to hear written messages from friends in their voices.”
It’s also worth wondering about the so-called diverse speech sampling capability. Although Meta characterizes it in terms of foreign language speakers, the reality is that another way people “talk in the real world” is with atypical speech patterns like stutters. Accents are one thing, but speech impairments are another. Speech delays are disabilities too. Given the preponderance of voice-first interfaces like Alexa and Siri, not to mention those in mixed reality headsets like Apple’s forthcoming Vision Pro, it’s interesting to consider how well a tool like Voicebox can work with an audio sample that doesn’t contain fluent (aka typical) speech. If, as Meta claims, Voicebox is designed to be representative of how humans talk in the real world, it’s logical to ask how accessible it is.
Meta’s insistence that Voicebox could positively influence accessibility is not a mere footnote to this news. However valid the concerns about the negative ramifications of artificially intelligent systems like ChatGPT, media coverage would be well served by more balance. To wit, for every dystopian piece put out into the world, there are people in the disability community salivating over AI’s potential to do good. Voicebox is one example, as is ChatGPT, which makes research not merely more convenient but more accessible too. On the latter point, it’s something Microsoft’s chief accessibility officer, Jenny Lay-Flurrie, sees for herself in her personal life. In an interview with me back in April, she shared an anecdote about her autistic daughter, who used the ChatGPT-enhanced Bing to do research for an essay for her high school English class. Indeed, Lay-Flurrie is very bullish on chatbots’ future.
“[AI chatbots] collate so much information for you very, very quickly. It can save a lot of time,” Lay-Flurrie said of chatbots’ notable accessibility gains for disabled people. “If you think about someone from a mobility perspective, you can get the right level of information at your fingertips with a couple of clicks as opposed to having to conduct 10 to 20 different searches and go to multiple websites; it can be right there for you. It’s going to be very impactful for particularly neurodiversity… I think about dyslexia [and] dyspraxia. There’s a learning process to it. We’re definitely learning as we go [and learning] how to get the best out of the tools. I think there are some pretty profound implications.”
AI’s ever-burgeoning rise in prominence (and in capability) doesn’t mean we’re destined to be enslaved by our sentient computer overlords.
According to a report from Digital Trends’ Fionna Agomuoh last week, Meta currently has no plans to release Voicebox or its source code to the public. As Agomuoh noted, the FBI has grown increasingly concerned with so-called “deep fake content.” The law enforcement agency has issued warnings over time on crimes involving, as Agomuoh wrote in her story, “extortion, blackmail, and harassment.”
Meta said the decision to withhold Voicebox’s public release comes down in part to it being “necessary to strike the right balance between openness with responsibility” when building AI-based technologies.