The Rise of Meta’s Voicebox - A New Era of Speech Synthesis

Meta’s new ‘Voicebox’ AI is a text-to-speech tool

Today, we stand on the brink of an era in which preserving the voices of our beloved celebrities is becoming a tangible reality. Meta has introduced Voicebox, a text-to-speech model with the potential to change how spoken audio is produced and edited.

The release follows ChatGPT and DALL-E, which have already captivated us with their ability to generate text and images.


Voicebox is a generative model of a different kind. Instead of producing written narratives or images, it generates audio clips.

Meta defines the model as "a non-autoregressive flow-matching model trained to infill speech, given audio context and text." Voicebox was trained on more than 50,000 hours of unfiltered audio.
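To make "non-autoregressive flow matching" a little more concrete, here is a toy sketch. It is not Meta's implementation. It assumes the simplest straight-line path from noise to data, and the `target` list is a hypothetical stand-in for speech features. A real model would learn the velocity field with a neural network conditioned on text and audio context. What the sketch does show is that every frame is updated in parallel at each solver step, instead of being generated one after another.

```python
import random

def velocity(x, target, t):
    """Toy velocity field that moves x toward target along a straight line.

    In a trained flow-matching model a neural network predicts this
    quantity. Here it is known in closed form (an assumption for the demo).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def generate(target, steps=8, seed=0):
    """Integrate dx/dt = velocity from t=0 (noise) toward t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so velocity() never divides by zero
        v = velocity(x, target, t)
        # All frames are updated together: no left-to-right decoding.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target = [0.2, -1.0, 0.5, 0.9]  # hypothetical "speech feature" frames
sample = generate(target)
```

With this closed-form field the final Euler step lands on `target`, so `sample` matches it up to floating-point rounding. A learned field would only approximate the data distribution.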

Voicebox is a great text-to-output generator unlike any other. / Meta

Meta drew the recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese.

The diversity of this dataset is what allows Voicebox to generate natural-sounding speech even when the people in a conversation do not speak the same language.

Introducing Voicebox: The Most Versatile AI for Speech Generation / Meta

The researchers report that "speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech." Models trained on Voicebox output showed only a 1 percent degradation in error rate, compared with a 45 to 70 percent degradation for models trained on synthetic speech from existing text-to-speech systems.

The Meta researchers also explain how Voicebox works. The system was first trained to predict a segment of speech from the surrounding segments and the corresponding transcript. Having learned to infill speech from context, the model can then be applied to a range of speech-generation tasks.
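The training setup described above can be pictured as a masking step. The sketch below is illustrative only and is not Meta's code: the frame values, the `mask_span` helper, the squared-error loss, and the mean-guessing "model" are all hypothetical stand-ins.

```python
def mask_span(frames, start, end):
    """Split frames into visible context and the hidden span to predict.

    Returns (context, mask): masked positions in context are None, and
    mask[i] is True for each frame the model must fill in.
    """
    mask = [start <= i < end for i in range(len(frames))]
    context = [None if m else f for f, m in zip(frames, mask)]
    return context, mask

def infill_loss(predicted, frames, mask):
    """Mean squared error computed only over the masked frames."""
    errors = [(p - f) ** 2 for p, f, m in zip(predicted, frames, mask) if m]
    return sum(errors) / len(errors)

frames = [0.1, 0.4, 0.3, 0.8, 0.6, 0.2]  # hypothetical audio features
transcript = "hello world"  # a real model would also condition on this text
context, mask = mask_span(frames, start=2, end=4)

# Stand-in "model": guess the mean of the visible frames for every gap.
visible = [f for f in context if f is not None]
guess = sum(visible) / len(visible)
predicted = [guess if m else f for f, m in zip(frames, mask)]
loss = infill_loss(predicted, frames, mask)
```

Scoring only the masked frames is the design choice that matters here: the model is rewarded for reconstructing what it cannot see, using the audio on either side as context.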

Voicebox can resynthesize the portion of speech corrupted by short-duration noise, or replace misspoken words without having to rerecord the entire speech / Meta

It can generate new portions in the middle of an audio recording without recreating the entire input. Voicebox can also edit audio clips, removing unwanted noise or replacing misspoken words. Much as a photographer works with image-editing software, a user can identify the corrupted segment of speech, crop it, and instruct the model to regenerate that segment.

Text-to-speech generators have been part of our lives for quite some time, epitomized by the familiar voices of our parents' TomToms, and the likes of Speechify and ElevenLabs' Prime Voice AI have pushed the boundaries further.

Like generative systems for images and text, Voicebox creates outputs in a vast variety of styles / Meta

Nonetheless, these systems still need a large amount of source material to imitate a voice, plus a further batch of data for each new voice they take on.

Voicebox, by contrast, performs zero-shot text-to-speech and is trained with a method called Flow Matching. In Meta's benchmarks it beats the current state of the art, VALL-E, on intelligibility, with a 1.9 percent word error rate compared with VALL-E's 5.9 percent.
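For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition. The example sentences are made up, and the text normalization Meta applied in its evaluation is not known here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the ref words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One wrong word out of five gives a WER of 20 percent.
wer = word_error_rate("the quick brown fox jumps",
                      "the quick brown box jumps")
```

On this scale, a 1.9 percent word error rate means roughly two misrecognized words in every hundred.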

Voicebox can generate speech that is more representative of how people talk / Meta

Its "audio similarity" score is 0.681, ahead of the 0.580 achieved by the prior state of the art. Meta also reports that Voicebox runs up to 20 times faster.
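The article does not say how the "audio similarity" score is computed. A common way to score how closely a generated voice matches a reference is the cosine similarity between speaker embeddings, so the sketch below assumes that approach. The three-dimensional vectors are made up for illustration. Real speaker embeddings have hundreds of dimensions and come from a separate speaker-verification model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

prompt_embedding = [1.0, 0.0, 1.0]     # hypothetical reference voice
generated_embedding = [1.0, 1.0, 0.0]  # hypothetical synthesized voice
score = cosine_similarity(prompt_embedding, generated_embedding)
```

Under this assumption a score of 1.0 would mean the two embeddings point in the same direction, and higher values indicate a closer match to the reference voice.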

Despite these results, Meta has decided not to release the Voicebox model or its source code to the public, citing the potential for misuse. It has, however, shared audio examples and published the initial research paper.

We can imagine this technology finding its way into prosthetics for people with vocal cord damage, giving voices to in-game non-player characters, and powering digital assistants.

Voicebox achieves new state-of-the-art results, outperforming VALL-E and YourTTS on word error rate. / Meta

As a voiceover artist, I must admit that the unveiling of Voicebox and its extraordinary capabilities fills me with awe and trepidation. While I am undeniably captivated by the sheer brilliance of this technological breakthrough, a tinge of unease lingers within me.

It is hard not to feel vulnerable as I contemplate the potential implications for our profession. Thinking of a future where synthetic counterparts may replace our voices evokes a bittersweet blend of wonder and concern. Yet, amidst these swirling emotions, I remain hopeful that our unique artistic abilities and the essence of human expression will continue to hold their irreplaceable value in this evolving landscape.
