Last week, Meta Platforms’ artificial intelligence research arm introduced Voicebox, a machine learning model that can generate speech from text. What sets Voicebox apart from other text-to-speech models is its ability to perform many tasks that it has not been trained for, including editing, noise removal and style transfer.

The model was trained using flow matching, a method developed by Meta researchers. Meta has not released Voicebox because of ethical concerns about misuse, but the initial results are promising and could power many applications in the future.

‘Flow Matching’

Voicebox is a generative model that can synthesize speech in six languages: English, French, Spanish, German, Polish and Portuguese. Like large language models (LLMs), it has been trained on a very general task that can be used for many applications. But while LLMs learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.

Such a model can then be applied to many downstream tasks with little or no fine-tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” Meta’s researchers write in their paper (PDF) describing the technical details of Voicebox.

The model was trained with Meta’s “flow matching” technique, which Meta says is more efficient and generalizable than the diffusion-based learning methods used in other generative models. The technique enables Voicebox to “learn from varied speech data without those variations having to be carefully labeled.” Without the need for manual labeling, the researchers were able to train Voicebox on 50,000 hours of speech and transcripts from audiobooks.

The model uses “text-guided speech infilling” as its training objective: it must predict a segment of speech given the surrounding audio and the complete text transcript. During training, the model is given an audio sample and its corresponding text. Parts of the audio are masked, and the model tries to generate the masked parts using the surrounding audio and the transcript as context. By doing this over and over, the model learns to generate natural-sounding speech from text in a generalizable way.
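To make the infilling setup concrete, here is a minimal Python sketch of how one training example might be assembled. It is an illustration only, not Meta's implementation: the function name, the `mask_fraction` parameter, the single contiguous masked span, and the use of one number per audio frame are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional


def make_infilling_example(
    frames: List[float],
    transcript: str,
    mask_fraction: float = 0.3,
    seed: Optional[int] = None,
) -> Dict[str, object]:
    """Build one hypothetical infilling example from audio frames and a transcript.

    A contiguous span of frames is hidden. The model would see the remaining
    frames plus the full transcript and be asked to regenerate the hidden span.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    if not 0.0 < mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be in (0, 1]")

    rng = random.Random(seed)
    n = len(frames)
    span = max(1, int(n * mask_fraction))   # number of frames to hide
    start = rng.randint(0, n - span)        # where the hidden span begins

    # True marks a frame the model must generate.
    mask = [start <= i < start + span for i in range(n)]
    # Context: the audio with hidden frames blanked out (assumed zero fill).
    context = [0.0 if hidden else value for value, hidden in zip(frames, mask)]
    # Target: the original values of the hidden frames.
    target = [value for value, hidden in zip(frames, mask) if hidden]

    return {
        "context": context,
        "mask": mask,
        "target": target,
        "transcript": transcript,  # the transcript is never masked
    }


example = make_infilling_example(
    [float(i) for i in range(1, 11)], "hello world", seed=0
)
print(sum(example["mask"]), "of", len(example["mask"]), "frames are masked")
```

The point of the sketch is that only the audio is hidden. The transcript stays complete, which is what lets the same trained model later generate speech for text it has no audio for.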

Replicating voices across languages, editing out mistakes in speech, and more

Unlike generative models that are trained for a specific application, Voicebox can perform many tasks that it has not been trained for. For example, the model can use a two-second voice sample to generate speech for new text. Meta says this capability can be used to bring speech to people who are unable to speak, or to customize the voices of non-player characters in games and virtual assistants.

Voicebox also performs style transfer in different ways. For example, you can provide the model with two audio samples and their transcripts. It will use the first audio sample as a style reference and modify the second one to match the voice and tone of the reference. Interestingly, the model can do the same thing across different languages, which could be used to “help people communicate in a natural, authentic way — even if they don’t speak the same languages.”

The model can also do a variety of editing tasks. For example, if a dog barks in the background while you’re recording your voice, you can provide the audio and transcript to Voicebox and mask out the segment with the background noise. The model will use the transcript to generate the missing portion of the audio without the background noise.

The same technique can be used to edit speech. For example, if you have misspoken a word, you can mask that portion of the audio sample and pass it to Voicebox along with a transcript of the edited text. The model will generate the missing part with the new text in a way that matches the surrounding voice and tone.
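Both editing tasks come down to choosing which part of the recording to mask. The sketch below shows one way a time span in seconds could be turned into a per-frame mask. The helper name and the rate of 100 frames per second are assumptions for illustration, and Voicebox itself has not been released, so this does not reflect any actual interface.

```python
import math
from typing import List


def span_to_frame_mask(
    start_s: float,
    end_s: float,
    num_frames: int,
    frames_per_second: int = 100,  # assumed frame rate, for illustration only
) -> List[bool]:
    """Return a mask with True for every frame that overlaps [start_s, end_s)."""
    if not 0.0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    first = int(start_s * frames_per_second)                        # round down
    last = min(num_frames, math.ceil(end_s * frames_per_second))    # round up, clamp
    return [first <= i < last for i in range(num_frames)]


# A 3-second clip where the dog bark (or the misspoken word) sits
# between 1.0 s and 1.5 s.
mask = span_to_frame_mask(1.0, 1.5, num_frames=300)
print(sum(mask), "frames would be regenerated")
```

For noise removal, the masked audio would be paired with the original transcript. For correcting a misspoken word, it would be paired with the edited transcript instead.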

One of the interesting applications of Voicebox is voice sampling. The model can generate various speech samples from a single text sequence. This capability can be used to generate synthetic data to train other speech processing models. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models,” Meta writes.

Voicebox has limits too. Since it has been trained on audiobook data, it does not transfer well to conversational speech that is casual and contains non-verbal sounds. It also doesn’t provide full control over various attributes of the generated speech, such as voice style, tone, emotion and acoustic condition. The Meta research team is exploring techniques to overcome these limitations in the future.

Model not released

There is growing concern about the threats of AI-generated content. For example, cybercriminals recently tried to scam a woman by calling her and using an AI-generated voice to impersonate her grandson. Advanced speech synthesis systems such as Voicebox could be used for similar purposes or other nefarious deeds, such as creating fake evidence or manipulating real audio.

“As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm,” Meta wrote on its AI blog. Because of these concerns, Meta did not release the model but described its architecture and training process in the technical paper. The paper also describes a classifier model that can detect speech and audio generated by Voicebox, which is meant to mitigate the risks of the model.
