Breaking down language walls: ElevenLabs launches multilingual text-to-speech for diverse audiences

ElevenLabs, a year-old startup that is leveraging the power of machine learning for voice cloning and synthesis, today announced the expansion of its platform with a new text-to-speech model that supports 30 languages.

The expansion marks the platform’s official exit from the beta phase, making it ready to use for enterprises and individuals looking to customize their content for audiences worldwide. It comes more than a month after ElevenLabs’ $19 million Series A round, which valued the company at nearly $100 million.

“ElevenLabs was started with the dream of making all content universally accessible in any language and in any voice. With the release of Eleven Multilingual v2, we are one step closer to making this dream a reality and making human-quality AI voices available in every dialect,” Mati Staniszewski, CEO and cofounder of the company, said in a statement.

“Eventually we hope to cover even more languages and voices with the help of AI and eliminate the linguistic barriers to content,” he added.

Eleven Multilingual v2: How is it useful?

ElevenLabs offers two main voice-focused AI products – Speech Synthesis and VoiceLab.

The former is a synthesis tool that generates natural-sounding speech from text inputs. The latter is an add-on of sorts that gives users the ability to clone their own voices or generate entirely new synthetic voices (by randomly sampling vocal parameters) for use with the synthesis tool.

Once a user creates their custom voice, they can plug it into the text-to-speech tool to convert any short or long-form content of their choice into their preferred speech with minimal effort. Alternatively, they can choose from a set of premade AI voices from the company or from voices created and shared publicly by the community.

In the early days, the synthesis tool started off with a model that produced speech just in English. Later, it was expanded to Eleven Multilingual version 1, which used text inputs and AI voices to generate speech in eight languages: English, Polish, German, Spanish, French, Italian, Portuguese and Hindi.

With the release of Eleven Multilingual version 2, the offering can now synthesize speech in 22 more languages, bringing the total to 30. The additions include Korean, Dutch, Turkish, Swedish, Indonesian, Vietnamese, Filipino, Ukrainian, Greek, Czech, Finnish, Romanian, Danish, Bulgarian, Malay, Hungarian, Norwegian, Slovak, Croatian, Classical Arabic and Tamil.

The move essentially means a person could clone their voice and use it to produce speech in dozens of languages targeting different markets.

According to ElevenLabs, the user has to enter the text in the language of their choice, select the voice they want (pre-made, synthetic or cloned) and adjust a few speech parameters. The model will automatically identify the written language and use the set parameters to generate speech in it. It also maintains the selected voice’s unique characteristics across all languages, including its original accent.
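For developers, the same three steps (enter text, pick a voice, adjust a few parameters) could look roughly like the sketch below. It only builds a request and sends nothing over the network. The endpoint path, the `xi-api-key` header, the `eleven_multilingual_v2` model ID and the `stability` and `similarity_boost` settings are assumptions for illustration and are not taken from ElevenLabs' statement; the helper name `build_tts_request` is hypothetical. Check the company's API reference before relying on any of them.

```python
import json

# Assumption: base URL and endpoint layout of the ElevenLabs REST API.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text, voice_id, api_key, stability=0.5, similarity_boost=0.75):
    """Assemble a text-to-speech request without sending it (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    for name, value in (("stability", stability), ("similarity_boost", similarity_boost)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return {
        "url": f"{API_BASE}/text-to-speech/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {
            # Step 1: the text, written in the target language.
            "text": text,
            # Assumed identifier for the multilingual model.
            "model_id": "eleven_multilingual_v2",
            # Step 3: the speech parameters (assumed field names).
            "voice_settings": {
                "stability": stability,
                "similarity_boost": similarity_boost,
            },
        },
    }


# Step 2: the voice is chosen through a placeholder voice ID.
request = build_tts_request(
    "Dzień dobry, witamy w naszym podcaście.",
    voice_id="YOUR_VOICE_ID",
    api_key="YOUR_API_KEY",
)
print(json.dumps(request["json"], ensure_ascii=False, indent=2))
```

The payload carries no language field. That mirrors the company's description: the model is said to identify the written language from the text itself.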

“Our model is able to understand the relations between words and adjust delivery based on context (‘contextual’ text-to-speech). Because there are no hardcoded voice features in the model, it can robustly predict thousands of voice characteristics while creating AI voices. This means the ElevenLabs model can take the text surrounding each generated utterance into account to maintain appropriate flow, rather than generating each utterance separately, which can create voices that sound robotic,” Staniszewski told VentureBeat.

Widespread applications of text-to-speech tool

Since its launch in beta, ElevenLabs has seen interest from both enterprises and creators and claims to have registered more than a million users worldwide. The latest launch is expected to boost not only the platform's user base but also the volume of content it generates on a daily basis.

“We have a number of enterprise clients using our products and their use cases are varied: from voicing characters in video games to voicing customer service avatars, and from recording audiobooks to creating content for the visually impaired,” Staniszewski explained.

Most recently, the company collaborated with arXiv to publish all of its papers with an audio version for additional accessibility. It also partnered with Storytel to expand the options available for audiobooks, offering additional AI voices alongside human narrators. At some point in the future, the CEO expects it may also be able to make dubbing an entire movie into multiple languages completely seamless, while preserving the accents and emotions of the original actors.

More to come

As part of this mission, ElevenLabs plans to expand its products with more languages and features, including a projects tool that will make it easier for users to structure and edit their long-form content. According to Staniszewski, it will add a “Google Docs” level of simplicity to generating speech from lengthier content.

“By the end of the year, we are also planning to release a beta version of our AI dubbing tool which will allow users to instantly convert speech from one language to another, all while preserving the original speaker’s voice,” he noted.

In the AI-powered voice and speech generation space, ElevenLabs competes with players like MURF.AI, Play.ht and WellSaid Labs. According to Market US, the global market for such tools stood at $1.2 billion in 2022 and is estimated to reach nearly $5 billion by 2032, growing at a CAGR of slightly above 15.4%.
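As a rough sanity check on those market figures, the short sketch below compounds the 2022 base at the quoted growth rate and, in the other direction, derives the rate implied by the two endpoints. The function names are illustrative, and the inputs are the rounded figures quoted above, so small mismatches with the report's own numbers are expected.

```python
def project(start_value, cagr, years):
    """Compound start_value at a constant annual growth rate for `years` years."""
    return start_value * (1 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Constant annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


YEARS = 2032 - 2022  # ten compounding periods

# $1.2 billion in 2022, growing at roughly 15.4% a year.
projected_2032 = project(1.2, 0.154, YEARS)
# Rate implied by growing from $1.2 billion to $5 billion over the same span.
rate = implied_cagr(1.2, 5.0, YEARS)

print(f"Projected 2032 market size: ${projected_2032:.2f} billion")
print(f"CAGR implied by $1.2B to $5B: {rate:.1%}")
```

Both directions land close to the quoted values, so the three numbers are broadly consistent with one another once rounding is taken into account.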