WellSaid raises $10M to generate synthetic voices

WellSaid Labs, a startup developing synthetic voice technology, today announced it has raised $10 million in a series A round led by Fuse, with participation from Voyager, Qualcomm Ventures, and GoodFriends. The round, which was oversubscribed, will fund the company's R&D and help it grow its team, according to CEO Matt Hocking.

Creating natural-sounding speech from text is considered a grand challenge in the field of AI and has been a research goal for decades. Content creators and product designers have long faced tradeoffs between quality and scalability when using text-to-speech tools versus human voiceovers. But with AI, creators, product developers, and brands have the potential to power experiences with a wide variety of voice styles, accents, and languages at scale. Startups creating virtual beings, or artificial people powered by AI, have collectively raised more than $320 million in venture capital to date.

WellSaid launched in 2018 as a research project at the Allen Institute for Artificial Intelligence, a lab started by Microsoft cofounder Paul Allen with the mission of conducting pivotal AI research and engineering. WellSaid's team set out to create the most lifelike synthetic voices, with CTO Michael Petrochuk leading R&D to build the key AI.

"What started as a research project ... is now a growth-stage startup with thousands of customers in media and advertising, technology, manufacturing, defense, pharmaceuticals, healthcare, and education," Hocking told VentureBeat via email. "In terms of the fundamentals of the business, [due to the pandemic] our mid-market and enterprise customers [have] accelerated and shifted a substantial amount of their voiceover and media productions from in-person to remote locations. This added more moving pieces and quality issues to their productions."

AI-powered speech

Using WellSaid, companies can pick from a range of voice avatars and create voiceovers straight from a script, with one or many voices based on style, gender, and production type. They can edit the copy, change the pausing, switch to a different voice, and teach the platform to say terms with unique spellings and pronunciations. WellSaid also lets users share projects and files with team members and build voice avatars for branded content, creating an avatar from a real person's voice with only a few hours of recordings.

Over two years, WellSaid incrementally improved the naturalness of its synthetic voices, aiming for "human parity," according to Hocking. In a July 2019 study, the company asked participants to listen to a set of randomized recordings created by WellSaid and by human voice actors and rate them on a scale of 1 to 5, with 5 being the highest quality. The voice actors achieved an average rating of around 4.5, while WellSaid's voices earned an average of 4.282, close to their human counterparts.

The current focus for Seattle, Washington-based WellSaid, which has 12 employees, is improving the platform's handling of different text lengths and styles, as well as speeding up voice generation. The company said it takes about 4 seconds to create a 10-second audio file.

"Enterprises use WellSaid Studio to create voiceovers for training and corporate content. They choose WellSaid to optimize their workflows because of the high-quality voices available and to gain cost efficiencies," Hocking continued. "Product developers integrate [our] API to their experiences to enable voice across their user experience. They rely on the quality of the voices, scalability of the infrastructure, and real-time rendering unmatched by other providers. [As for] brands and creators, [they] use WellSaid to create their own and exclusive AI voice avatars to spec. We partner with them to design, build, host, and deploy their unique AI voices according to their needs and production specs."

WellSaid's technology and comparable offerings from Microsoft, Amazon, Resemble AI, Synthesia, Deepdub, Papercup, and others have fueled concerns around misuse and deepfakes, or synthetic media used for nefarious purposes like imitating executives during earnings calls. But Hocking said WellSaid doesn't create voice avatars without actors' permission and subscribes to the "Hippocratic Oath for AI" proposed by Microsoft executives Brad Smith and Harry Shum.

"With WellSaid, companies that might have not been ready to deploy synthetic media can now invest in the technology, as it gives them the ability to continue to produce and publish mission-critical content without sacrificing quality," Hocking said. "We are proud of what we've accomplished and grateful for the business we've built."

This latest round brings WellSaid's total raised to date to $12 million.
