Soniox taps unsupervised learning to build speech recognition systems

AI-powered speech transcription platforms are a dime a dozen in a market estimated to be worth over $1.6 billion. Deepgram and Otter.ai build voice recognition models for cloud-based real-time processing, while Verbit offers tech not unlike that of Oto, which combines intonation with acoustic data to bolster speech understanding. Amazon, Google, Facebook, and Microsoft offer their own speech transcription services.

But a new entrant launching out of beta this week claims its approach yields superior accuracy. Called Soniox, the company leverages vast amounts of unlabeled audio and text to teach its algorithms to recognize speech with accents, background noise, and "far-field" recording. In practice, Soniox says its system correctly transcribes 24% more words compared with other speech-to-text systems, achieving "super-human" recognition on "most domains of human knowledge."
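Soniox has not said how it arrived at the 24% figure. For context, a common way to score a transcript is word error rate: the number of word substitutions, insertions, and deletions needed to match a human reference, divided by the length of the reference. The sketch below is a minimal illustration of that metric, not Soniox's evaluation method, and the example sentences are invented.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance between a reference and a hypothesis transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Errors divided by the number of words in the reference."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference transcript is empty")
    return word_errors(reference, hypothesis) / n_words


# Invented example: two of six words are wrong.
reference = "the patient reports mild chest pain"
hypothesis = "the patient report mild chess pain"
errors = word_errors(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)
print(errors, round(wer, 3))
```

A lower word error rate means more words transcribed correctly, so a claim like "24% more words" only becomes comparable across vendors once the test audio and the scoring rules are the same.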

Those are bold claims, but Soniox founder and CEO Klemen Simonic says the accuracy improvements arise from the platform's unsupervised learning techniques. With unsupervised learning, an algorithm -- in Soniox's case, a speech recognition algorithm -- is fed "unknown" data for which no previously defined labels exist. The system must teach itself to classify the data, processing it to learn from its structure.
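To make the idea concrete, here is a toy sketch of unsupervised learning using k-means clustering, a standard textbook technique. It is not Soniox's algorithm, and the numbers are invented; the point is only that the code receives no labels and still separates the data into groups based on its structure.

```python
def kmeans_1d(values, k, iterations=20):
    """Group unlabeled numbers into k clusters; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members; keep it if it has none.
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
    return centers, labels


# Invented measurements with two obvious groups and no labels attached.
measurements = [0.9, 1.1, 1.0, 5.2, 4.8, 5.0]
centers, labels = kmeans_1d(measurements, k=2)
print(centers, labels)
```

The cluster numbers the code assigns carry no meaning of their own. A person still has to look at each group and decide what it represents.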

Unsupervised speech

At the advent of the modern AI era, when people realized powerful hardware and datasets could yield strong predictive results, the dominant form of machine learning fell into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data, predict outcomes, and more.

Simonic, a former Facebook researcher and engineer who helped to build the speech team at the social network, notes that supervised learning in speech-to-text is both time-consuming and expensive. Companies have to obtain tens of thousands of hours of audio and recruit human teams to manually transcribe the data. And this same process has to be repeated for each language.

"Google and Facebook have more than 50,000 hours of transcribed audio. One has to invest millions -- more like tens of millions -- of dollars into collecting transcribed data," Simonic told VentureBeat via email. "Only then one can train a speech recognition AI on the transcribed data."

A technique known as semi-supervised learning offers a potential solution. It can accept partially labeled data, and Google recently used it to obtain state-of-the-art results in speech recognition. In the absence of labels, however, unsupervised learning -- sometimes referred to as self-supervised learning -- is the only way to fill gaps in knowledge.
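One widely used semi-supervised technique is pseudo-labeling, also called self-training: a model trained on the few labeled examples labels the unlabeled ones it is confident about, then retrains on the larger set. The article does not say which technique Google or Soniox used, so the sketch below is only an illustration. The nearest-centroid classifier, the `min_margin` threshold, and the data are all assumptions chosen to keep the example small.

```python
def centroids(labeled):
    """Mean value per label, computed from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(cents, x):
    """Return (label, margin); margin is how much closer x is to the best
    centroid than to the runner-up. Assumes at least two labels."""
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - cents[second]) - abs(x - cents[best])


def self_train(labeled, unlabeled, min_margin=1.0, rounds=5):
    """Grow the labeled set with confident pseudo-labels, then retrain."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        confident, uncertain = [], []
        for x in pool:
            label, margin = predict(cents, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                uncertain.append(x)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = uncertain
    return centroids(labeled), pool


# Invented data: two labeled points and five unlabeled ones.
seed = [(1.0, "low"), (9.0, "high")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.1]
model, still_unlabeled = self_train(seed, unlabeled)
print(model, still_unlabeled)
```

The ambiguous point near the middle stays unlabeled, which is the intended behavior: adding low-confidence guesses to the training set tends to reinforce the model's own mistakes.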

According to Simonic, Soniox's self-supervised learning pipeline sources audio and text from the internet. In the first iteration of training, the company used the LibriSpeech dataset, which contains 960 hours of transcribed audiobooks.

Soniox's iterative approach continuously refines the platform's algorithms, enabling them to recognize more words as the system gains access to additional data. Currently, Soniox's vocabulary ranges from the names of people and places to the terminology of domains including education, technology, engineering, medicine, health, law, science, art, history, food, sports, and more.

"To do fine-tuning of a particular model on a particular dataset, you would need an actual transcribed audio dataset. We do not require transcribed audio data to train our speech AI. We do not do fine-tuning," Simonic said.

Dataset and infrastructure

Soniox claims to have a proprietary dataset containing over 88,000 hours of audio and 6.6 billion words of preprocessed text. By comparison, the latest speech recognition work from Facebook and Microsoft used between 13,100 and 65,000 hours of labeled and transcribed speech data. And Mozilla's Common Voice, one of the largest public annotated voice corpora, has 9,000 hours of recordings.

While relatively underexplored in the speech domain, a growing body of research demonstrates the potential of learning from unlabeled data. Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. More recently, Facebook announced SEER, an unsupervised model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks.

Soniox collects more data on a weekly basis, with the goal of expanding the range of vocabulary the platform can transcribe. However, Simonic points out that more audio and text isn't necessarily required to improve word accuracy. Soniox's algorithms can "extract" more about familiar words with multiple iterations, essentially learning to recognize particular words better than before.

AI has a well-known bias problem, and unsupervised learning doesn't eliminate the potential for bias in a system's predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Simonic says Soniox has taken care to ensure its audio data is "extremely diverse," with speakers from most countries and accents around the world represented. He admits that the data distribution across accents isn't balanced but claims the system still manages to perform "extremely well" with different speakers.

Soniox also built its own training hardware infrastructure, which runs on multiple servers housed in a colocation datacenter facility. Simonic says the company's engineering team installed and optimized the system and machine learning frameworks and wrote the inference engine from scratch.

"It is utterly important to have under control every single bit of transfer and computation when you are training AI models at large scale. You need a rather large amount of computation to do just one iteration over a dataset of more than 88,000 hours," Simonic said. "[The inferencing engine] is highly optimized and can potentially run on any hardware. This is super important for production deployment because speech recognition is computationally expensive to run compared to most other AI models and saving every bit of computation on a large volume amounts to large sums in savings -- think of millions of hours of audio and video per month."

Scaling up

After launching in beta earlier this year, Soniox is making its platform generally available. New users get five hours per month of free speech recognition, which can be used in Soniox's web or iOS app to record live audio from a microphone or upload and transcribe files. Soniox offers an unlimited number of free recognition sessions for up to 30 seconds per session, and developers can use the hours to transcribe audio through the Soniox API.

It's early days, but Soniox says it recently signed its first customer in DeepScribe, a transcription startup targeting health care. DeepScribe switched from a Google speech-to-text model because Soniox's transcriptions of doctor-patient conversations were more accurate, Simonic claims.

"To make a business, developing novel technology is not enough. Thus we developed services and products around our new speech recognition technology," Simonic said. "I expect there will be a lot more customers like DeepScribe once the word about Soniox gets around."