Deepgram raises $25 million to build custom enterprise speech recognition models

Deepgram, a Y Combinator graduate building custom speech recognition models, today announced that it raised $25 million in series B funding led by Tiger Global. CEO and cofounder Scott Stephenson says the proceeds will bolster the development of Deepgram's platform, which enables enterprises to process meetings, calls, and presentations in real time.

The voice and speech recognition tech market is anticipated to be worth $31.82 billion by 2025, driven by new applications in the banking, health care, and automotive industries. In fact, it's estimated that one in five people in the U.S. interact with a smart speaker on a daily basis and that the share of Google searches conducted by voice in the country recently surpassed 30%.

San Francisco-based Deepgram was founded in 2015 by University of Michigan physics graduate Noah Shutty and Stephenson, a Ph.D. student who formerly worked on the University of California Davis' Large Underground Xenon Detector, a large and sensitive dark matter detector, and who helped to develop the university's Davis Xenon dual-phase liquid xenon detector program. The company's platform relies on a backend that forgoes hand-engineered pipelines in favor of a mix of heuristic, statistics-based, and fully end-to-end AI processing, with hybrid models trained on PCs equipped with high-end GPUs.

"Over the past year, we have seen the speech recognition market evolve like never before," Stephenson wrote in a blog post. "When we first announced our series A in March 2020, enterprises were starting to recognize the impact a tailored approach to speech could have on their business. Yet, there was no 'race to space' moment driving companies to adopt a new solution, especially when their existing provider was working 'fine.' That quickly changed when COVID-19 hit. Companies were at an inflection point and forced to fast track digital transformation initiatives, compressing years of well-thought-out plans into mere months, and quickly transitioning teams to a remote workforce."

Each of Deepgram's models is trained from the ground up and can ingest files in formats ranging from calls and podcasts to recorded meetings and videos. The platform processes the speech, which is stored in what's called a "deep representation index" that groups sounds by phonetics as opposed to words. Customers can search for words by the way they sound; even if they're misspelled, Deepgram can often find them.
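To make the idea of searching by sound concrete, here is a minimal sketch in Python. It is not Deepgram's method: the company's index is built from audio by its own models, and its internals aren't described here. The sketch swaps in Soundex, a classic text-based phonetic code, and the names `soundex`, `build_index`, and `search` are hypothetical helpers invented for illustration. It shows only the general principle that words keyed by how they sound can be found even when the query is misspelled.

```python
from collections import defaultdict

# Standard American Soundex digit groups (consonants that sound alike).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code for a word ('' if no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so repeats across a vowel count
    return (encoded + "000")[:4]


def build_index(transcripts):
    """Map each phonetic code to the set of transcript IDs containing it.

    Assumption: transcripts is a dict of {doc_id: text}, standing in for
    already-transcribed audio.
    """
    index = defaultdict(set)
    for doc_id, text in transcripts.items():
        for word in text.split():
            key = soundex(word)
            if key:
                index[key].add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of transcripts with a word that sounds like query."""
    return sorted(index.get(soundex(query), set()))


transcripts = {
    "call-1": "Scott Stephenson joined the call",
    "call-2": "the quarterly numbers look fine",
}
index = build_index(transcripts)
matches = search(index, "Stevenson")  # misspelled query
```

In this toy example, "Stevenson" and "Stephenson" share a phonetic code, so the misspelled query still retrieves the first transcript. A production system working on raw audio would face a much harder problem than this text-only illustration suggests.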

Stephenson says that Deepgram's models pick up things like microphone noise profiles as well as background noise, audio encodings, transmission protocols, accents, valence (i.e., energy), sentiment, topics of conversation, rates of speech, product names, and languages. Moreover, he claims they can increase speech recognition accuracy by 30% compared with industry baselines while speeding up transcription 200-fold and handling thousands of simultaneous audio streams.

Deepgram's real-time streaming capability lets customers analyze and transcribe speech as words are being spoken. Meanwhile, its on-premises deployment option provides a private, deployable instance of Deepgram's product for use cases involving confidential, regulated, or otherwise sensitive audio data.

Deepgram currently has more than 60 customers, including Genesys, Memrise, Poly, Sharpen, and Observe.ai. The company grew its headcount from 9 to 95 across offices in the U.S. and the Philippines and processed more than 100 billion spoken words. Deepgram also launched a new training capability, Deepgram AutoML, to further streamline model development.

Stephenson says that Deepgram's annual recurring revenue tripled in 2020. He forecasts that it will triple again from 2020 to 2021.

"We have spent the last year investing in key capabilities across data acquisition, labeling, model training, our API and we're ready to scale. Big data and cloud computing has allowed us to collect massive amounts of customer and employee data from emails, forms, websites, apps, chat, and SMS," Stephenson said. "This structured data is what companies can currently see and use, and it's just the tip of the iceberg. ... We train our speech models to learn and adapt under complex, real-world scenarios, taking into account customers' unique vocabularies, accents, product names, and background noise. This new funding will support our efforts to deliver higher accuracy, improved reliability, real-time speeds, and massive scale at an affordable price for our customers."

Citi Ventures also participated in Deepgram's funding round announced today, along with Wing VC, SAP.io, and Nvidia Inception GPU Ventures. The round brings the startup's total raised to date to over $38.9 million.

It's worth noting that Deepgram is far from the only player in the burgeoning speech recognition market. Tech giants like Nuance, Cisco, Google, Microsoft, and Amazon offer real-time voice transcription and captioning services, as do startups like Otter. There's also Verbit, which recently raised $31 million for its human-in-the-loop AI transcription tech; Oto Systems, which in December 2019 snagged $5.3 million to improve speech recognition with intonation data; and Voicera, which has raked in over $20 million for AI that draws insights from meeting notes.