Deepgram, a Y Combinator graduate building tailored speech recognition models, today announced it has raised $12 million in Series A financing. CEO and cofounder Scott Stephenson says the proceeds will bolster the development of Deepgram’s platform, which helps enterprises process meeting, call, and presentation recordings. If all goes according to plan — if Deepgram’s scale eventually matches that of the competition — it could save organizations valuable time by spotlighting key results.
“Consumer-facing technologies like Alexa and Siri have set the stage for speech recognition,” said Stephenson. “However … pre-built speech recognition can only get you so far, and throwing resources at the problem won’t solve the issue either. At Deepgram, we’ve created an entirely different solution using end to end deep learning, resulting in a faster, much more accurate and reliable solution that truly addresses the needs of enterprise companies.”
Deepgram leverages a backend speech stack that eschews hand-engineered pipelines in favor of heuristic, statistical, and fully end-to-end AI processing, with hybrid models trained on PCs equipped with powerful graphics processing units. Each custom model is trained from the ground up and can ingest audio ranging from phone calls and podcasts to recorded meetings and videos. Deepgram processes the speech and stores it in what’s called a “deep representation index,” which groups sounds by phonetics rather than by words. Customers can search for words by the way they sound, so Deepgram can find them even when they’re misspelled.
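Deepgram’s index is proprietary, but the general idea of matching by sound rather than spelling can be illustrated with the classic Soundex encoding — a simple stand-in for demonstration purposes, not Deepgram’s actual algorithm:

```python
def soundex(word: str) -> str:
    """Encode a word by how it sounds (classic Soundex).
    Words that sound alike map to the same 4-character code,
    so a misspelled query can still match the indexed term."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w are ignored; vowels reset the previous code
            prev = code
    return (encoded + "000")[:4]

# Index words under their sound code instead of their spelling.
index = {}
for word in ["deepgram", "transcription", "meeting"]:
    index.setdefault(soundex(word), []).append(word)

# A misspelled query still resolves to the indexed term:
print(index.get(soundex("deapgramm")))  # ['deepgram']
```

A phonetics-first index like this trades exact-match precision for robustness to spelling variation, which is useful when the underlying text comes from imperfect speech transcripts.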
Stephenson says Deepgram’s models automatically adapt to things like microphone noise profiles, background noise, audio encodings, transmission protocols, accents, valence (i.e., energy), sentiment, topics of conversation, rates of speech, product names, and languages. Moreover, he claims they can increase speech recognition accuracy by 30% compared with industry baselines while transcribing 200 times faster and handling thousands of simultaneous audio streams.
Soon, the models will become even more capable with the launch of two new features: real-time streaming and on-premises deployment. Real-time streaming will let customers analyze and transcribe speech as words are being spoken, while on-premises deployment will provide a private, deployable instance of Deepgram’s product for use cases involving confidential, regulated, or otherwise sensitive audio data.
Deepgram is far from the only player in a speech recognition market that’s anticipated to be worth $21.5 billion by 2024, according to MarketsandMarkets. Tech giants like Nuance, Cisco, Google, Microsoft, and Amazon offer real-time voice transcription and captioning services, as do startups like Otter. There’s also Verbit, which recently raised $31 million for its human-in-the-loop AI transcription tech; Oto, which last December snagged $5.3 million to improve speech recognition with intonation data; and Voicera, which has raked in over $20 million for AI that draws insights from meeting notes.
But according to Stephenson, Deepgram hasn’t had much trouble attracting customers. It has more than 30 currently, including Genesys, Memrise, Poly, Sharpen, and Observe.ai.
Wing VC led Deepgram’s Series A raise, which saw participation from SAP.io, Y Combinator, and Nvidia and which brings the total raised to date to over $13 million. The San Francisco-based company was founded in 2015 by University of Michigan physics graduate Noah Shutty and Stephenson, a Ph.D. student who formerly worked on the University of California, Davis’ Large Underground Xenon Detector (LUX/LZ), a large and sensitive dark matter detector, and who helped develop the college’s Davis Xenon (DaX) dual-phase liquid xenon detector program.