AI-powered voice transcription startup Verbit secures $250M

Verbit, a startup developing an AI-powered transcription platform, today announced that it secured $250 million, bringing its total capital raised to $550 million. The round -- a series E, made up of a $150 million primary investment and $100 million in secondary transactions -- was led by Third Point Ventures with participation from Sapphire Ventures, More Capital, Disruptive AI, Vertex Growth, 40North, Samsung Next, and TCP.

With the fresh capital, Verbit, which is now valued at $2 billion, plans to expand its workforce while supporting product research and development as well as customer acquisition efforts. Beyond this, CEO Tom Livne said that Verbit will pursue further mergers and acquisitions and "provide enhanced value" to its media, education, corporate, legal, and government clients.

During the pandemic, enterprises ramped up their adoption of voice technologies, including transcription, as remote videoconferencing became the norm. In a survey from Speechmatics, a little over two-thirds of companies said that they now have a voice technology strategy. While they cited accuracy and privacy as concerns, 60% without a strategy said that they'd consider one within five years -- potentially driving the speech and voice recognition market to $22 billion in value by 2022.

Livne cofounded New York-based Verbit with Eric Shellef and Kobi Ben Tzvi in 2017. Shellef previously led speech recognition at Intel's wearables group, while Ben Tzvi cofounded and served as CTO at facial recognition startup Foresight Solutions. As for Livne, who's also a member of Verbit's board, he was an early investor in counter-drone platform Convexum, which was acquired by NSO Group in 2020 for $60 million.

AI-powered transcription

Verbit’s voice transcription and captioning services aren't novel -- well-established players like Nuance, Cisco, Otter, Voicera, Microsoft, Amazon, and Google have offered rival products for years, including enterprise-focused platforms like Microsoft 365. But Verbit claims that its adaptive speech recognition tech generates transcriptions with higher accuracy than its rivals' products.

Verbit users upload audio or video to a dashboard for AI-powered processing. Then, a team edits and reviews the material -- taking into account customer-supplied notes and guidelines.

Finished transcriptions from Verbit are available for export to services like Blackboard, Vimeo, YouTube, Canvas, and Brightcove. A web frontend shows the progress of jobs and lets users edit and share files, define the access permissions for each, add inline comments, request reviews, and view usage reports.

"Verbit's in-house AI technology detects domain-specific terms, filters out background noise and echoes, and transcribes speakers regardless of accent to generate ... transcripts and captions from both live and recorded video and audio. Acoustic, linguistic, and contextual data is ... checked by our transcribers, who [incorporate] customer-supplied notes, guidelines, specific industry terms, and requirements," Livne told VentureBeat via email. "By indexing video content for web searches, Verbit [can help] companies improve SEO and increase their site traffic. [In addition, the platform can] provide audio visual translation to help global businesses with translations and to reach international audiences with their products and offerings."

The transcriber experience

Like its competition, Verbit relies on an army of crowdworkers to transcribe files. The company's roughly 35,000 freelancers and 600 professional captioners are paid in one of two ways: per audio minute or per word. While Verbit doesn't post rates on its website, a source pegs transcription pay at $0.30 per audio minute. Two years ago, transcription service Rev faced a massive backlash when it slashed minimum rates for its transcribers from $0.45 to $0.30 per audio minute.

In some cases, pay can dip below $0.30 on Verbit, according to employee reviews on Indeed. The company reportedly started paying as low as around $0.24 per audio minute last year for a standard job.

Transcription platforms also don't always have the technology in place to prevent crowdworkers from seeing disturbing content. In a piece by The Verge, crowdworkers on Rev said that they were exposed to graphic or troubling material on multiple occasions with no warning, including violent police recordings, descriptions of child abuse, and graphic medical videos.

A spokesperson told VentureBeat via email: "Currently, we employ a mix of full-time transcribers and captioners, as well as freelancers that are paid per audio minute. We’ve established a ranking system based on efficiency and accuracy to incentivize and reward freelancers with higher compensation in exchange for consistently delivering high-quality transcripts ... The company’s transcribers have a support system -- chat and forum -- that constantly relays feedback to Verbit management, and it has a bonus program to ensure proper compensation for its top performers."

The spokesperson continued: "In addition to competitive pay and opportunities for advancement, our staff of full-time transcribers and captioners are eligible to receive healthcare benefits ... Our transcriber community follows a ranking system based on tenure and number of hours worked, allowing freelancers to earn promotions to roles such as editor, reviewer, and supervisor."

On the subject of graphic content, the spokesperson said: "Verbit does not take on any business associated with violent or graphic content. For example, an adult entertainment company recently requested our services, but we chose not to accept them as a customer."

Growth year

Verbit's platform has wooed a healthy base of over 2,000 customers, bolstered by its acquisition of captioning provider VITAC earlier this year. In recent months, Verbit has pursued contracts with educational institutions like Harvard and Stanford, which have stricter accommodation standards than organizations in other sectors.

Auto captioning technologies on YouTube, Microsoft Teams, Google Meet, and similar platforms aren't beholden to the accommodation standards outlined in the Americans with Disabilities Act. In contrast, captioning must satisfy certain accuracy criteria to meet federal guidelines. A recent survey conducted by Verbit found that only 14% of schools provided captions as a default, while about 10% said that they only caption lessons when a student requests it.

Verbit also says that it'll continue to explore verticals in the insurance, financial, media, and medical industries. The company -- which currently has 470 employees, a number that it expects will grow to 750 by 2023 -- recently launched a human-in-the-loop transcription service for media outlets and inked an agreement with the nonprofit Speech to Text Institute to invest in court reporting and legal transcription.

"With six times year-over-year revenue growth and close to $100 million in annual recurring revenue, Verbit continues to expand into new verticals at a hyper-growth pace. The shift to remote work and accelerated digitization amid the pandemic has been a major catalyst ... and has further driven Verbit's rapid growth," Livne added. "In today's digital era where audio and video content is a given, and many times the main method of conveying information, these AI tools are crucial to ensure that individuals and organizations of all sizes and forms can engage with their audiences and stakeholders more efficiently and effectively."

Livne previously said that Verbit plans to file for an initial public offering in 2022.