How will OpenAI’s Whisper model impact AI applications?

Last week, OpenAI released Whisper, an open-source deep learning model for speech recognition. OpenAI’s tests on Whisper show promising results in transcribing audio not only in English, but also in several other languages.

Developers and researchers who have experimented with Whisper are also impressed with what the model can do. Perhaps equally important, however, is what Whisper’s release tells us about the shifting culture in artificial intelligence (AI) research and the kind of applications we can expect in the future.

A return to openness?

OpenAI has been widely criticized for not open-sourcing its models. GPT-3 and DALL-E, two of OpenAI’s most impressive deep learning models, are only available behind paid API services, and there is no way to download and examine them.

In contrast, Whisper was released as a pretrained, open-source model that anyone can download and run on a computing platform of their choice. The release follows a trend over the past few months toward more openness among commercial AI research labs.

In May, Meta open-sourced OPT-175B, a large language model (LLM) that matches GPT-3 in size. In July, the Hugging Face-led BigScience project released BLOOM, another open-source LLM of GPT-3 scale. And in August, Stability.ai released Stable Diffusion, an open-source image generation model that rivals OpenAI’s DALL-E.

Open-source models open new avenues for research on deep learning models and for building specialized applications.

OpenAI's Whisper embraces data diversity

One of the important characteristics of Whisper is the diversity of data used to train it. Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. A third of the training data is composed of non-English audio examples.

“Whisper can robustly transcribe English speech and perform at a state-of-the-art level with approximately 10 languages – as well as translation from those languages into English,” a spokesperson for OpenAI told VentureBeat in written comments.

While the lab’s analysis of languages other than English is not comprehensive, users who have tested the model report solid results.

Here too, Whisper follows a broader trend: data diversity has become a priority in the AI research community. BLOOM, released this year, was the first language model to support 59 languages, and Meta is working on a model that supports translation across 200 languages.

The move toward more data and language diversity should help ensure that more people can access and benefit from advances in deep learning.

Run your own model

Because Whisper is open source, developers and users can run it on the computing platform of their choice, whether that is a laptop, desktop workstation, mobile device or cloud server. OpenAI released five sizes of Whisper, each trading accuracy for speed, with the smallest model being approximately 60 times faster than the largest.
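For readers who want to try this, the following is a minimal Python sketch of running Whisper locally. It assumes the openai-whisper package and ffmpeg are installed. The size names are the five published with the release; the choose_size helper and its linear speed-preference mapping are hypothetical, included only to illustrate the accuracy-for-speed trade-off and not based on any measured benchmark.

```python
# The five size names published with the Whisper release, smallest to largest.
MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]


def choose_size(prefer_speed: float) -> str:
    """Map a preference from 0.0 (most accurate) to 1.0 (fastest) to a size name.

    Hypothetical helper: the linear mapping is an illustration, not a benchmark.
    """
    if not 0.0 <= prefer_speed <= 1.0:
        raise ValueError("prefer_speed must be between 0.0 and 1.0")
    # 1.0 maps to index 0 ("tiny"); 0.0 maps to the last index ("large").
    index = round((1.0 - prefer_speed) * (len(MODEL_SIZES) - 1))
    return MODEL_SIZES[index]


def transcribe(path: str, size: str = "base") -> str:
    """Transcribe one audio file on the local machine and return the text."""
    # Imported lazily so the helper above works without the package installed.
    # Assumes `pip install openai-whisper` and an ffmpeg binary on the PATH.
    import whisper

    model = whisper.load_model(size)   # downloads the weights on first use
    result = model.transcribe(path)    # dict that includes a "text" entry
    return result["text"]
```

With this in place, transcribe("interview.mp3", choose_size(1.0)) would run the smallest model on a placeholder file name, while choose_size(0.0) would select the largest.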

“Since transcription using the largest Whisper model runs faster than real time on an [Nvidia] A100 [GPU], I expect there are practical use cases to run smaller models on mobile or desktop systems, once the models are properly ported to the respective environments,” the OpenAI spokesperson said. “This would allow the users to run automatic speech recognition (ASR) without the privacy concerns of uploading their voice data to the cloud, while it may drain more battery and have increased latency compared to the alternative ASR solutions.”

Developers who have tried Whisper are pleased with the opportunities it offers, and the model could challenge the cloud-based ASR services that have been the main option until now.

“At first glance, Whisper appears to be much better than other SaaS [software-as-a-service] products in accuracy,” MLOps expert Noah Gift told VentureBeat. “Since it is free and programmable, it most likely means a very significant challenge to services that only offer transcribing.”

Gift ran the model on his computer to transcribe hundreds of MP4 files ranging in length from 10 minutes to hours. For machines with Nvidia GPUs, it may be much more cost-effective to run the model locally and sync the results to the cloud, Gift said.

“Many content creators that have some programming experience who weren't initially using transcription services due to cost will immediately adopt Whisper into their workflow,” Gift said.

Gift is now using Whisper to automate transcription in his workflow. With transcripts generated automatically, he can also feed them to other open-source language models, such as text summarizers.
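A batch workflow along these lines can be sketched in a few lines of Python. This is not Gift’s actual script: the function names and the convention of writing a .txt file beside each video are assumptions made for illustration. The transcription step is passed in as a function so that Whisper, or any other engine, can be swapped in, and the saved text files are what a downstream summarizer would read.

```python
from pathlib import Path
from typing import Callable, List


def pending_videos(folder: Path) -> List[Path]:
    """Return MP4 files in `folder` that do not yet have a .txt transcript."""
    return sorted(
        video
        for video in folder.glob("*.mp4")
        if not video.with_suffix(".txt").exists()
    )


def transcribe_folder(folder: Path, transcribe_fn: Callable[[str], str]) -> int:
    """Write a transcript next to each pending video; return how many were written."""
    written = 0
    for video in pending_videos(folder):
        text = transcribe_fn(str(video))
        video.with_suffix(".txt").write_text(text.strip() + "\n", encoding="utf-8")
        written += 1
    return written


def whisper_transcriber(size: str = "base") -> Callable[[str], str]:
    """Build a transcribe function backed by a locally loaded Whisper model."""
    # Lazy import; assumes the openai-whisper package and ffmpeg are installed.
    import whisper

    model = whisper.load_model(size)  # loaded once, reused for every file
    return lambda path: model.transcribe(path)["text"]
```

Running transcribe_folder(Path("videos"), whisper_transcriber("small")) would process a hypothetical videos directory. Because finished files are skipped, the job can be interrupted and restarted without redoing earlier work.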

“Content creators from indie to major film studios can use this technology and it has the possibility of being one of the tools in a tipping point in adding AI to our everyday workflows,” Gift said. “By making transcription a commodity, now the real AI revolution can begin for those in the content space — from YouTubers, to News to Feature Film (all industries I have worked professionally in).”

Create your own applications

There are already several initiatives to make Whisper easier to use for people who don’t have the technical skills to set up and run machine learning models. An example is a joint project by journalist Peter Sterne and GitHub engineer Christina Warren to create a “free, secure, and easy-to-use transcription app for journalists” based on Whisper.

Meanwhile, open-source models like Whisper open new possibilities in the cloud. Developers are using platforms like Hugging Face to host Whisper and make it available through API calls.

“It takes a company 10 minutes to create their own transcription service powered by Whisper, and start transcribing calls or audio content even at high scale,” Jeff Boudier, growth and product manager at Hugging Face, told VentureBeat.

There are already several Whisper-based services on Hugging Face, including a YouTube transcription app.
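From the client side, calling a hosted Whisper model might look like the sketch below, which uses only the Python standard library. The endpoint URL, the bearer-token header and the {"text": ...} response shape are all assumptions for illustration; a real hosting provider’s documentation defines the actual contract.

```python
import json
import urllib.request


def build_request(endpoint: str, token: str, audio: bytes) -> urllib.request.Request:
    """Prepare a POST request that uploads raw audio bytes to a hosted model."""
    return urllib.request.Request(
        endpoint,
        data=audio,
        headers={
            "Authorization": f"Bearer {token}",          # assumed auth scheme
            "Content-Type": "application/octet-stream",  # raw bytes, not a form
        },
        method="POST",
    )


def parse_transcript(body: bytes) -> str:
    """Extract the transcript, assuming a JSON reply shaped like {"text": "..."}."""
    payload = json.loads(body)
    return payload["text"].strip()


def transcribe_remote(endpoint: str, token: str, audio_path: str) -> str:
    """Send one audio file to the endpoint and return the transcript text."""
    with open(audio_path, "rb") as handle:
        request = build_request(endpoint, token, handle.read())
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_transcript(response.read())
```

Keeping request construction and response parsing in separate functions means both can be checked without network access, and only transcribe_remote touches the service.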

Or fine-tune existing applications for your purposes

Another benefit of open-source models like Whisper is fine-tuning, the process of taking a pretrained model and optimizing it for a new application. For example, Whisper can be fine-tuned to improve ASR performance in a language that is not well supported in the current model, or to better recognize medical or technical terms. Another interesting direction could be to fine-tune the model for tasks other than ASR, such as speaker verification, sound event detection and keyword spotting.

“It could be fascinating to see where this heads,” Gift said. “For very technical verticals, a fine-tuned version could be a game changer in how they are able to communicate technical information. For example, could this be the start of a revolution in medicine as primary care physicians could have their dialogue recorded and then eventually automated into AI systems that diagnose patients?”

“We have already received feedback that you can use Whisper as a plug-and-play service to achieve better results than before,” Philipp Schmid, technical lead at Hugging Face, told VentureBeat. “Combining this with fine-tuning the model will help improve the performance even further. Especially fine-tuning for languages which were not well represented in the pretraining dataset can improve the performance significantly.”
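Whether fine-tuning actually helps is usually checked by scoring model transcripts against human-written references. The sketch below computes word error rate (WER), a standard ASR metric: the number of word-level substitutions, insertions and deletions divided by the number of words in the reference. The sources quoted here do not say which metric they used, so treating WER as the yardstick is an assumption, and the sample sentences in the usage note are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For a reference of "the patient has tachycardia", a transcript reading "the patient has tacky cardio" scores 0.5 (one substitution and one insertion over four reference words). Comparing that figure before and after fine-tuning on domain recordings gives a simple measure of improvement.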