When we think about AI and voice recognition, we typically think of one of two suboptimal scenarios. The first is your Amazon Alexa sitting at home, possibly eavesdropping on your everyday conversations and feeding advertising algorithms so you buy the right kind of lawn mower. The second scenario is clunky transcription software, auto-subtitling our videos and TV shows, often to inaccurate (and humorous) effect.

In reality, though, there are some deeply exciting developments happening in the AI voice recognition space right now. Advances in AI mean that it is now possible to create complex programs and models that can analyze and score speech. They can even do this across a number of criteria, from grammatical accuracy and vocabulary to pronunciation and clarity.

This ability to score speech effectively has transformational power in the language-learning and education spaces. Imagine a world in which a human teacher isn’t needed to correct poor pronunciation. Imagine if that were not only possible, but possible in real time. The costs saved by that kind of technological development would be immense.

Looking at the latest systems, it seems that with the right AI technology and models, any language student can in theory receive real-time feedback on how they are speaking: whether their English pronunciation is correct, and how or where it can be improved. This is similar to, but not the same as, other AI speech applications such as automatic speech recognition, in which the AI receives an audio signal and outputs the corresponding text.

An optimal system for this type of AI model requires the following five key components:

  • Audio preprocessing that handles raw audio signals coming from different platforms
  • An Artificial Neural Network (ANN) that receives a processed audio signal and produces embedded representations of the speech
  • A post-processing layer that constructs a human-readable evaluation
  • An application-composer layer that maps the evaluation to product-feature needs
  • A proprietary system that monitors the quality and performance of the production system
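
To make the division of labor concrete, the five components above can be sketched as a chain of small functions. This is a minimal illustration rather than any production design: every function name, the 60-point threshold, and the stub model are hypothetical, and the model is assumed to return one value between 0 and 1 per expected phoneme.

```python
from typing import Callable, Dict, List, Tuple

Scores = List[Tuple[str, float]]  # (phoneme, score from 0 to 100)

def preprocess(raw: List[float]) -> List[float]:
    """1. Audio preprocessing: peak-normalize samples (stand-in for real cleanup)."""
    peak = max((abs(s) for s in raw), default=0.0)
    return [s / peak for s in raw] if peak else list(raw)

def postprocess(embedding: List[float], phonemes: List[str]) -> Scores:
    """3. Post-processing: turn model outputs (assumed in [0, 1]) into 0-100 scores."""
    return [(p, round(100 * min(1.0, max(0.0, e)), 1))
            for p, e in zip(phonemes, embedding)]

def compose_feedback(scores: Scores, threshold: float = 60.0) -> List[str]:
    """4. Application composer: keep only what the product wants to show."""
    return [f"Practice /{p}/ (score {s})" for p, s in scores if s < threshold]

def run_pipeline(
    raw: List[float],
    phonemes: List[str],
    model: Callable[[List[float]], List[float]],
    monitor: Callable[[Dict[str, int]], None],
) -> List[str]:
    audio = preprocess(raw)
    embedding = model(audio)  # 2. Neural network (injected, hypothetical here)
    scores = postprocess(embedding, phonemes)
    feedback = compose_feedback(scores)
    # 5. Monitoring: report simple counters about this request
    monitor({"samples": len(audio), "flagged": len(feedback)})
    return feedback

# Usage with a stub model that "hears" a good /f/ and a weak /ɛ/
events: List[Dict[str, int]] = []
feedback = run_pipeline(
    raw=[0.1, -0.5, 0.25],
    phonemes=["f", "ɛ"],
    model=lambda audio: [0.9, 0.4],
    monitor=events.append,
)
print(feedback)
```

Passing the model and the monitor in as callables keeps each stage replaceable, which is one plausible way to let the network be swapped without touching the product-facing layers.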

For a system to provide real-time feedback, an end-to-end latency of less than one second is advisable. This means that the core neural network has only a few milliseconds to respond. That is a challenge in itself, because the model has hundreds of millions of parameters and must process an arbitrarily long audio signal.
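
The budget arithmetic behind that constraint can be sketched in a few lines. The stage names and millisecond figures below are made up purely to illustrate the subtraction; they are not measurements from any real system.

```python
from typing import Dict

def model_budget_ms(total_ms: float, other_stages_ms: Dict[str, float]) -> float:
    """Time left for the neural network once every other stage is paid for."""
    remaining = total_ms - sum(other_stages_ms.values())
    if remaining <= 0:
        raise ValueError("Other stages already exceed the end-to-end budget")
    return remaining

# Hypothetical figures, chosen only to show how quickly one second is used up
other_stages_ms = {
    "network_round_trip": 600.0,
    "audio_buffering": 250.0,
    "preprocessing": 60.0,
    "post_processing": 50.0,
    "application_composer": 30.0,
}

budget = model_budget_ms(1000.0, other_stages_ms)
print(f"Model inference budget: {budget:.0f} ms")
```

Under these assumed numbers, the model is left with a small fraction of the one-second target, which is why every other stage has to be tuned as well.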

One way to counter this is to use phonemes (distinct units of sound in a language that distinguish one word from another) as the expected output, rather than graphemes or larger language units. English has 44 phonemes: 20 vowel sounds and 24 consonant sounds.

This enables an AI system to score and give feedback on how good a user’s sounds are, or how close they are to incorrect sounds. As an example, when a learner says “fellow”, a system can give scores, ranging from 0 to 100, on the corresponding four phonemes: /f/, /ɛ/, /l/, /əʊ/. Based on these, the platform can score the two syllables: /fɛ/ and /ləʊ/. Similarly, it could score the word, then the full sentence. When pronunciation is imperfect, the system can report which sound the attempt most resembled, such as “your /ɛ/ sounded like /a/”.
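
The “fellow” example can be sketched as follows. The article does not say how phoneme scores are combined, so a plain average is assumed here; the scores, the probabilities, and both function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

def aggregate(
    phoneme_scores: List[Tuple[str, float]], syllable_sizes: List[int]
) -> Tuple[List[Tuple[str, float]], float]:
    """Roll phoneme scores up to syllables and the word (plain means, an assumption)."""
    syllable_scores = []
    start = 0
    for size in syllable_sizes:
        chunk = phoneme_scores[start:start + size]
        label = "".join(p for p, _ in chunk)
        syllable_scores.append((label, mean(s for _, s in chunk)))
        start += size
    word_score = mean(s for _, s in phoneme_scores)
    return syllable_scores, word_score

def closest_sound(expected: str, probabilities: Dict[str, float]) -> str:
    """Report which phoneme the model considered most likely for this sound."""
    heard = max(probabilities, key=probabilities.get)
    if heard == expected:
        return f"/{expected}/ sounded correct"
    return f"your /{expected}/ sounded like /{heard}/"

# Made-up scores for one attempt at "fellow": /f/, /ɛ/, /l/, /əʊ/
scores = [("f", 95.0), ("ɛ", 40.0), ("l", 90.0), ("əʊ", 85.0)]
syllables, word = aggregate(scores, syllable_sizes=[2, 2])
message = closest_sound("ɛ", {"ɛ": 0.3, "a": 0.6, "ɪ": 0.1})

print(syllables)  # [('fɛ', 67.5), ('ləʊ', 87.5)]
print(word)       # 77.5
print(message)    # your /ɛ/ sounded like /a/
```

A sentence score could be built the same way from word scores, although a real system might weight sounds or words differently.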

These kinds of systems are increasing in popularity. Looking at the language-learning AI space, companies are able to leverage pre-trained models and invest heavily in the fine-tuning processes. Arguably, the key to the fine-tuning process and model selection lies in three things: 1) uniquely curated datasets, 2) in-house knowledge about spoken English learning, and 3) engineering capabilities and deep knowledge of models’ strengths and limitations. By combining lived experience, academic research and technical expertise, AI technology can be developed that gives users immediate feedback, any time they want, on how they speak English.

In terms of deployment and production, off-the-shelf services on GCP (Google Cloud Platform) can help to minimize operational costs while ensuring scalability and stability. To keep end-to-end latency down, tuning the technical infrastructure and choosing the model carefully allow these kinds of technologies to give learners real-time feedback as they speak.

For obvious reasons, these kinds of technological developments could have transformational power in the education space. As in many other verticals, one of the chief benefits of seamless AI software is lower costs. In the modern era of remote and hybrid working, English language proficiency, not geographical location, is the main barrier to landing a job with an international company. If software can help someone become fluent in English at a much more affordable rate than human-to-human tuition, it suddenly opens a door into the global workforce. It is no overstatement to say that speech-recognition AI, and the language-learning potential it unlocks, could be the ultimate leveler for the international talent market. It’s now up to us to build it.

Thúy N Trần is CTO of Astrid.
