In just a few years, deep learning algorithms have evolved to beat the world’s best players at board games and recognize faces with the same accuracy as a human (or perhaps even better). But mastering the unique and far-reaching complexities of human language has proven to be one of AI’s toughest challenges.
Could that be about to change?
The ability of computers to effectively understand all human language would completely transform how we engage with brands, businesses, and organizations across the world. Nowadays most companies don’t have time to answer every customer question. But imagine if a company really could listen to, understand, and answer every question, at any time and on any channel? My team is already working with some of the world’s most innovative organizations and their ecosystem of technology platforms to embrace the huge opportunity to establish one-to-one customer conversations at scale. But there’s work to do.
It took until 2015 to build an algorithm that could recognize faces with an accuracy comparable to humans. Facebook’s DeepFace is 97.4% accurate, just shy of the 97.5% human performance. For reference, the FBI’s facial recognition algorithm only reaches 85% accuracy, meaning it is still wrong in more than one out of every seven cases.
The FBI algorithm was handcrafted by a team of engineers. Each feature, like the size of a nose or the relative placement of the eyes, was manually programmed. The Facebook algorithm works with learned features instead. Facebook used a special deep learning architecture called a convolutional neural network, which mimics how the different layers in our visual cortex process images. Because we don’t know exactly how we see, the connections between these layers are learned by the algorithm.
Facebook was able to pull this off because it figured out how to get two essential components of a human-level AI in place: an architecture that could learn features, and high-quality data labelled by millions of users who had tagged their friends in the photos they shared.
Language is in sight
Vision is a problem that evolution has solved in millions of different species, but language seems to be much more complex. As far as we know, we are currently the only species that communicates with a complex language.
Less than a decade ago, to understand what a text was about, AI algorithms would simply count how often certain words occurred. But this approach ignores the fact that words have synonyms and only mean something within a certain context.
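To make the limitation of word counting concrete, here is a minimal sketch in Python. The helper names `bag_of_words` and `overlap` and the example sentences are made up for illustration; they do not come from any particular system.

```python
from collections import Counter
import re


def bag_of_words(text: str) -> Counter:
    """Count how often each lowercase word occurs, ignoring order and context."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def overlap(a: Counter, b: Counter) -> int:
    """Number of word occurrences that two texts have in common."""
    return sum((a & b).values())


doc_a = bag_of_words("The film was great")
doc_b = bag_of_words("The movie was excellent")  # same meaning, different words
doc_c = bag_of_words("The film was not great")   # opposite meaning, same words

shared_ab = overlap(doc_a, doc_b)  # only "the" and "was" match
shared_ac = overlap(doc_a, doc_c)  # every word of doc_a matches
```

By this measure the two sentences that mean the same thing share only two words, while the sentence with the opposite meaning shares four. Pure counting sees neither the synonyms nor the effect of "not".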
In 2013, Tomas Mikolov and his team at Google discovered how to create an architecture that is able to learn the meaning of words. Their word2vec algorithm mapped synonyms on top of each other. It was able to model aspects of meaning such as size, gender, and speed, and even learn functional relations such as countries and their capitals.
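The country-and-capital relation can be pictured as simple arithmetic on word vectors. The sketch below uses tiny hand-picked vectors, not vectors learned by word2vec, and the dimension meanings are an assumption made purely for illustration. A real model learns hundreds of dimensions from large amounts of text.

```python
import math

# Hypothetical 4-dimensional vectors, hand-picked for illustration only.
# Dimensions loosely stand for (capital-ness, France, Italy, Japan).
vectors = {
    "france": [0.0, 1.0, 0.0, 0.0],
    "paris":  [1.0, 1.0, 0.0, 0.0],
    "italy":  [0.0, 0.0, 1.0, 0.0],
    "rome":   [1.0, 0.0, 1.0, 0.0],
    "japan":  [0.0, 0.0, 0.0, 1.0],
    "tokyo":  [1.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(target, exclude):
    """Return the known word whose vector is most similar to the target."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "paris is to france as ? is to italy": paris - france + italy
target = [p - f + i for p, f, i in
          zip(vectors["paris"], vectors["france"], vectors["italy"])]
answer = nearest(target, exclude={"paris", "france", "italy"})
```

With these toy vectors, subtracting "france" from "paris" leaves only the capital-ness direction, and adding "italy" lands exactly on "rome".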
The missing piece, however, was context. The real breakthrough in this field came in 2018, when Google introduced the BERT model. Jacob Devlin and team recycled an architecture typically used for machine translation and made it learn the meaning of a word in relation to its context in a sentence.
By teaching the model to fill in missing words in Wikipedia articles, the team was able to embed language structure in the BERT model. With only a limited amount of high-quality labelled data, they were able to fine-tune BERT for a multitude of tasks, ranging from finding the right answer to a question to really understanding what a sentence is about. They were the first to really nail the two essentials for language understanding: the right architecture and large amounts of high-quality data to learn from.
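The fill-in-the-blank training setup needs no human labelling, because the hidden words themselves are the answers. The following is a simplified sketch of how such training examples could be generated. The function name `mask_tokens`, the default masking rate, and the whitespace tokenization are assumptions for illustration, and the actual BERT training procedure has further details that are omitted here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns the masked token list and a dict mapping each hidden
    position to the original word the model would have to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))  # always hide at least one
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    answers = {}
    for pos in positions:
        answers[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, answers


sentence = "the capital of france is paris".split()
masked, answers = mask_tokens(sentence)
print(" ".join(masked))  # the sentence with one word replaced by [MASK]
print(answers)           # the position and word the model must recover
```

Every sentence in a corpus yields a training example this way, which is why plain text such as Wikipedia articles can serve as training data without anyone tagging it by hand.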
In 2019, researchers at Facebook were able to take this even further. They trained a BERT-like model on more than 100 languages simultaneously. The model was able to learn tasks in one language, for example, English, and use it for the same task in any of the other languages, such as Arabic, Chinese, and Hindi. This language-agnostic model has the same performance as BERT on the language it is trained on and there is only a limited impact going from one language to another.
All these techniques are really impressive in their own right, but in early 2020 researchers at Google were finally able to beat human performance on a broad range of language understanding tasks. Google pushed the BERT architecture to its limits by training a much larger network on even more data. This so-called T5 model now performs better than humans at labelling sentences and finding the right answer to a question. The language-agnostic mT5 model released in October 2020 is almost as good as bilingual humans at switching from one language to another, but it can do so with 100+ languages at once. And the trillion-parameter model Google announced this week makes the model even bigger and more powerful.
Imagine chatbots that can understand what you write in any imaginable language. They will actually comprehend the context and remember past conversations. All the while you will get answers that are no longer generic but really to the point.
Search engines will be able to understand any question you have. They will produce proper answers and you won’t even have to use the right keywords. You will get an AI colleague that knows all there is to know about your company’s procedures. No more questions from customers that are just a Google search away if you know the right lingo. And colleagues that wonder why people didn’t read all the company documents will become a thing of the past.
A new era of databases will emerge. Say goodbye to the tedious work of structuring your data. Any memo, email, report, etc., will be automatically interpreted, stored, and indexed. You’ll no longer need your IT department to run queries to create a report. Just tell the database what you want to know.
And that is just the tip of the iceberg. Any procedure that currently still requires a human to understand language is now on the verge of being disrupted or automated.
Talk isn’t cheap
There is a catch here. Why aren’t we seeing these algorithms everywhere? Training the T5 algorithm costs around $1.3 million in cloud compute. Luckily the researchers at Google were kind enough to share these models. But you can’t use these models for anything specific without fine-tuning them on the task at hand. So even this is a costly affair. And once you have optimized these models for your specific problem, they still require a lot of compute power and a long time to execute.
Over time, as companies invest in these fine-tuning efforts, we will see limited applications emerge. And, if we trust Moore’s Law, we could see more complex applications in about five years. But new models will also emerge to outperform the T5 algorithm.
At the beginning of 2021, we are within touching distance of AI’s most significant breakthrough and the endless possibilities it will unlock.
Pieter Buteneers is Director of Engineering in Machine Learning and AI at Sinch.