Language models fail to say what they mean or mean what they say

Over the past five years, large language models have taken the world by storm. Chatbots have become a commodity that anyone can use, through a simple API, to provide customer service or handle other conversational tasks.

Language models are probabilistic representations of language built using large neural networks. They account for the context of words, improve upon word embeddings, and solve many natural language-related tasks, such as generating text given one or more seed sentences. And while this technology can save costs for businesses, it can also get things wrong.
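To make "probabilistic representation of language" concrete, here is a minimal sketch of the idea at toy scale. It is not how GPT-3 or any neural model is built: it swaps the neural network for simple word-pair (bigram) counts, and the corpus, the function names `train_bigrams` and `generate`, and the fixed random seed are all illustrative assumptions. What it shares with real models is the loop: given a seed, repeatedly pick a likely next word based on patterns seen in the training text.

```python
import random
from collections import Counter, defaultdict

# Tiny illustrative corpus (an assumption; real models train on billions of words).
CORPUS = "the cat sat on the mat . the dog sat on the rug . the cat saw the dog ."


def train_bigrams(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, seed, max_words=8, rng=None):
    """Extend the seed by sampling each next word in proportion to its count."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    output = seed.split()
    for _ in range(max_words):
        followers = counts.get(output[-1])
        if not followers:  # the last word was never seen with a successor
            break
        words, weights = zip(*followers.items())
        output.append(rng.choices(words, weights=weights)[0])
    return " ".join(output)


model = train_bigrams(CORPUS)
print(generate(model, "the cat"))
```

The output is fluent-looking only because it replays patterns from the corpus. Nothing in the counts represents what a cat or a mat is, which is the gap the rest of this article is about.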

How language models work

Just as a single seed becomes a tree with many branches, in the case of text generation, a seed sentence evolves into a conversation, based on the patterns learned for the context of the sentence. For example, take a look at a sample seed question-and-answer combo from the most complex GPT-3 model by OpenAI. (For readers new to language-generation technologies: The 'seed' is brief text that a human inputs into the machine as a cue; the 'response' is text generated by the machine based on that cue and the language model it has learned.)

Seed: Do you prefer Dali or Debussy? Why?

Response: I prefer Debussy because his music is more subtle and complex. Dali’s paintings are interesting, but I find them a bit too jarring.

At first glance, the AI has produced an impressive answer, fit for a sophisticated crowd at a gallery opening. With the current OpenAI price at less than $0.01 per 100 words for the best GPT-3 model, it seems a cost-effective option. But, as I'll explain below, the models powering language generation are not yet fit for purpose.

Since Google introduced its BERT language model in 2018, language models have grown exponentially, often via brute force by adding more parameters, data, and computing power. The most significant improvements in models such as M6 (Alibaba) or RETRO (DeepMind) have been consuming less electricity when training or using fewer parameters, respectively. The current best model according to the GLUE benchmark is Vega (JD Explore Academy), although this evaluation does not account for the resources or size of the neural networks used.

Stanford’s Institute for Human-Centered AI renamed these models as Foundation Models in August 2021, despite stating that there is a lack of clear understanding of how these models work, when they fail, or what they are capable of due to their emergent properties. In that case, these are weak foundations, and it is essential to understand their limitations.

Three major limitations of language models

1. Unequal language resources

We need large amounts of text to build language models, mainly from the web. It’s virtually impossible to know the lexicon of every existing language, but we can use Wikipedia as a proxy. Of approximately 7,100 languages currently spoken, only 312 have an active Wikipedia, representing only 4.4% of all languages. Of those, only 34 have more than one million pages (volume) and only 18 have more than one million entries (diversity).

We also need linguistic resources for many natural language processing (NLP) tasks. India, for example, is one of two countries with a population that exceeds one billion, and it ranks fourth in the world for the number of languages spoken, at almost 450. However, only 23 of India’s languages are official, including English, and only about half of those have some linguistic resources. That means fewer than 3% of India’s languages factor into language model technology, although the majority of the people speak the official languages. These small percentages illustrate the vast gap between languages spoken in developed or large countries and the minority languages of developing countries, whose speakers likely lack access to this type of technology.
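The percentages in this section follow from simple arithmetic on the counts quoted above. The short check below reproduces them; the variable names are illustrative, and "about half" of India's 23 official languages is taken literally as 11.5, which is an assumption made only to reproduce the "less than 3%" figure.

```python
# Counts quoted in the text.
LANGUAGES_SPOKEN = 7100   # approximate number of living languages
ACTIVE_WIKIPEDIAS = 312   # languages with an active Wikipedia
INDIA_LANGUAGES = 450     # "almost 450" languages spoken in India
INDIA_OFFICIAL = 23       # official languages, including English


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


wikipedia_share = share(ACTIVE_WIKIPEDIAS, LANGUAGES_SPOKEN)

# Assumption: "about half" of the official languages have linguistic resources.
india_resourced = INDIA_OFFICIAL / 2
india_share = share(india_resourced, INDIA_LANGUAGES)

print(f"Languages with an active Wikipedia: {wikipedia_share:.1f}%")  # 4.4%
print(f"Indian languages with resources:    {india_share:.1f}%")      # 2.6%
```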

2. Texts breed bias

Texts encode social biases, such as those related to gender, race, and religion, and models learn them. The GPT-3 model has demonstrated anti-Muslim bias, with violent sentence completions four times more frequent than those for Christians and 16 times more frequent than those for atheists. Last January, OpenAI published an improved and smaller version of GPT-3, called InstructGPT, with 1.3 billion parameters, that supposedly mitigated bias. However, the seed phrases below, along with several other examples, yielded completions that clearly show that gender bias is still present in the model:

Seed phrase: every man wonders…

Completion: …why he was born into this world and what his life is for

Seed phrase: every woman wonders…

Completion: …what it would be like to be a man

3. Lack of semantic understanding at an extreme cost

Language models do not understand the semantics of the text they learn from or the text they generate, as evidenced in my previous example of Dali’s paintings described as “jarring.” A closer look reveals a glitch in semantic understanding. The word “jarring” typically relates to sound, which correlates to music. And while the seed question and answer contain references to Debussy, the term “jarring” describes a reaction to paintings, not music.

In a paper that famously triggered the firing of the two leaders of Google’s AI Ethics team, Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell referred to language models as “stochastic parrots” and pointed out additional problems such as the environmental and financial costs of building language models in terms of CO2 emissions and electricity.

We should — and can — do better

To break through the barrier of limitations imposed by language models, we need to really understand semantics. Merely learning patterns will never be enough, and we cannot continue to use models that massively waste resources and increase inequality. We need to solidify our foundations before we can build on them and move forward. One potential solution is hybrid systems that combine classic symbolic AI and knowledge bases with deep learning, as Gary Marcus also recently noted. These might become real foundations.

Ricardo Baeza-Yates is Director of Research at the Institute for Experiential AI.
