LLMs have not learned our language — we’re trying to learn theirs

Large language models (LLMs) are currently a red-hot area of research in the artificial intelligence (AI) community. Scientific progress in LLMs in the past couple of years has been nothing short of impressive, and at the same time, there is growing interest and momentum to create platforms and products powered by LLMs.

However, in tandem with advances in the field, the shortcomings of large language models have also become evident. Many experts agree that no matter how large LLMs and their training datasets become, they will never be able to learn and understand our language as we do.

Interestingly, these limits have given rise to a trend of research focused on studying the knowledge and behavior of LLMs. In other words, we are learning the language of LLMs and discovering ways to better communicate with them.

What LLMs can’t learn

LLMs are neural networks that have been trained on hundreds of gigabytes of text gathered from the web. During training, the network is fed text excerpts in which some parts have been masked. The neural network tries to guess the missing parts and compares its predictions with the actual text. By doing this repeatedly and gradually adjusting its parameters, the neural network creates a mathematical model of how words appear next to each other and in sequences.

After being trained, the LLM can receive a prompt and predict the words that come after it. The larger the neural network, the more learning capacity the LLM has. The larger the dataset (given that it contains well-curated and high-quality text), the greater the chance that the model will be exposed to different word sequences and the more accurate it becomes in generating text.
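To make the idea of "a mathematical model of how words appear next to each other" concrete, here is a deliberately tiny sketch. It is not a neural network and not how any real LLM is built: it simply counts which word follows which in a made-up three-sentence corpus, then uses those counts to predict the next word of a prompt. The function names and the corpus are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follower_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follower_counts[current_word][next_word] += 1
    return follower_counts


def predict_next_word(model, prompt):
    """Return the most frequent follower of the prompt's last word, or None."""
    words = prompt.lower().split()
    if not words:
        return None
    last_word = words[-1]
    if last_word not in model:
        return None  # the model has never seen anything follow this word
    return model[last_word].most_common(1)[0][0]


# Made-up training text, standing in for "text gathered from the web".
toy_corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram_model(toy_corpus)
print(predict_next_word(model, "the dog sat"))  # prints "on"
```

A real LLM replaces the count table with billions of learned parameters and looks at far more than the last word, but the sketch shows why a larger and more varied corpus helps: a word sequence the model has never seen gives it nothing to go on.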

However, human language is about much more than just text. In fact, language is a compressed way to transmit information from one brain to another. Our conversations often omit shared knowledge, such as visual and audible information, physical experience of the world, past conversations, our understanding of the behavior of people and objects, social constructs and norms, and much more.

As Yann LeCun, VP and chief AI scientist at Meta and award-winning deep learning pioneer, and Jacob Browning, a post-doctoral associate in the NYU Computer Science Department, wrote in a recent article, “A system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.”

The two scientists note, however, that LLMs “will undoubtedly seem to approximate [human intelligence] if we stick to the surface. And, in many cases, the surface is enough.”

The key is to understand how close this approximation is to reality, and how to make sure LLMs are responding in the way we expect them to. Here are some directions of research that are shaping this corner of the widening LLM landscape.

Teaching LLMs to express uncertainty

In most cases, humans know the limits of their knowledge (even if they don’t directly admit it). They can express uncertainty and doubt and let their interlocutors know how confident they are in the knowledge they are passing on. LLMs, on the other hand, always have a ready answer for any prompt, even if their output doesn’t make sense. Neural networks usually provide numerical values that represent the probability that a certain prediction is correct. But for language models, these probability scores do not represent the LLM’s confidence in the reliability of its response to a prompt.
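The gap between a probability score and real confidence is easier to see with a small sketch. Language models typically turn raw scores for candidate next tokens into probabilities with a softmax function. The token names and scores below are hypothetical, picked only to show the mechanics.

```python
import math


def softmax(raw_scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(score - highest) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for three candidate next tokens.
candidate_tokens = ["Paris", "London", "Rome"]
raw_scores = [2.0, 1.0, 0.5]

token_probabilities = dict(zip(candidate_tokens, softmax(raw_scores)))
most_likely_token = max(token_probabilities, key=token_probabilities.get)
print(most_likely_token, round(token_probabilities[most_likely_token], 2))
```

The resulting number only says how likely a token is to come next given the text so far. It says nothing about whether the finished answer is true, which is why it cannot stand in for the model's confidence in its response.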

A recent paper by researchers at OpenAI and the University of Oxford shows how this shortcoming can be remedied by teaching LLMs “to express their uncertainty in words.”

They show that LLMs can be fine-tuned to express epistemic uncertainty using natural language, which they describe as “verbalized probability.” This is an important direction of development, especially in applications where users want to turn LLM output into actions.
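As a rough illustration of what "verbalized probability" could look like in practice, the sketch below maps a numeric confidence to a phrase and measures how far stated confidence is from actual accuracy. The cut-off values, the phrases, and the sample records are all assumptions made for this example. They are not taken from the paper, which fine-tunes the model itself to produce such statements.

```python
def verbalize_confidence(probability):
    """Map a probability in [0, 1] to a confidence phrase (illustrative cut-offs)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < 0.2:
        return "very low"
    if probability < 0.4:
        return "low"
    if probability < 0.6:
        return "medium"
    if probability < 0.8:
        return "high"
    return "very high"


def calibration_gap(records):
    """Absolute gap between average stated confidence and actual accuracy.

    records: list of (stated_confidence, was_correct) pairs.
    """
    if not records:
        raise ValueError("records must not be empty")
    mean_confidence = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, correct in records if correct) / len(records)
    return abs(mean_confidence - accuracy)


# Hypothetical answers: the model claimed 90% confidence four times
# and was right three times, so it was somewhat overconfident.
sample_records = [(0.9, True), (0.9, True), (0.9, False), (0.9, True)]
print(verbalize_confidence(0.9), round(calibration_gap(sample_records), 2))
```

A well-calibrated model would show a gap close to zero: when it says it is 90% sure, it should be right about nine times out of ten.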

The researchers suggest that expressing uncertainty can make language models honest. “If an honest model has a misinformed or malign internal state, then it could communicate this state to humans who can act accordingly,” they write.

Discovering emergent abilities of LLMs

Scale has been an important factor in the success of language models. As models become larger, not only does their performance improve on existing tasks, but they acquire the capacity to learn and perform new tasks.

In a new paper, researchers at Google, Stanford University, DeepMind, and the University of North Carolina at Chapel Hill have explored the “emergent abilities” of LLMs, which they define as abilities that “are not present in smaller models but are present in larger models.”

Emergence is characterized by a model performing at close to random levels on a task until it reaches a certain scale threshold, after which its performance suddenly jumps and continues to improve as the model becomes larger.
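One simple way to picture this pattern is to look for the smallest model size at which accuracy rises clearly above chance and stays there. The sketch below does that on invented numbers: the parameter counts, the accuracies, the 25% chance level (a four-option multiple-choice task) and the margin are all hypothetical, and the function is an illustration, not the method used in the paper.

```python
def find_emergence_scale(results, chance_accuracy, margin=0.05):
    """Return the smallest scale from which accuracy stays above chance + margin.

    results: list of (scale, accuracy) pairs in any order.
    Returns None if accuracy never clears the bar for good.
    """
    ordered = sorted(results)
    for index, (scale, _) in enumerate(ordered):
        remaining = ordered[index:]
        if all(acc > chance_accuracy + margin for _, acc in remaining):
            return scale
    return None


# Invented results: (number of parameters, accuracy on the task).
synthetic_results = [
    (10**8, 0.25),
    (10**9, 0.26),
    (10**10, 0.24),
    (10**11, 0.55),
    (10**12, 0.70),
]
print(find_emergence_scale(synthetic_results, chance_accuracy=0.25))
```

On these made-up numbers the three smallest models hover around chance, and the jump appears at the fourth model size.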

The paper covers emergent abilities in several popular LLM families, including GPT-3, LaMDA, Gopher, and PaLM. The study of emergent abilities is important because it provides insights into the limits of language models at different scales. It can also help find ways to improve the capabilities of smaller, less costly models.

Exploring the limits of LLMs in reasoning

Given the ability of LLMs to generate articles, write software code, and hold conversations about sentience and life, it is easy to think that they can reason and plan things like humans.

But a study by researchers at Arizona State University, Tempe, shows that LLMs do not acquire the knowledge and functions underlying tasks that require methodical thinking and planning, even when they perform well on benchmarks designed for logical, ethical and common-sense reasoning.

The study shows that what looks like planning and reasoning in LLMs is, in reality, pattern recognition abilities gained from continued exposure to the same sequence of events and decisions. This is akin to how humans acquire some skills (such as driving), where they first require careful thinking and coordination of actions and decisions but gradually become able to perform them without active thinking.

The researchers have established a new benchmark that tests reasoning abilities on tasks that stretch across long sequences and can’t be cheated through pattern-recognition tricks. The goal of the benchmark is to establish the current baseline and open new windows for developing planning and reasoning capabilities for current AI systems.

Guiding LLMs with better prompts

As the limits of LLMs become known, researchers find ways to either extend or circumvent them. In this regard, an interesting area of research is “prompt engineering,” a series of tricks that can improve the performance of language models on specific tasks. Prompt engineering guides LLMs by including solved examples or other cues in prompts.

One such technique is “chain-of-thought prompting” (CoT), which helps the model solve logical problems by providing a prompt that includes a solved example with intermediary reasoning steps. CoT prompting not only improves LLMs’ abilities to solve reasoning tasks, but it also gets them to output the steps they undergo to solve each problem. This helps researchers gain insights into LLMs’ reasoning process (or semblance of reasoning).
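A chain-of-thought prompt is ordinary text, so it can be assembled with a few lines of code. The sketch below builds one from a solved example that spells out its reasoning, followed by a new question for the model to answer. The helper name, the "Q:"/"A:" layout, and both word problems are assumptions made for illustration, and sending the prompt to a model is left out.

```python
def build_cot_prompt(solved_examples, new_question):
    """Build a few-shot prompt whose examples include their reasoning steps."""
    parts = []
    for example in solved_examples:
        parts.append(
            f"Q: {example['question']}\n"
            f"A: {example['reasoning']} The answer is {example['answer']}."
        )
    # The new question ends with an open "A:" for the model to complete.
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)


# Made-up solved example with intermediate reasoning steps.
solved_examples = [
    {
        "question": "A baker has 3 trays with 4 rolls each and sells 5 rolls. "
                    "How many rolls are left?",
        "reasoning": "3 trays of 4 rolls is 12 rolls. 12 - 5 = 7.",
        "answer": "7",
    }
]
cot_prompt = build_cot_prompt(
    solved_examples,
    "A library has 6 shelves with 10 books each and lends out 15 books. "
    "How many books remain?",
)
print(cot_prompt)
```

Because the solved example shows its working before the final answer, the model tends to imitate that format and write out its own steps for the new question.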

A more recent technique that builds on the success of CoT is “zero-shot chain-of-thought prompting,” which uses special trigger phrases such as “Let’s think step by step” to invoke reasoning in LLMs. The advantage of zero-shot CoT is that it does not require the user to craft a special prompt for each task, and although it is simple, it still works well enough in many cases.
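In code, the zero-shot variant amounts to appending the trigger phrase where the answer would begin. The following is a minimal sketch: the helper name and the "Q:"/"A:" layout are assumptions, and the sample question is made up.

```python
TRIGGER_PHRASE = "Let's think step by step."


def build_zero_shot_cot_prompt(question, trigger=TRIGGER_PHRASE):
    """Append a reasoning trigger phrase instead of hand-written solved examples."""
    return f"Q: {question.strip()}\nA: {trigger}"


zero_shot_prompt = build_zero_shot_cot_prompt(
    "A library has 6 shelves with 10 books each and lends out 15 books. "
    "How many books remain?"
)
print(zero_shot_prompt)
```

The same function works for any question, which is the appeal of the approach: no task-specific examples have to be written.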

These and similar studies show that we still have a lot to learn about LLMs, and there might be more to be discovered about the language models that have captured our fascination in the past few years.