There is a lot of excitement around the potential applications of large language models (LLMs). We’re already seeing LLMs used in a range of applications, including composing emails and generating software code.
But as interest in LLMs grows, so do concerns about their limitations, which can make it difficult to use them in many applications. These include hallucinating false facts, failing at tasks that require common sense, and consuming large amounts of energy.
Here are some of the research areas that can help address these problems and make LLMs available to more domains in the future.
One of the key problems with LLMs such as ChatGPT and GPT-3 is their tendency to “hallucinate.” These models are trained to generate text that is plausible, not grounded in real facts. This is why they can make up stuff that never happened. Since the release of ChatGPT, many users have pointed out how the model can be prodded into generating text that sounds convincing but is factually incorrect.
Knowledge retrieval
One method that can help address this problem is a class of techniques known as “knowledge retrieval.” The basic idea behind knowledge retrieval is to provide the LLM with extra context from an external knowledge source such as Wikipedia or a domain-specific knowledge base.
Google introduced “retrieval-augmented language model pre-training” (REALM) in 2020. When a user provides a prompt to the model, a “neural retriever” module uses the prompt to retrieve relevant documents from a knowledge corpus. The documents and the original prompt are then passed to the LLM, which generates the final output within the context of the knowledge documents.
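The retrieve-then-generate pattern described above can be sketched in a few lines. This is a minimal illustration, not REALM itself: the word-overlap scorer stands in for a neural retriever, and the corpus, prompt format and function names are assumptions.

```python
# Minimal sketch of retrieval-augmented generation. A real system would
# use a dense neural retriever and pass the augmented prompt to an LLM.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query (a stand-in
    for a neural retriever) and return the top-k matches."""
    query_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_augmented_prompt(prompt: str, corpus: list[str]) -> str:
    """Prepend the retrieved documents to the user's prompt so the model
    generates its answer within the context of the knowledge documents."""
    context = "\n".join(retrieve(prompt, corpus))
    return f"Context:\n{context}\n\nQuestion: {prompt}\nAnswer:"

corpus = [
    "REALM was introduced by Google in 2020.",
    "Chain-of-thought prompting elicits reasoning steps.",
    "The Eiffel Tower is in Paris.",
]
augmented = build_augmented_prompt("When was REALM introduced?", corpus)
print(augmented)
```

Because the retrieved facts sit directly in the prompt, the model is grounded in the external source instead of relying solely on what it memorized during training.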
Work on knowledge retrieval continues to make progress. Recently, AI21 Labs presented “in-context retrieval augmented language modeling,” a technique that makes it easy to implement knowledge retrieval in different black-box and open-source LLMs.
You can also see knowledge retrieval at work in You.com and the version of ChatGPT used in Bing. After receiving the prompt, the LLM first creates a search query, then retrieves documents and generates its output using those sources. It also provides links to the sources, which is very useful for verifying the information that the model produces. Knowledge retrieval is not a perfect solution and still makes mistakes. But it seems to be one step in the right direction.
Better prompt engineering techniques
Despite their impressive results, LLMs do not understand language or the world — at least not in the way that humans do. Therefore, there will always be instances where they behave unexpectedly and make mistakes that seem dumb to humans.
One way to address this challenge is “prompt engineering,” a set of techniques for crafting prompts that guide LLMs to produce more reliable output. Some prompt engineering methods involve creating “few-shot learning” examples, where you prepend your prompt with a few similar examples and the desired output. The model uses these examples as guides when producing its output. By creating datasets of few-shot examples, companies can improve the performance of LLMs without the need to retrain or fine-tune them.
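Few-shot prompting amounts to simple string construction: a handful of (input, desired output) pairs are placed ahead of the new query. The sketch below shows the idea; the sentiment-classification task and the "Input:/Output:" formatting are assumptions for illustration, not any vendor's required format.

```python
# Minimal sketch of few-shot prompt construction.

def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Prepend (input, desired output) pairs so the model can infer the
    task format before completing the new query."""
    shots = "\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{shots}\nInput: {query}\nOutput:"

examples = [
    ("The movie was fantastic", "positive"),
    ("I want a refund", "negative"),
]
prompt = build_few_shot_prompt(examples, "Best purchase I ever made")
print(prompt)
```

A company can maintain its few-shot examples as a dataset and swap them per task, which is how performance improves without retraining or fine-tuning the underlying model.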
Another interesting line of work is “chain-of-thought (CoT) prompting,” a series of prompt engineering techniques that enable the model to produce not just an answer but also the steps it uses to reach it. CoT prompting is especially useful for applications that require logical reasoning or step-by-step computation.
There are different CoT methods, including a few-shot technique that prepends the prompt with a few examples of step-by-step solutions. Another method, zero-shot CoT, uses a trigger phrase to force the LLM to produce the steps it takes to reach the result. And a more recent technique called “faithful chain-of-thought reasoning” uses multiple steps and tools to ensure that the LLM’s output is an accurate reflection of the steps it uses to reach the result.
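The two CoT variants above differ only in how the prompt is assembled. The sketch below contrasts them; the arithmetic worked example is an assumption, while the trigger phrase "Let's think step by step" is the one popularized by zero-shot CoT.

```python
# Minimal sketch contrasting few-shot CoT and zero-shot CoT prompts.

COT_EXAMPLE = (
    "Q: A pen costs $2 and a notebook costs $3. What do 2 pens and "
    "1 notebook cost?\n"
    "A: 2 pens cost 2 * $2 = $4. Adding the $3 notebook gives $7. "
    "The answer is $7.\n"
)

def few_shot_cot(question: str) -> str:
    """Prepend a worked step-by-step solution as a reasoning template."""
    return f"{COT_EXAMPLE}Q: {question}\nA:"

def zero_shot_cot(question: str) -> str:
    """Append the trigger phrase that nudges the model to show its steps."""
    return f"Q: {question}\nA: Let's think step by step."

print(zero_shot_cot("If I have 3 apples and eat one, how many remain?"))
```

Zero-shot CoT needs no curated examples, while the few-shot variant gives more control over the reasoning format the model imitates.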
Reasoning and logic are among the fundamental challenges of deep learning that might require new architectures and approaches to AI. But for the moment, better prompting techniques can help reduce the logical errors LLMs make and help troubleshoot their mistakes.
Alignment and fine-tuning techniques
Fine-tuning LLMs with application-specific datasets will improve their robustness and performance in those domains. Fine-tuning is especially useful when an LLM like GPT-3 is deployed in a specialized domain where a general-purpose model would perform poorly.
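In practice, fine-tuning starts with assembling application-specific training examples, commonly stored as JSONL prompt/completion pairs. The sketch below shows that preparation step; the records, field names and file name are assumptions for illustration, since exact formats vary by provider.

```python
# Minimal sketch of preparing a domain-specific fine-tuning dataset as
# JSONL: one JSON object per line, each a prompt/completion pair.
import json

records = [
    {"prompt": "Summarize: Q3 revenue rose 12% ...",
     "completion": "Revenue grew 12% in Q3."},
    {"prompt": "Summarize: Churn fell to 2% ...",
     "completion": "Customer churn dropped to 2%."},
]

with open("finetune_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

A file like this is then uploaded to the provider's fine-tuning endpoint or fed to an open-source training script.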
New fine-tuning techniques can further improve the accuracy of models. Of note is “reinforcement learning from human feedback” (RLHF), the technique used to train ChatGPT. In RLHF, human annotators vote on the answers of a pre-trained LLM. Their feedback is then used to train a reward system that further fine-tunes the LLM to become better aligned with user intents. RLHF worked very well for ChatGPT and is the reason that it is so much better than its predecessors in following user instructions.
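At the heart of the RLHF step described above is a reward model trained on the annotators' preferences, typically with a pairwise ranking loss. The sketch below shows that loss in isolation, with plain floats standing in for a neural reward model's scores; it is a conceptual illustration, not OpenAI's implementation.

```python
# Minimal sketch of the pairwise ranking loss used to train an RLHF
# reward model: loss = -log(sigmoid(r_chosen - r_rejected)), which is
# small when the human-preferred answer receives the higher score.
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Penalize the reward model when the rejected answer outscores
    the answer that human annotators preferred."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss is lower when the preferred answer is ranked higher.
assert reward_model_loss(2.0, 0.5) < reward_model_loss(0.5, 2.0)
```

The trained reward model then scores the LLM's outputs during a reinforcement learning phase, steering the model toward answers humans rate highly.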
The next step for the field will be for OpenAI, Microsoft and other providers of LLM platforms to create tools that enable companies to create their own RLHF pipelines and customize models for their applications.
More efficient models and techniques
One of the big problems with LLMs is their prohibitive cost. Training and running a model the size of GPT-3 and ChatGPT can be so expensive that they remain out of reach for many companies and applications.
There are several efforts to reduce the costs of LLMs. Some of them are centered around creating more efficient hardware, such as special AI processors designed for LLMs.
Another interesting direction is the development of new LLMs that can match the performance of larger models with fewer parameters. One example is LLaMA, a family of small, high-performance LLMs developed by Facebook. LLaMA models are accessible for research labs and organizations that don’t have the infrastructure to run very large models.
According to Facebook, the 13-billion parameter version of LLaMA outperforms the 175-billion parameter version of GPT-3 on major benchmarks, and the 65-billion variant matches the performance of the largest models, including the 540-billion parameter PaLM.
While LLMs have many more challenges to overcome, it will be interesting to see how these developments help make them more reliable and accessible to the developer and research community.