Google sets the bar for AI language models with PaLM

Google’s new large language model (LLM) called PaLM (Pathways Language Model) is the first outcome of Pathways, Google’s new AI architecture, which aims to handle many tasks at once, learn new tasks quickly and reflect a better understanding of the world.

PaLM is a massive undertaking with ambitious goals. Although many aspects of PaLM require further evaluation, it represents an important step forward for LLMs. The process of developing and evaluating PaLM is detailed in an arXiv publication and summarized by Google in a blog post.

Under the LLM hood

Google’s publication outlines the philosophy of Pathways at every step of the process of training PaLM. The model comes in three sizes: PaLM 8B with 8 billion parameters, PaLM 62B with 62 billion parameters and PaLM 540B with 540 billion parameters. Google created different versions in order to evaluate the trade-off between cost and value as well as the benefits of scale.

The number of parameters is important in LLMs, although more parameters don’t necessarily translate to a better-performing model. In terms of parameter count, PaLM 540B is in the same league as some of the largest LLMs available: OpenAI’s GPT-3 with 175 billion, DeepMind’s Gopher and Chinchilla with 280 billion and 70 billion respectively, Google’s own GLaM and LaMDA with 1.2 trillion and 137 billion respectively, and Microsoft and Nvidia’s Megatron-Turing NLG with 530 billion.

The first thing to consider when discussing LLMs, like any other AI model, is the efficiency of the training process. Even the Googles of the world need to answer this question: “Given a certain quantity of compute, how large of a model should I train in order to get the best possible performance?”

In 2020, OpenAI proposed scaling laws to guide the training of LLMs. In 2022, DeepMind published a paper, “Training Compute-Optimal Large Language Models,” whose authors claim that LLMs have been trained with a deeply suboptimal use of compute. Independently, Google reached similar conclusions, as detailed in PaLM’s documentation.
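
To make the compute question concrete, here is a minimal sketch of how a fixed budget could be split between model size and training data. It rests on two assumptions that do not come from Google's publication: the common approximation that training costs about 6 FLOPs per parameter per token, and the rule of thumb often drawn from DeepMind's paper of roughly 20 training tokens per parameter. The function name and the example budget are hypothetical.

```python
import math


def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training budget into (parameters, tokens).

    Assumes training cost C ~= 6 * N * D (N parameters, D tokens) and a
    fixed ratio D = tokens_per_param * N. Both are rules of thumb, not
    figures taken from the PaLM publication.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # Substituting D = r * N gives C = 6 * r * N^2, so N = sqrt(C / (6 * r)).
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    budget = 1e23  # hypothetical budget in FLOPs, chosen only for illustration
    n, d = compute_optimal_split(budget)
    print(f"parameters: {n:.3e}, tokens: {d:.3e}")
```

Under these assumptions, the model size grows only with the square root of the budget, which is why simply adding parameters without adding data can waste compute.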

PaLM’s training is state of the art on many levels. At the hardware level, PaLM 540B was trained over two TPU v4 Pods connected over a data center network (DCN) using a combination of model and data parallelism. Google used 3,072 TPU v4 chips in each Pod attached to 768 hosts, which it notes is the largest TPU configuration described to date. This allowed Google to efficiently scale training to 6,144 chips, achieving a training efficiency of 57.8% hardware FLOPs utilization, which Google claims is the highest yet achieved for LLMs at this scale.
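
The hardware figures above can be checked with a few lines of arithmetic. The sketch below derives the total chip count and the chips per host from the numbers Google reports, and shows what a utilization figure means as a ratio of achieved to peak throughput. The per-chip peak used here is an assumption (a commonly cited bf16 figure for TPU v4) and is not stated in the article, so the implied throughput is illustrative only.

```python
PODS = 2
CHIPS_PER_POD = 3072
HOSTS_PER_POD = 768
REPORTED_UTILIZATION = 0.578  # hardware FLOPs utilization reported by Google

# Assumption, not from the article: commonly cited bf16 peak for one TPU v4 chip.
ASSUMED_PEAK_FLOPS_PER_CHIP = 275e12

TOTAL_CHIPS = PODS * CHIPS_PER_POD               # chips across both Pods
CHIPS_PER_HOST = CHIPS_PER_POD // HOSTS_PER_POD  # chips attached to each host


def hardware_utilization(achieved_flops_per_sec: float, chips: int,
                         peak_flops_per_chip: float) -> float:
    """Fraction of the theoretical peak throughput that is actually achieved."""
    return achieved_flops_per_sec / (chips * peak_flops_per_chip)


if __name__ == "__main__":
    peak_total = TOTAL_CHIPS * ASSUMED_PEAK_FLOPS_PER_CHIP
    implied_sustained = REPORTED_UTILIZATION * peak_total
    print(f"total chips: {TOTAL_CHIPS}, chips per host: {CHIPS_PER_HOST}")
    print(f"implied sustained throughput: {implied_sustained:.3e} FLOP/s")
```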

PaLM uses a standard Transformer model architecture, with some customizations. Transformer is the architecture used by virtually all of today’s LLMs and although PaLM deviates from it in some ways, what is arguably more important is the composition of the training dataset used.

How to train your LLM

The dataset used to train PaLM is a mixture of filtered multilingual web pages (27%), English books (13%), multilingual Wikipedia articles (4%), English news articles (1%), GitHub source code (5%) and multilingual social media conversations (50%). This dataset is based on those used to train LaMDA and GLaM. There are a few things worth highlighting here.
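
One way to picture the mixture is as a set of sampling weights. The sketch below encodes the percentages listed above and draws sources in proportion to them. It is an illustration only: how Google actually samples from each source during training is not described here, and the dictionary keys and function name are our own.

```python
import random
from collections import Counter

# Percentages as listed in the article; the short labels are ours.
MIXTURE = {
    "social media conversations": 50,
    "filtered web pages": 27,
    "books": 13,
    "GitHub source code": 5,
    "Wikipedia": 4,
    "news": 1,
}


def sample_sources(n: int, seed: int = 0) -> list:
    """Draw n source labels with probability proportional to the mixture weights."""
    rng = random.Random(seed)  # seeded so repeated runs give the same draw
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    assert sum(MIXTURE.values()) == 100  # the listed shares cover the whole dataset
    counts = Counter(sample_sources(10_000))
    for name, count in counts.most_common():
        print(f"{name:28s} {count / 10_000:.1%}")
```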

First, it’s worth asking whether the selection of sources reflects Google’s goals. Social media conversations are by far the most prevalent source and while web pages have been selected taking their assigned quality scores into account, that doesn’t seem to be the case for social media conversations.

Web pages included in the training dataset were filtered using a classifier to assess quality, with the goal of limiting content toxicity and including professionally written content. However, Google notes, this may have disproportionately excluded casual language, code-switching (alternating between languages or dialects in speech) or dialectal diversity, and may limit PaLM’s capability to model the nondominant dialects across the English-speaking regions globally.

We hypothesize that quality scores may be harder to assign for social media conversations. The paper also argues that in order for PaLM to be able to identify toxicity as part of its general-purpose applicability, exposure to it is needed.

Second, even though multilingual sources are cited, in reality they’re still dominated by the English language. Nearly 78% of all sources are English, with German and French sources at 3.5% and 3.2% and all other sources trailing far behind.

Google notes that the language capabilities of PaLM are likely constrained by the limitations of language present in the training data and evaluation benchmarks. At the same time, PaLM yields impressive multilingual capabilities on the benchmarks Google evaluated against, the majority of which are in the English language.

Variations of PaLM were trained using one-pass or few-pass approaches, which means that the bulk of the data in the training dataset were processed as input as few times as possible. This is part of the efficiency bet for PaLM, but it also had another interesting side effect: it resulted in very little memorization, meaning that PaLM output is for the most part computed, not recited.

Doing more with less — but what for?

Google’s vision for Pathways is to “enable a single AI system to generalize across thousands or millions of tasks, to understand different types of data and to do so with remarkable efficiency.” PaLM may be an important step forward regarding efficiency, but what about its performance levels?

Google claims that PaLM shows breakthrough capabilities on numerous difficult tasks. Its blog post highlights examples in language understanding and generation, reasoning, and code-related tasks.

In language understanding, PaLM was evaluated on 29 widely used English natural language processing (NLP) tasks. PaLM 540B surpassed few-shot performance of prior LLMs on 28 of 29 tasks. In addition to English NLP tasks, PaLM also shows strong performance on multilingual NLP benchmarks, including translation, even though only 22% of the training corpus is non-English.

PaLM’s performance was also compared against that of Gopher and Chinchilla using the new Beyond the Imitation Game Benchmark (BIG-bench). Results demonstrate impressive natural language understanding and generation capabilities on tasks like distinguishing cause and effect, understanding conceptual combinations in appropriate contexts and even guessing a movie from a combination of emojis.

Of note here is the fact that PaLM 540B five-shot performs better than the average result from individuals who were asked to solve the same tasks. Google also notes that PaLM’s performance suggests that performance improvements from scale have not yet plateaued.
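
For readers unfamiliar with the term, "few-shot" means the model is shown a handful of solved examples in its prompt before the actual question, with no retraining. The sketch below assembles such a prompt. The question-and-answer template and the function name are hypothetical, and this is not the format Google used in its evaluations.

```python
def build_few_shot_prompt(examples, query, k=5):
    """Build a k-shot prompt from (question, answer) pairs plus a final question.

    The "Q:/A:" template is an illustrative assumption, not a documented format.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if len(examples) < k:
        raise ValueError("not enough examples for the requested number of shots")
    # Each solved example becomes one block; the query is left unanswered.
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples[:k]]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    solved = [(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 6)]
    print(build_few_shot_prompt(solved, "What is 7 + 7?", k=5))
```

With k set to 5, this yields the "five-shot" setting mentioned above; with k set to 0, the model sees only the question.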

As for reasoning, PaLM’s performance was evaluated in tasks that require multistep arithmetic or commonsense reasoning. The example highlighted by Google is PaLM’s capability to solve 58% of the problems in GSM8K, a benchmark of thousands of challenging grade-school level math questions.

PaLM outperforms the prior top score of 55% achieved by fine-tuning GPT-3 with a training set of 7,500 problems and combining it with an external calculator and verifier. This new score also approaches the 60% average of problems solved by 9- to 12-year-olds — the target audience for the question set.

Google’s results for PaLM 540B show strong performance across coding tasks and natural language tasks in a single model, even though it has only 5% code in the pre-training dataset. Google notes that PaLM’s few-shot performance is especially remarkable because it is on par with the fine-tuned Codex while using 50 times less Python code for training.

To summarize, it seems that PaLM can do more with less — i.e., achieve comparable or better performance to existing state-of-the-art LLMs, while needing fewer resources and less customization than they do.

Aiming higher with AI ethics and human-level intelligence

The fact that this is a gigantic undertaking is clear from Google’s publication detailing the new technology. Its size, level of detail, and mention of a team of nearly 70 professionals involved in the effort speak volumes.

Google also includes sections on “Representational Bias Analysis” and “Ethical Considerations” in its publication. It promotes analyzing and documenting potential undesirable risks through transparent artifacts such as model cards and datasheets, which also include information on intended use and testing.

It’s hard to offer prognostications as to what that all means on a practical level for the rest of the world at this point. Being able to create LLMs in a more efficient way is a good thing — to the extent that they are created at all.

However, we’re not aware of plans to share PaLM at this point and the TPU infrastructure used to train it is Google-specific. That means the know-how and techniques behind it may not transfer directly to other LLM builders.

Unlike GPT-3, which OpenAI, together with Microsoft, makes commercially available via an API, Google’s GLaM, LaMDA and PaLM have no similar programs or plans that we’re aware of. Google’s BERT, one of the first LLMs, is open source and has given birth to many variations, in addition to powering the latest incarnation of Google Search. We can hypothesize that PaLM may eventually get there, too.

As to the pie-in-the-sky goal of human-level intelligence, opinions vary. Google notes in its publication that performance improvements from scale haven’t yet plateaued. In other areas where deep learning is applied, however, a plateau in performance seems to have been reached.

Recently, Blaise Aguera y Arcas, the head of Google’s AI group in Seattle, argued that “statistics do amount to understanding,” citing a few exchanges with LaMDA as evidence. It did not take long for critics to point out weaknesses in that claim. If anything, we expect PaLM to fuel the ongoing debate among AI professionals and technical decision makers.
