Large language models aren't always more complex

Language models such as OpenAI's GPT-3, which leverage AI techniques and large amounts of data to learn skills like writing text, have received an increasing amount of attention from the enterprise in recent years. From a qualitative standpoint, the results are good -- GPT-3 and models inspired by it can write emails, summarize text, and even generate code for deep learning in Python. But some experts aren't convinced that the size of these models -- and of their training datasets -- corresponds to performance.

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says when it comes to natural language, it's an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

"The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets," Antoniak told VentureBeat in a previous interview. "These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate."

Parameter count

Conventional wisdom once held that the more parameters a model had, the more complex tasks it could accomplish. In machine learning, parameters are internal configuration variables that a model uses when making predictions, and their values essentially define the model's skill on a problem.
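To make the idea of a parameter count concrete, here is a minimal sketch that counts the weights and biases in a small fully connected network. The layer sizes are arbitrary assumptions chosen for illustration and have nothing to do with GPT-3 or FLAN, whose architectures are far more elaborate.

```python
def dense_layer_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes: list[int]) -> int:
    """Sum the parameters of every consecutive pair of layers."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical toy network: 784 inputs, one hidden layer of 128 units, 10 outputs.
sizes = [784, 128, 10]
total = mlp_params(sizes)
print(f"{total:,} parameters")
```

Each of these values is adjusted during training, which is why the count is often treated as a rough proxy for a model's capacity.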

But a growing body of research casts doubt on this notion. This week, a team of Google researchers published a study claiming that a model far smaller than GPT-3 -- fine-tuned language net (FLAN) -- bests GPT-3 "by a large margin" on a number of challenging benchmarks. FLAN, which has 137 billion parameters compared with GPT-3's 175 billion, outperformed zero-shot GPT-3 on 19 of the 25 tasks the researchers tested it on and even surpassed few-shot GPT-3 on 10 tasks.

FLAN differs from GPT-3 in that it's fine-tuned on 60 natural language processing tasks expressed via instructions like "Is the sentiment of this movie review positive or negative?" and "Translate 'how are you' into Chinese." According to the researchers, this "instruction tuning" improves the model's ability to respond to natural language prompts by "teaching" it to perform tasks described via the instructions.
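As a rough illustration of what "expressed via instructions" can mean in practice, the sketch below rewrites labeled examples as instruction prompts paired with target outputs. The template wording, the `TEMPLATES` table, and the helper names are hypothetical and are not taken from the Google paper; the sketch covers only the data-formatting step, not the fine-tuning itself.

```python
# Hypothetical instruction templates; the wording used by the researchers may differ.
TEMPLATES = {
    "sentiment": "Is the sentiment of this movie review positive or negative?\n\n{text}",
    "translation": "Translate '{text}' into {language}.",
}


def to_instruction(task: str, **fields: str) -> str:
    """Render one raw example as a natural language instruction."""
    return TEMPLATES[task].format(**fields)


def build_tuning_pairs(examples):
    """Turn (task, fields, target) triples into (prompt, target) pairs."""
    return [
        (to_instruction(task, **fields), target)
        for task, fields, target in examples
    ]


# Made-up examples for illustration only.
examples = [
    ("sentiment", {"text": "A slow, joyless two hours."}, "negative"),
    ("translation", {"text": "how are you", "language": "Chinese"}, "你好吗"),
]
pairs = build_tuning_pairs(examples)
for prompt, target in pairs:
    print(repr(prompt), "->", target)
```

A model fine-tuned on many such pairs across different tasks sees the task description as part of its input, which is the intuition behind asking it to follow instructions for tasks it was never trained on.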

After training FLAN on a collection of web pages, programming languages, dialogs, and Wikipedia articles, the researchers found that the model could learn to follow instructions for tasks it hadn't been explicitly trained to do. Despite the fact that the training data wasn't as "clean" as GPT-3's training set, FLAN still managed to surpass GPT-3 on tasks like answering questions and summarizing long stories.

"The performance of FLAN compares favorably against both zero-shot and few-shot GPT-3, signaling the potential ability for models at scale to follow instructions," the researchers wrote. "We hope that our paper will spur further research on zero-shot learning and using labeled data to improve language models."

Dataset difficulties

As alluded to in the Google study, the problem with large language models may lie in the data used to train them -- and in common training techniques. For example, scientists at the Institute for Artificial Intelligence at the Medical University of Vienna, Austria, found that GPT-3 underperforms in domains like biomedicine compared with smaller, less architecturally complex but carefully fine-tuned models. Even when pretrained on biomedical data, large language models struggle to answer questions, classify text, and identify relationships on par with highly tuned models "orders of magnitude" smaller, according to the researchers.

"Large language models [can't] achieve performance scores remotely competitive with those of a language model fine-tuned on the whole training data," the Medical University of Vienna researchers wrote. "The experimental results suggest that, in the biomedical natural language processing domain, there is still much room for development of multitask language models that can effectively transfer knowledge to new tasks where a small amount of training data is available."

It could come down to data quality. A separate paper by Leo Gao, data scientist at the community-driven project EleutherAI, implies that the way data in a training dataset is curated can significantly impact the performance of large language models. While it's widely believed that using a classifier to filter data from "low-quality sources" like Common Crawl improves training data quality, over-filtering can lead to a decrease in GPT-like language model performance. By optimizing too strongly for the classifier's score, the data that's retained begins to become biased in a way that satisfies the classifier, producing a less rich, diverse dataset.

"While intuitively it may seem like the more data is discarded the higher quality the remaining data will be, we find that this is not always the case with shallow classifier-based filtering. Instead, we find that filtering improves downstream task performance up to a point, but then decreases performance again as the filtering becomes too aggressive," Gao wrote. "[We] speculate that this is due to Goodhart's law, as the misalignment between proxy and true objective becomes more significant with increased optimization pressure."
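The narrowing effect described here can be sketched with a toy example. The code below applies a hard score threshold to a handful of made-up documents and reports how many distinct sources survive. The `Doc` type, the scores, the sources, and the hard-threshold rule are all assumptions for illustration; this is not the filtering procedure from Gao's paper, and it does not measure downstream model performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    source: str   # hypothetical label for where the text came from
    score: float  # hypothetical quality-classifier score in [0, 1]


def filter_by_score(docs, threshold):
    """Keep only documents scored at or above the threshold."""
    return [d for d in docs if d.score >= threshold]


def distinct_sources(docs):
    """Crude diversity proxy: number of different sources that survive."""
    return len({d.source for d in docs})


# Made-up corpus: the classifier is assumed to favor encyclopedia-like text.
corpus = [
    Doc("encyclopedia", 0.95), Doc("encyclopedia", 0.91),
    Doc("news", 0.80), Doc("forum", 0.55),
    Doc("recipes", 0.45), Doc("spam", 0.05),
]

for threshold in (0.1, 0.5, 0.9):
    kept = filter_by_score(corpus, threshold)
    print(f"threshold={threshold}: kept {len(kept)} docs "
          f"from {distinct_sources(kept)} sources")
```

A light threshold removes only the spam, while the most aggressive one leaves a corpus made up entirely of the kind of text the classifier was built to prefer, which mirrors the loss of richness and diversity the paper warns about.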

Looking ahead

Smaller, more carefully tuned models could solve some of the other problems associated with large language models, like environmental impact. In June 2019, researchers at the University of Massachusetts at Amherst released a report estimating that training and searching a certain model produces roughly 626,000 pounds of carbon dioxide emissions, equivalent to nearly 5 times the lifetime emissions of the average U.S. car.

GPT-3 used 1,287 megawatt-hours of energy during training and produced 552 metric tons of carbon dioxide emissions, a Google study found. By contrast, FLAN used 451 megawatt-hours and produced 26 metric tons of carbon dioxide.
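For a sense of scale, the figures quoted above can be compared with a few lines of arithmetic. This sketch uses only the numbers reported in this article; the variable names are illustrative, and the two ratios do not depend on the units in which the figures are stated.

```python
LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kilograms

# Figures as quoted in the article.
gpt3_energy, flan_energy = 1287, 451   # training energy, same unit for both models
gpt3_co2_t, flan_co2_t = 552, 26       # metric tons of carbon dioxide
umass_co2_lb = 626_000                 # pounds of carbon dioxide (UMass Amherst estimate)

energy_ratio = gpt3_energy / flan_energy
co2_ratio = gpt3_co2_t / flan_co2_t
umass_co2_t = umass_co2_lb * LB_TO_KG / 1000  # pounds -> kilograms -> metric tons

print(f"GPT-3 used about {energy_ratio:.1f}x the energy of FLAN")
print(f"GPT-3 emitted about {co2_ratio:.1f}x the carbon dioxide of FLAN")
print(f"626,000 lb of carbon dioxide is about {umass_co2_t:.0f} metric tons")
```

The gap in emissions is much wider than the gap in energy, which suggests that factors other than energy consumption alone, such as how the electricity was generated, contribute to the reported difference.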

As the coauthors of a recent MIT paper wrote, training requirements will become prohibitively costly from a hardware, environmental, and monetary standpoint if the trend of large language models continues. Hitting performance targets in an economical way will require more efficient hardware, more efficient algorithms, or other improvements such that the gain is a net positive.
