How MIT is training AI language models in an era of quality data scarcity

Improving the robustness of machine learning (ML) models for natural language tasks has become a major artificial intelligence (AI) topic in recent years. Large language models (LLMs) have become one of the most active areas of AI research, driven by the rise of generative AI and by companies racing to release architectures that can create impressively readable content, even computer code.

Language models have traditionally been trained using online texts from sources such as Wikipedia, news stories, scientific papers and novels. However, in recent years, the tendency has been to train these models on increasing amounts of data in order to improve their accuracy and versatility.

But, according to a team of AI forecasters, there is a concern on the horizon: we may run out of data to train them on. Researchers from Epoch emphasize in a study that high-quality data generally used for training language models may be depleted as early as 2026. As developers create more sophisticated models with superior capabilities, they must gather more texts to train them on, and LLM researchers are now increasingly concerned about running out of quality data.

Kalyan Veeramachaneni, a principal research scientist in the MIT Laboratory for Information and Decision Systems and leader of the lab’s Data-to-AI group, may have found the solution. In a paper on Rewrite and Rollback (“R&R: Metric-Guided Adversarial Sentence Generation”), recently published in the findings of AACL-IJCNLP 2022, his team proposes a framework that can tweak low-quality data (from sources such as Twitter and 4chan) and turn it into high-quality data (such as text from sources with editorial filters, like Wikipedia and industry websites), increasing the amount of the right kind of data available to test and train language models.

Data scarcity looming large

Language AI researchers generally divide the data they use to train models into high-quality and low-quality data. High-quality data is generally defined as coming from sources that "have passed usefulness or quality filters" as noted by the Epoch study. In other words, it has been reviewed for editorial quality, either professionally or through peer review (in the case of scientific papers, published novels, Wikipedia, etc.) or positive engagement by many users (such as for filtered web content).

Data from low-quality categories includes unfiltered, user-generated text such as social media posts or comments on websites such as 4chan, and these instances far outnumber those rated high quality.

Training LLMs with flawed, low-quality datasets can lead to many issues.

Since ML models rely on training data to learn how to make predictions, data quality dramatically impacts the quality of the model. As a result, researchers often only train models with high-quality data, as they want their models to re-create superior language fluency. Training LLMs using high-quality text samples enables the model to understand the intricacies and complexity inherent in every language. This method has yielded outstanding results for complex language models like GPT-3.

Veeramachaneni says that aiming for more intelligent and articulate text generation can also help when training LLMs on real-life human discourse.

“Text from your average social media post, blog, etc., may not achieve this high quality, which brings down the overall quality of the training set,” Veeramachaneni told VentureBeat. “We thought, could we use existing high-quality data to train LLMs (which we now already have access to LLMs trained on high-quality data) and use those LLMs to raise the quality of the other data?”

MIT addresses current challenges in LLM development

Veeramachaneni explained that training LLMs requires massive amounts of training data and computing resources, which are only available to tech giants. This means most individual researchers must depend on the LLMs generated and released by tech giants rather than making their own.

He said that despite LLMs becoming larger and requiring more training data, the bottleneck is still computational power most of the time.

“Annotated high-quality data for downstream tasks [is] hard to obtain. Even if we design a method to create higher-quality sentences from lower-quality ones, how would we know the method did the job correctly? Asking humans to annotate data is expensive and not scalable.”

“So, R&R provides a method to use LLMs reliably to improve the quality of sentences,” he said.

Veeramachaneni believes that, in terms of model quality, current LLMs need to improve their ability to generate long documents.

“Current models can answer questions with a few sentences but cannot write a fictional story with a theme and a logical plot. Architecture improvement is necessary for LMs to handle longer text,” said Veeramachaneni. “There are also more and more concerns about the potential negative impacts of LLMs. For example, LLMs may remember personal information from the training data and leak it when generating text. This issue is hard to detect, as most LLMs are black boxes.”

Veeramachaneni and the research team in MIT’s Data-to-AI group aim to solve such issues through their Rewrite and Rollback framework.

A new method of adversarial generation from the MIT team

In the paper “R&R: Metric-Guided Adversarial Sentence Generation,” the research team proposes an adversarial framework that can generate high-quality text data by optimizing a critique score that combines fluency, similarity and misclassification metrics. R&R generates high-quality adversarial examples by capturing text data from different sources and rephrasing them, such as tweaking a sentence in various ways to develop a set of alternative sentences.
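To make the idea of a combined critique score concrete, here is a minimal Python sketch. It assumes each candidate sentence already has three scores between 0 and 1 and that the critique score is a simple weighted sum; the `Metrics` class, the weights and the function names are illustrative assumptions, not the formula used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical per-sentence scores, each assumed to lie in [0, 1]."""
    fluency: float            # higher = reads more naturally
    similarity: float         # higher = closer in meaning to the original
    misclassification: float  # higher = classifier more likely to be fooled


def critique_score(m, weights=(1.0, 1.0, 1.0)):
    """Combine the three metrics into one number (assumed weighted sum)."""
    w_fluency, w_similarity, w_misclass = weights
    return (w_fluency * m.fluency
            + w_similarity * m.similarity
            + w_misclass * m.misclassification)


def best_candidate(candidates):
    """Return the text of the (text, Metrics) pair with the highest score."""
    return max(candidates, key=lambda pair: critique_score(pair[1]))[0]
```

In this sketch, a candidate that is fluent and close in meaning but does not fool the classifier can still lose to one that scores slightly lower on fluency while causing a misclassification; how the real framework balances the three metrics is described in the paper.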

“Given 30K words in its vocabulary, it can produce an arbitrary number of sentences. Then it winnows these down to the highest-quality sentences in terms of grammatical quality, fluency and semantic similarity to the original sentence,” Veeramachaneni told VentureBeat.

[Figure: The R&R framework. Image source: MIT.]

To do this, it uses an LLM trained on high-quality sentences to remove sentences that are not grammatically correct or fluent. First, it attempts to rewrite the whole sentence, with no limit on how many words are changed; then it tries to roll back some edits to arrive at a minimal set of modifications.
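The two steps can be illustrated with a toy Python sketch. Everything here is a simplifying assumption: a small substitution table stands in for the language model, sentences keep the same length, and a caller-supplied `still_ok` check stands in for the quality and misclassification criteria. It shows the rewrite-then-rollback pattern, not the team's implementation.

```python
import random


def rewrite(words, substitutes, rng, mask_rate=0.5):
    """Stochastically replace several words at once.

    `substitutes` is a hypothetical word -> alternatives table that stands
    in for a language model filling masked positions.
    """
    out = list(words)
    for i, word in enumerate(words):
        if word in substitutes and rng.random() < mask_rate:
            out[i] = rng.choice(substitutes[word])
    return out


def rollback(original, rewritten, still_ok):
    """Greedily undo edits, left to right, one position at a time.

    An undo is kept only if `still_ok` (a hypothetical acceptance check)
    still holds, so the result keeps a reduced set of modifications.
    """
    current = list(rewritten)
    for i in range(len(original)):
        if current[i] != original[i]:
            trial = list(current)
            trial[i] = original[i]
            if still_ok(trial):
                current = trial
    return current
```

For example, if "the movie was really good" is rewritten to "the film was truly fine" and the acceptance check only requires the word "fine", the rollback restores "movie" and "really" and leaves a single changed word.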

“Because text classifiers generally need to be trained on human-labeled data, they are often trained with small datasets, meaning they can easily be fooled and misclassify sentences. We used R&R to generate many of these sentences that could fool a text classifier and therefore could be used to train and improve it,” explained Veeramachaneni.

It’s also possible to use R&R to transform a low-quality or poorly written sentence into a better-quality sentence. Such a method can have several applications, from editing assistance for human writing to creating more data for LLMs.

The stochastic rewrite feature allows the tool to explore a larger text space, and the rollback feature allows it to make meaningful changes with minimal edits. The combination is powerful because it explores many options and can find multiple different adversarial examples for the same sentence. As a result, R&R can generate fluent sentences that are semantically similar to a target sentence without human intervention.

“The primary use case of R&R is to conduct adversarial attacks on text classifiers,” said Veeramachaneni. “Given a sentence, it can find similar sentences where the classifier misclassified. R&R-generated sentences can help expand these training sets, thus improving text classifiers’ quality, which may also increase their potential applications.”

Discussing the challenges faced while developing R&R, Veeramachaneni told VentureBeat that traditional methods for finding alternative sentences change only one word at a time. When designing the rewrite step, the team initially took the same approach and masked a single word at a time, but found that this changed the meaning of the original sentence.

“Such a design led to the model getting stuck because there are not many options for a single masked position,” he said. “We overcome this by masking multiple words in each step. This new design also enabled the model to change the length of the text. Hence we introduced the rollback step, which eliminates unnecessary perturbations/changes.”
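A rough Python sketch of what masking several positions in one step might look like follows. The `[MASK]` token, the `mask_multiple` helper and the choice to insert one extra mask so the sentence can grow are all assumptions made for illustration; the article does not describe how the model changes sentence length.

```python
import random

MASK = "[MASK]"  # placeholder token; the real model's mask token may differ


def mask_multiple(words, rng, n_masks=2, allow_insert=True):
    """Mask several distinct positions in one step.

    If `allow_insert` is set, one extra mask is inserted at a random
    position so that a later fill-in step could lengthen the sentence
    (an assumed mechanism, used here only to show a length change).
    """
    out = list(words)
    positions = rng.sample(range(len(out)), k=min(n_masks, len(out)))
    for i in positions:
        out[i] = MASK
    if allow_insert:
        out.insert(rng.randrange(len(out) + 1), MASK)
    return out
```

With more than one open slot per step, a fill-in model has many more combinations to choose from than with a single masked word, which is the intuition behind the quote above.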

The research team says that R&R can also help people change their writing in pursuit of a specific goal: for instance, it can be used to make a sentence more persuasive, more concise, etc. Both automatic and human evaluation of the R&R framework showed that the proposed method succeeds in optimizing the automatic similarity and fluency metrics to generate adversarial examples of higher quality than previous methods.

The future of LLMs and generative AI

Veeramachaneni believes that LLMs will push the boundaries for human discourse in the near future and hopes to see more applications of LLMs in 2023.

“LLMs will be able to quickly and easily summarize and provide existing information. As a result, what we write and our interactions with each other will have to be more meaningful and insightful. It is progress,” he said.

Veeramachaneni further explained that LLMs are currently only being used to summarize text or answer questions, but there are many more possible applications.

“As the potential of these tools is continually realized, we expect a usage boom. The recent release of ChatGPT by OpenAI has demonstrated good text-generation capability. We can expect tech giants to compete on larger models and release larger models with better performance,” said Veeramachaneni.

“At the same time, we expect serious evaluations of LLMs’ limitations and vulnerabilities. It is clear that LLMs can produce meaningful, readable sentences. Now, we expect people to begin focusing on evaluating the factual information contained in the generated text.”
