Huawei trained the Chinese-language equivalent of GPT-3

For the better part of a year, OpenAI's GPT-3 has remained among the largest AI language models ever created, if not the largest of its kind. Via an API, people have used it to automatically write emails and articles, summarize text, compose poetry and recipes, create website layouts, and generate code for deep learning in Python. But GPT-3 has key limitations, chief among them that it's only available in English. The 45-terabyte dataset the model was trained on drew exclusively from English-language sources.

This week, a research team at Chinese company Huawei quietly detailed what might be the Chinese-language equivalent of GPT-3. Called PanGu-Alpha (stylized PanGu-α), the 750-gigabyte model contains up to 200 billion parameters -- 25 billion more than GPT-3's 175 billion -- and was trained on 1.1 terabytes of Chinese-language ebooks, encyclopedias, news, social media, and web pages.

The team claims that the model achieves "superior" performance in Chinese-language tasks spanning text summarization, question answering, and dialogue generation. Huawei says it's seeking a way to let nonprofit research institutes and companies gain access to pretrained PanGu-α models, either by releasing the code, model, and dataset or via APIs.

Familiar architecture

In machine learning, parameters are the part of the model that's learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well.

Large language models like OpenAI's GPT-3 learn to write humanlike text by internalizing billions of examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs.

Akin to GPT-3, PanGu-α is what's called a generative pretrained transformer (GPT), a language model that is first pretrained on unlabeled text and then fine-tuned for tasks. Using Huawei's MindSpore framework for development and testing, the researchers trained the model on a cluster of 2,048 Huawei Ascend 910 AI processors, each delivering 256 teraflops of computing power.
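As a rough back-of-the-envelope check, the cluster figures above can be multiplied out. The sketch below assumes perfect scaling across processors, which is an idealization; real training runs achieve only a fraction of peak throughput.

```python
# Figures quoted in the article.
NUM_PROCESSORS = 2_048
TERAFLOPS_PER_PROCESSOR = 256

# Peak aggregate throughput, assuming perfect scaling (an idealization,
# not a measured number from the PanGu-α team).
total_teraflops = NUM_PROCESSORS * TERAFLOPS_PER_PROCESSOR
total_petaflops = total_teraflops / 1_000

print(f"{total_teraflops:,} teraflops (~{total_petaflops:.0f} petaflops)")
```

Under that assumption, the cluster's theoretical peak works out to roughly half an exaflop.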

To build the training dataset for PanGu-α, the Huawei team collected nearly 80 terabytes of raw data from public datasets, including the popular Common Crawl dataset, as well as the open web. They then filtered the data, removing documents in which fewer than 60% of the characters were Chinese, documents shorter than 150 characters, and documents consisting only of titles, advertisements, or navigation bars. Text was converted into simplified Chinese, and the team filtered out spam, "low-quality" samples, and text flagged against a list of 724 potentially offensive words.
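The length and language thresholds described above could be sketched roughly as follows. This is an illustration, not the team's pipeline: the function names are hypothetical, "Chinese characters" is assumed to mean the CJK Unified Ideographs range, whitespace is assumed to be excluded from the ratio, and the blocklist parameter is a stand-in for the 724-word list, whose contents and matching rule the article doesn't describe.

```python
import re

# Assumption: "Chinese characters" means the CJK Unified Ideographs block
# (U+4E00 to U+9FFF). The article does not say how the team counted them.
CJK = re.compile(r"[\u4e00-\u9fff]")

MIN_CHINESE_RATIO = 0.60  # threshold quoted in the article
MIN_LENGTH = 150          # threshold quoted in the article


def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK.match(c)) / len(chars)


def keep_document(text: str, blocklist: frozenset = frozenset()) -> bool:
    """Return True if a document passes the length, language, and word filters."""
    if len(text) < MIN_LENGTH:
        return False
    if chinese_ratio(text) < MIN_CHINESE_RATIO:
        return False
    # Hypothetical rule: drop the document if any blocklisted word appears.
    return not any(word in text for word in blocklist)


# Example: only the first document is long enough and mostly Chinese.
docs = ["你好世界" * 50, "hello world " * 50, "短文本"]
kept = [d for d in docs if keep_document(d)]
```

A production pipeline would also need the title, advertisement, and navigation-bar detection mentioned above, which this sketch leaves out.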

One crucial difference between GPT-3 and PanGu-α is the number of tokens on which the models trained. Tokens, the smaller units into which text is split in natural language processing, can be words, characters, or parts of words. While GPT-3 trained on 499 billion tokens, PanGu-α trained on only 40 billion, suggesting it's comparatively undertrained.
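To make the three kinds of token concrete, here is a minimal sketch that splits one sentence three ways and then compares the two training budgets. The fixed-size chunking in chunk_word is a hypothetical stand-in for subword tokenization; neither model is claimed to tokenize this way.

```python
text = "Language models read text as tokens."

# Word-level tokens: split on whitespace (a simplification that leaves
# punctuation attached to the neighboring word).
word_tokens = text.split()

# Character-level tokens: every character, spaces included, is a token.
char_tokens = list(text)


def chunk_word(word: str, size: int = 3) -> list:
    """Hypothetical subword splitter: cut a word into fixed-size pieces."""
    return [word[i:i + size] for i in range(0, len(word), size)]


# "Parts of words": naive fixed-size chunks, for illustration only.
subword_tokens = [piece for word in word_tokens for piece in chunk_word(word)]

# Training budgets quoted in the article, in billions of tokens.
GPT3_TOKENS_B = 499
PANGU_TOKENS_B = 40
ratio = GPT3_TOKENS_B / PANGU_TOKENS_B

print(len(word_tokens), len(subword_tokens), len(char_tokens))
print(f"GPT-3 saw about {ratio:.1f}x as many tokens as PanGu-α")
```

Because the two models split text into tokens differently, the ratio of roughly 12 to 1 is only an approximate comparison of how much text each saw.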

In experiments, the researchers say that PanGu-α was particularly adept at writing poetry, fiction, and dialogue as well as summarizing text. Absent fine-tuning on examples, PanGu-α could generate poems in the Chinese forms of gushi and duilian. And given a brief conversation as a prompt, the model could brainstorm rounds of "plausible" follow-up dialogue.

This isn't to suggest that PanGu-α solves all of the problems plaguing language models of its size. A focus group tasked with evaluating the model's outputs found 10% of them to be "unacceptable" in terms of quality. And the researchers observed that some of PanGu-α's creations contained irrelevant, repetitive, or illogical sentences.

The PanGu-α team also didn't address some of the longstanding challenges in natural language generation, including the tendency of models to contradict themselves. Like GPT-3, PanGu-α can't remember earlier conversations, and it lacks the ability to learn concepts through further conversation and to ground entities and actions to experiences in the real world.

"The main point of excitement is the extension of these large models to Chinese," Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, told VentureBeat via email. "In other ways, it's similar to GPT-3 in both its benefits and risks. Like GPT-3, it's a huge model and can generate plausible outputs in a variety of scenarios, and so it's exciting that we can extend this to non-English scenarios ... By constructing this huge dataset, [Huawei is] able to train a model in Chinese at a similar scale to English models like GPT-3. So in sum, I'd point to the dataset and the Chinese domain as the most interesting factors, rather than the model architecture, though training a big model like this is always an engineering feat."

Skepticism

Indeed, many experts believe that while PanGu-α and similarly large models are impressive with respect to their performance, they don't move the ball forward on the research side of the equation. Rather, they're prestige projects that demonstrate the scalability of existing techniques or that serve as a showcase for a company's products.

"I think the best analogy is with some oil-rich country being able to build a very tall skyscraper," Guy Van den Broeck, an assistant professor of computer science at UCLA, said in a previous interview with VentureBeat. "Sure, a lot of money and engineering effort goes into building these things. And you do get the 'state of the art' in building tall buildings. But there is no scientific advancement per se ... I'm sure academics and other companies will be happy to use these large language models in downstream tasks, but I don't think they fundamentally change progress in AI."

Even OpenAI's GPT-3 paper hinted at the limitations of merely throwing more compute at problems in natural language. While GPT-3 completes tasks from generating sentences to translating between languages with ease, it fails to perform much better than chance on a test -- adversarial natural language inference -- that tasks it with discovering relationships between sentences.

The PanGu-α team makes no claim that the model overcomes other blockers in natural language, like answering math problems correctly or responding to questions without paraphrasing training data. More problematically, their experiments didn't probe PanGu-α for the types of bias and toxicity found to exist in models like GPT-3. OpenAI itself notes that GPT-3 places words like "naughty" or "sucked" near female pronouns and "Islam" near terms like "terrorism." A separate paper by Stanford University Ph.D. candidate and Gradio founder Abubakar Abid details the inequitable tendencies of text generated by GPT-3, like associating the word "Jews" with "money."

Carbon impact

Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who's disadvantaged. A paper coauthored by Gebru earlier this year spotlights the impact of large language models' carbon footprint on minority communities and such models' tendency to perpetuate abusive language, hate speech, microaggressions, stereotypes, and other dehumanizing language aimed at specific groups of people.

In particular, the effects of AI and machine learning model training on the environment have been brought into relief. In June 2019, researchers at the University of Massachusetts at Amherst released a report estimating that training and searching a certain model can emit roughly 626,000 pounds of carbon dioxide, equivalent to nearly 5 times the lifetime emissions of the average U.S. car.

While the environmental impact of training PanGu-α is unclear, it's likely that the model's footprint is substantial -- at least compared with language models a fraction of its size. As the coauthors of a recent MIT paper wrote, evidence suggests that deep learning is approaching computational limits. "We do not anticipate that the computational requirements implied by the targets ... The hardware, environmental, and monetary costs would be prohibitive," the researchers said. "Hitting this in an economical way will require more efficient hardware, more efficient algorithms, or other improvements such that the net impact is this large a gain."

Antoniak says that it's an open question as to whether larger models are the right approach in natural language. While the best performance scores on tasks currently come from large datasets and models, whether the pattern of dumping enormous amounts of data into models will pay off is uncertain. "The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets," she said. "These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate."

Future directions

The PanGu-α team's choices aside, they might not have long to set standards that address the language model's potential impact on society. A paper published by researchers from OpenAI and Stanford University found that large language model developers like Huawei, OpenAI, and others may only have a six- to nine-month advantage until others can reproduce their work. EleutherAI, a community of machine learning researchers and data scientists, expects to release an open source implementation of GPT-3 in August.

The coauthors of the OpenAI and Stanford paper suggest ways to address the negative consequences of large language models, such as enacting laws that require companies to acknowledge when text is generated by AI -- perhaps along the lines of California's bot law. Other recommendations include:

Training a separate model that acts as a filter for content generated by a language model
Deploying a suite of bias tests to run models through before allowing people to use the model
Avoiding some specific use cases

The consequences of failing to take any of these steps could be catastrophic over the long term. In recent research, the Middlebury Institute of International Studies' Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 could reliably generate "informational" and "influential" text that might radicalize people into violent far-right extremist ideologies and behaviors. And toxic language models deployed into production might struggle to understand aspects of minority languages and dialects. This could force people using the models to switch to "white-aligned English," for example, to ensure that the models work better for them, which could discourage minority speakers from engaging with the models to begin with.

Given Huawei's ties with the Chinese government, there's also a concern that models like PanGu-α could be used to discriminate against marginalized peoples including Uyghurs living in China. A Washington Post report revealed that Huawei tested facial recognition software that could send automated "Uighur alarms" to government authorities when its camera systems identified members of the minority group.

We've reached out to Huawei for comment and will update this article once we hear back.

"Humans are also full of biases and toxicity, so I don't think learning like a human is a solution to these problems," Antoniak said. "Scholars think that perhaps we should try to better model how humans learn language -- [at least] in relation to language understanding, not toxicity. It would be possible to understand language and still be very toxic, after all."