Just how comprehensively do natural language processing (NLP) pipelines support widely spoken languages? A recent study coauthored by researchers at Clarkson University and Iona College investigated the degree to which NLP tools understand eight languages: English, Chinese, Urdu, Farsi, Arabic, French, Spanish, and the Senegalese language Wolof. The findings suggest that even when a tool technically supports a language, caveats remain that prevent full participation and lead to the underrepresentation of certain voices.
A typical NLP pipeline involves gathering corpora, processing them into text, identifying language elements, training models, and using those models to answer specific questions. The degree to which some languages are underrepresented in data sets is well recognized, but the ways in which that underrepresentation is magnified throughout the NLP toolchain are less discussed, the researchers say.
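The stages the researchers describe can be sketched as a toy pipeline. The code below is purely illustrative and is not the study's code: a naive tokenizer stands in for text processing, and a word-frequency counter stands in for model training.

```python
import re
from collections import Counter

def gather(source: str) -> str:
    """Stage 1: gather a corpus (here, just a raw string)."""
    return source

def to_tokens(text: str) -> list[str]:
    """Stages 2-3: process into text and identify language elements
    (a naive word tokenizer stands in for both)."""
    return re.findall(r"\w+", text.lower())

def train(tokens: list[str]) -> Counter:
    """Stage 4: 'train' a model (word frequencies stand in for a real model)."""
    return Counter(tokens)

def answer(model: Counter, word: str) -> int:
    """Stage 5: use the model to answer a question (how common is a word?)."""
    return model[word]

model = train(to_tokens(gather("the cat sat on the mat")))
print(answer(model, "the"))  # 2
```

Every stage of this chain is language-sensitive: a tokenizer that splits on whitespace, for instance, already assumes conventions that not all written languages share.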
The vast majority of NLP tools are developed for English, and even when they gain support for other languages, they often lag behind the English versions in robustness, accuracy, and efficiency, the coauthors assert. In the case of BERT, a state-of-the-art pretraining technique for NLP, developers released an English model first and Chinese and multilingual models only later. And the single-language models retain performance advantages: both the English and Chinese monolingual models perform about 3% better than the combined English-Chinese model. Moreover, when smaller BERT models were released for teams with restricted computational resources, all 24 were English-only.
Lack of representation at each stage of the pipeline adds to a lack of representation in later stages, the researchers say. As something of a case in point, the multilingual BERT model was trained on the top 100 languages with the largest Wikipedia article databases, but there are substantial differences in the size and quality of the databases when adjusting for the number of speakers. They vary not only by the file size of the corpora and the total number of pages, but along dimensions including the percentage of stubs without content, number of edits, number of admins working in that language, total number of users, and total number of active users.
For example, there are approximately:
- 1.12 million Wikipedia articles in Chinese, or 0.94 articles per 1,000 speakers (given an estimated 1.19 billion speakers worldwide)
- 6.1 million articles in English, or 12.08 articles per 1,000 speakers (given 505 million speakers worldwide)
- 1.6 million in Spanish, or 3.42 articles per 1,000 speakers (given 470 million speakers worldwide)
- 1.04 million articles in Arabic, or 3.33 articles per 1,000 speakers (given 315 million speakers worldwide)
- 2.22 million articles in French, or 29.70 articles per 1,000 speakers (given 75 million speakers worldwide)
- 732,106 articles in Farsi, or 10.17 articles per 1,000 speakers (given 72 million speakers worldwide)
- 155,298 articles in Urdu, or 2.43 articles per 1,000 speakers (given 64 million speakers worldwide)
- 1,393 articles in Wolof, or 0.14 articles per 1,000 speakers (given 10 million speakers worldwide)
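The per-1,000-speaker figures above are simple ratios of article count to speaker count; as a quick sanity check on the two extremes of the list:

```python
def per_thousand(articles: int, speakers: int) -> float:
    """Wikipedia articles per 1,000 speakers of a language."""
    return articles / speakers * 1000

# English: 6.1 million articles, 505 million speakers
print(round(per_thousand(6_100_000, 505_000_000), 2))  # 12.08
# Wolof: 1,393 articles, 10 million speakers
print(round(per_thousand(1_393, 10_000_000), 2))       # 0.14
```

By this measure, English is more than 80 times better covered than Wolof relative to its speaker population.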
The databases are even less representative than they might appear, because not all speakers of a language have access to Wikipedia. In the case of Chinese, Wikipedia is blocked by the Chinese government, so Chinese-language articles are more likely to have been contributed by the roughly 40 million Chinese speakers in Taiwan, Hong Kong, Singapore, and overseas.
Technical hurdles also tend to be higher for some languages than others, the researchers found. For instance, a script they used to download the Chinese, English, Spanish, Arabic, French, and Farsi corpora from Wikipedia experienced a 0.13% error rate for Farsi and a 0.02% error rate for Chinese, but no errors across 5 million English articles. And the script couldn't handle the Urdu and Wolof corpora at all, because it lacked support for their formats.
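The paper doesn't publish its download script, but the error-rate bookkeeping it implies can be sketched generically. In the sketch below, `fetch` is a placeholder for whatever per-article download call is used in practice, and the demo fetcher is invented purely for illustration:

```python
def bulk_fetch(titles, fetch):
    """Fetch each title with the supplied callable, tallying failures
    instead of aborting the whole run, as a bulk downloader might."""
    texts, errors = [], 0
    for title in titles:
        try:
            texts.append(fetch(title))
        except Exception:
            errors += 1  # record the failure and move on
    rate = errors / len(titles) if titles else 0.0
    return texts, rate

# Demo with a stand-in fetcher that fails on one title in ten.
def fake_fetch(title):
    if title.endswith("7"):
        raise IOError("download failed")
    return f"text of {title}"

texts, rate = bulk_fetch([f"article-{i}" for i in range(10)], fake_fetch)
print(f"{rate:.0%} error rate")  # 10% error rate
```

The point of tallying rather than aborting is that a per-language error rate, like the 0.13% the researchers saw for Farsi, only becomes visible if failures are counted instead of silently skipped or fatal.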
Beyond Wikipedia, the researchers ran into issues assembling ebooks in each language, which are often used to train NLP models. For Arabic and Urdu, many titles were available only as scanned images rather than as text, requiring processing by optical character recognition (OCR) tools whose accuracy ranged from 70% to 98%. With Chinese ebooks, the OCR tool the researchers used incorrectly added spaces at each new line. And because the Wolof language doesn't have a written character set, the team had to rely on English, French, and Arabic transcriptions that may have taken stylistic liberties.
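The study doesn't say how the spurious spaces in the Chinese OCR output were handled. One minimal fix, assuming the stray spaces appear only between CJK characters (an assumption, since Chinese normally isn't written with spaces between characters), is a regex with lookaround; the character range below covers only the common CJK Unified Ideographs block.

```python
import re

# Common CJK Unified Ideographs; extension blocks are omitted for brevity.
CJK = r"[\u4e00-\u9fff]"

def strip_cjk_spaces(text: str) -> str:
    """Remove spaces sandwiched between two CJK characters, as an OCR
    tool might insert at line breaks, while leaving other spacing intact."""
    return re.sub(rf"(?<={CJK}) +(?={CJK})", "", text)

print(strip_cjk_spaces("自然 语言 处理"))  # 自然语言处理
```

The lookbehind/lookahead pair ensures that legitimate spaces, such as those around embedded Latin words or numbers, are preserved.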
“Despite huge and admirable investments in multilingual support in projects like Wikipedia and BERT we are still making NLP-guided decisions that systematically and dramatically underrepresent the voices of much of the world,” the researchers wrote. “We document how lack of representation in the early stages of the NLP pipeline (e.g. representation in Wikipedia) is further magnified throughout the NLP-tool chain, culminating in reliance on easy-to-use pre-trained models that effectively prevents all but the most highly resourced teams from including diverse voices. We highlight the difficulties that speakers of many languages still face in having their thoughts and expressions fully included in the NLP-derived conclusions that are being used to direct the future for all of us.”