How Hugging Face is tackling bias in NLP

Natural language processing (NLP) models, like other artificial intelligence (AI) systems, need to train on large volumes of data. Unfortunately, many researchers are unable to access or develop the models and datasets necessary for robust systems -- they are mostly the purview of large technology companies.

Hugging Face, the winner of VentureBeat's Innovation in Natural Language Processing/Understanding Award for 2021, is looking to level the playing field. The team, launched by Clément Delangue and Julien Chaumond in 2016, was recognized for its work in democratizing NLP, a market whose global value is expected to hit $35.1 billion by 2026. This week, Google's former head of Ethical AI Margaret Mitchell joined the team.

There are many reasons to democratize access to NLP, says Alexander (Sasha) Rush, associate professor at Cornell University and a researcher at Hugging Face. In addition to the technology being shaped and developed by just a few large tech companies, the field can be overly focused on English, he pointed out in an email interview. Also, "text data can be particularly sensitive to privacy or security concerns," Rush said. "Users often want to run their own version of a model," he added.

Today, Hugging Face has expanded to become a robust NLP startup, known primarily for making open-source software such as Transformers and Datasets, which are used for building NLP systems. "The software Hugging Face develops can be used for classification, question answering, translation, and many other NLP tasks," Rush said. Hugging Face also hosts a range of pretrained NLP models on GitHub that practitioners can download and apply to their problems, Rush added.

The datasets challenge

One of the many projects Hugging Face works on is related to datasets. Datasets are essential to NLP -- "Every system from translation to question answering to dialogue starts with a dataset for training and evaluation," Rush said -- and their numbers have been growing.

"As NLP systems have started to become more accurate there has been a large growth in the number and size of datasets produced by the NLP community, both by academics and community practitioners," Rush pointed out. According to Rush, chief scientist Thomas Wolf developed Datasets "to help standardize the distribution, documentation and versioning of these datasets, while also making them easy and efficient to access."

Hugging Face's Datasets project is a community library for natural language processing that has collected 650 unique datasets from more than 250 global contributors. Datasets has facilitated a large variety of research projects. "In particular we are seeing new use cases where users run the same system across dozens of different datasets to test generalization of models and robustness on new tasks. For instance, models like OpenAI's GPT-3 use a benchmark of many different tasks to test ability to generalize, a style of benchmarking that Datasets makes possible and easy to do," Rush said.
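
The benchmarking style Rush describes, running one system across many datasets and comparing the scores, can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the two toy datasets, their examples, and the keyword-based "system" are invented for illustration, and the code uses only the Python standard library rather than the Datasets library itself. In practice, the in-memory lists would be replaced by datasets loaded from a shared repository.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, gold label)

# Hypothetical toy datasets, standing in for datasets a practitioner
# would normally download from a community repository.
TOY_DATASETS: Dict[str, List[Example]] = {
    "movie_reviews": [
        ("a great film", "pos"),
        ("a terrible plot", "neg"),
    ],
    "product_reviews": [
        ("great value", "pos"),
        ("terrible battery", "neg"),
        ("not great at all", "neg"),  # the keyword rule gets this one wrong
    ],
}


def keyword_classifier(text: str) -> str:
    """A deliberately simple 'system': predict 'pos' if the text mentions 'great'."""
    return "pos" if "great" in text.lower() else "neg"


def accuracy(system: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the system's prediction matches the gold label."""
    correct = sum(1 for text, label in examples if system(text) == label)
    return correct / len(examples)


def benchmark(
    system: Callable[[str], str], datasets: Dict[str, List[Example]]
) -> Dict[str, float]:
    """Run the same system on every dataset and report per-dataset accuracy."""
    return {name: accuracy(system, examples) for name, examples in datasets.items()}


scores = benchmark(keyword_classifier, TOY_DATASETS)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A system that looks perfect on one dataset can stumble on another, which is the kind of generalization gap this style of benchmarking is meant to expose.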

Addressing diversity and bias

Datasets is just one of the many projects Hugging Face is working on; the startup also tackles larger questions related to the field of AI. To address the challenge of increasing the diversity of language-related datasets, the startup is making it as easy as possible for any community member to add a dataset, Rush said. Hugging Face is also hosting joint community events with interest groups such as Bengali NLP and Masakhane, a grassroots NLP community for Africa.

Bias in AI datasets is a known problem, and Hugging Face is tackling the challenge by strongly encouraging users to write extensive documentation, including known biases, when they add a dataset. Hugging Face provides users with a template and guide for this process. "We do sometimes ask users to reconsider adding datasets if they are problematic," said Yacine Jernite, a research scientist at Hugging Face, via email. "This has only happened in rare cases and through a direct conversation with the person who suggested the dataset." In one instance, a community member was looking to add problematic jokes from Reddit, so Hugging Face talked to the user, who took them down.

Hugging Face is also knee-deep in a project called BigScience, an international, multi-company, multi-university research project with over 500 researchers, designed to better understand and improve results on large language models. "The project is multi-faceted as well, incorporating both engineering aspects of how to produce larger, more accurate models with groups studying social and environmental impact and data governance," Rush said.
