NLP needs to be open. 500+ researchers are trying to make it happen

The acceleration in Artificial Intelligence (AI) and Natural Language Processing (NLP) will have a fundamental impact on society, as these technologies are at the core of the tools many of us use on a daily basis. However, the resources necessary to create the best-performing AI and NLP models are found mainly at technology giants.

The stranglehold tech giants have on this transformative technology poses a number of problems, ranging from who decides which research gets shared to its environmental and ethical impacts. For example, while recent NLP models such as GPT-3 (from OpenAI and Microsoft) show interesting behaviors from a research point of view, such models are private, and many academic organizations get only restricted access or no access at all. That makes it impossible to answer important questions about these models and to study their capabilities, limitations, potential improvements, bias, and fairness.

A group of more than 500 researchers from 45 different countries -- from France, the US, and Japan to Indonesia, Ghana, and Ethiopia -- has come together to work towards tackling some of these problems. The project, which the authors of this article are all involved in, is called BigScience. Our goal is to improve the scientific understanding of the capabilities and limitations of large-scale neural network models in NLP, and to create a diverse and multilingual dataset and a large-scale language model as research artifacts, open to the scientific community.

BigScience was inspired by collaborative models in other scientific fields, such as CERN and the LHC in particle physics, in which open scientific collaborations facilitate the creation of large-scale artifacts useful for the entire research community. So far, a broad range of institutions and disciplines have joined the project in its year-long effort, which started in May 2021.

The project has more than 20 working groups and subgroups tackling different aspects of language modeling in parallel, some of which are closely related and interdependent. Data plays a crucial role in the process. In machine learning, a model learns to make predictions based on data it has seen before. The datasets that large language models are typically trained on are massive, mostly English-centric, and sourced from the web, which raises questions about bias, fairness, ethics, and privacy, among others.

Thus, the collective seeks to assemble the training dataset intentionally, favoring linguistic, geographical, and social representativeness over the opportunistic practices that currently define the training data used in very large models. Our data effort also strives to identify the rights of the language owners, subjects, and communities. This is as much an organizational and social challenge as it is a technical challenge. The engineering and modeling groups, for instance, are dedicated to determining architecture design and scaling laws, with the concrete goal of training a language model with a capacity of up to 210 billion parameters on the French Jean Zay supercomputer at IDRIS.

One of our objectives is to uncover and understand the mechanisms that enable a language model to produce valid output for any natural-language task description it is given, without explicitly being trained to do so (an ability known as zero-shot behavior). Another point of interest is studying how a language model can be updated through time. We also have a group of researchers working on tokenization strategies for a diverse set of languages and on modeling multilinguality, to ensure that all NLP capabilities carry over to languages other than English. Others are working on the social impact, carbon footprint, data governance, and legal implications of NLP models and how to extrinsically and intrinsically evaluate them for accuracy.

As the output of this enormous effort, BigScience aims to share three things: a very large multilingual corpus constituted in a way that is responsible, diverse, and mindful of ethical and legal issues; a large-scale multilingual language model that exhibits non-trivial zero-shot behaviors and is accessible to all researchers; and the code and tools associated with these artifacts to enable easy use. Apart from that, this is an opportunity to create a blueprint for large-scale research initiatives in AI. Our effort keeps evolving and growing, with more researchers joining every day, making it already the biggest open science contribution in artificial intelligence to date.

Much like the tensions between proprietary and open-source software in the early 2000s, AI is at a turning point where it can either go in a proprietary direction, where large-scale state-of-the-art models are increasingly developed internally in companies and kept private, or in an open, collaborative, community-oriented direction, marrying the best aspects of open source and open science. It's essential that we make the most of this current opportunity to push AI onto that community-oriented path so that it can benefit society as a whole.

Yacine Jernite is a Research Scientist at Hugging Face. He coordinates the Data effort of the BigScience project as area chair and co-organizer of the data governance group.

Matthias Gallé leads various research teams at Naver Labs Europe, focused on developing AI for our Digital World. His focus for BigScience is on how to inspect, control, and update large pre-trained models.

Victor Sanh is a Research Scientist at Hugging Face. His research focuses on making NLP systems more robust for production scenarios and on the mechanisms behind generalization.

Samson Tan is a final-year computer science PhD candidate at the National University of Singapore and co-chair of the Tokenization working group in BigScience.

Thomas Wolf is co-founder and Chief Science Officer of Hugging Face and co-leader of the BigScience initiative.

Suzana Ilic is a Technical Program Manager at Hugging Face, co-leading the organization of BigScience.

Margaret Mitchell is an industrial AI research scientist and co-chair of the Data Governance working group in BigScience.
