Inside BigScience, the quest to build a powerful open language model

Roughly a year ago, Hugging Face, a Brooklyn, New York-based natural language processing startup, launched BigScience, an international project with more than 900 researchers that is designed to better understand and improve the quality of large natural language models. Large language models (LLMs) -- algorithms that can recognize, predict, and generate language on the basis of text-based datasets -- have captured the attention of entrepreneurs and tech enthusiasts alike. But the costly hardware required to develop LLMs has kept them largely out of reach of researchers without the resources of companies like OpenAI and DeepMind behind them.

Taking inspiration from organizations like the European Organization for Nuclear Research (also known as CERN) and its Large Hadron Collider, BigScience aims to create LLMs and large text datasets that will eventually be open-sourced to the broader AI community. The models will be trained on the Jean Zay supercomputer located near Paris, France, which ranks among the most powerful machines in the world.

While the implications for the enterprise might not be immediately clear, efforts like BigScience promise to make LLMs more accessible -- and transparent -- in the future. With the exception of several models created by EleutherAI, an open AI research group, few trained LLMs exist for research or deployment into production. OpenAI has declined to open source its most powerful model, GPT-3, in favor of exclusively licensing the source code to Microsoft. Meanwhile, companies like Nvidia have released the code for capable LLMs, but left the training of those LLMs to users with sufficiently powerful hardware.

"Obviously, competing directly with the behemoths is not really feasible, but as underdogs, we can leverage some of the things that make Hugging Face unique: the dynamism of a startup allows things to move fast, and the focus on open source allows us to work closely together with a strong community of like-minded researchers from academia and elsewhere," Douwe Kiela, who left Meta's (formerly Facebook's) AI research division his week to join Hugging Face as the new head of research, told VentureBeat via email. "[I]t is all about democratizing AI and leveling the playing field."

Democratizing LLMs

LLMs, like all language models, learn how likely words are to occur based on examples of text. Simpler models look at the context of a sequence of words, whereas larger models work at the level of sentences or whole paragraphs. Examples come in the form of text within training datasets, which contain terabytes to petabytes of data scraped from social media, Wikipedia, books, software hosting platforms like GitHub, and other sources on the public web.
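
To make the idea of "learning how likely words are to occur" concrete, here is a minimal sketch of the simplest kind of language model, one that only looks at the previous word. It is an illustration, not how BigScience or any production LLM is built; the function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count how often each word follows another across a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the model never saw this word in training
    return {w: c / total for w, c in followers.items()}


# Hypothetical three-sentence "training dataset".
toy_corpus = [
    "the model reads text",
    "the model predicts text",
    "the dataset contains text",
]

model = train_bigram_model(toy_corpus)
probs = next_word_probabilities(model, "the")
print(probs)
```

In this toy corpus, "model" follows "the" twice and "dataset" follows it once, so the sketch assigns them probabilities of two-thirds and one-third. Larger models replace the counting with billions of learned parameters and condition on far more context than a single word.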

Training a simple model can be achieved with commodity hardware, but the hurdles to deploying a state-of-the-art LLM are significant. LLMs like Nvidia's and Microsoft's Megatron 530B can cost up to millions of dollars to train from scratch, not accounting for expenses that might be incurred to store the model. Inference -- actually running the trained model -- is another barrier. One estimate pegs the cost of running GPT-3 on a single Amazon Web Services instance at a minimum of $87,000 per year.

EleutherAI's models and training dataset, which were released earlier this year, have made experimenting with and commercializing LLMs more feasible than before. But BigScience's work is broader in scope, with plans to not only train and release LLMs but address some of their major technical shortcomings.

Tackling inequality

The thrust of BigScience, which has its origins in discussions between Hugging Face chief science officer Thomas Wolf, GENCI's Stéphane Requena, and IDRIS' Pierre-François Lavallée, is a set of collaborative tasks aimed at creating a dataset and LLMs as tools for research -- including fostering discussions on the social impact of LLMs. A steering committee gives members of BigScience scientific and general advice, while an organization committee designs the tasks and organizes workshops, hackathons, and public events.

Different working groups within BigScience's organization committee are charged with tackling challenges like data governance, archival strategies, evaluation fairness, bias, and social impact. "When striving for more responsible data use in machine learning, one thing to remember is that we don’t have all the answers and that we cannot speak for everyone," Yacine Jernite, a research scientist at Hugging Face, told VentureBeat via email. "A good governance structure allows more stakeholders to be involved in the process, and allows people whose lives will be affected by a technology to have a voice regardless of their level of technical expertise."

One goal of the BigScience working groups is to collect data that are sufficiently diverse and representative of the aforementioned training datasets. Drawing on expertise from communities like Machine Learning Tokyo, VietAI, and Masakhane, in addition to books, formal publications, radio recordings, podcasts, and websites, the dataset aims to encode different regions, cultural contexts, and audiences across languages including Swahili, Arabic, Catalan, Chinese, French, Bengali, Indonesian, Portuguese, and Vietnamese.

The benefits of LLMs are unevenly distributed, and not strictly from a computation standpoint. English-language LLMs far outnumber LLMs trained in other languages, and after English, a handful of Western European languages dominate the field (in particular German, French, and Spanish). As the coauthors of a recent Harvard, George Mason, and Carnegie Mellon study on language technologies point out, the "economic prowess" of users of a language often drives the development of models rather than demographic demand.

Large multilingual and monolingual models trained in languages other than English, while rarely open-sourced, are becoming more common than they used to be -- thanks partly to corporate interests. But because of systemic biases in public data sources, non-English models don't always perform as well as their English-language counterparts. For example, languages in Wikipedia-based datasets vary not only by size but in the percentage of stubs without content, the number of edits, and the total number of users (because not all speakers of a language have access to Wikipedia). Beyond Wikipedia, ebooks in some languages, like Arabic and Urdu, are more commonly available as scanned images versus text, which requires processing with optical character recognition tools that can dip to as low as 70% accuracy.

As a part of its work, BigScience says that it has already produced a catalog of nearly 200 language resources distributed around the world. Contributors to the project have also created one of the largest public natural language catalogs for Arabic, called Masader, with over 200 datasets.

Modeling and training

BigScience has only begun the process of developing the LLMs, but its early work shows promise. With TPU Research Cloud credits and several hours of compute time on Jean Zay, BigScience researchers trained and evaluated a model called T0 (short for "T5 for zero-shot") that outperforms GPT-3 on a number of English-language benchmarks -- while being 16 times smaller than GPT-3. The most capable version -- dubbed T0++ -- can perform tasks it hasn't been explicitly trained to do, like generating cooking instructions for recipes and responding to questions about religion, human aging, machine learning, and ethics.

While T0 was trained on a range of publicly available, English-only datasets, future models will build on learnings from BigScience's data-focused working groups.

Much work remains to be done. BigScience researchers found that T0++ worrisomely generates conspiracy theories and exhibits gender bias, for example associating the word "woman" with "nanny" versus "man" with "architect" and answering in the affirmative when asked questions like "Do vaccines cause autism?" or "Is the earth flat?" The next phase of development will involve experiments with a model containing 104 billion parameters -- a little more than half the parameters in GPT-3 -- which will inform the final step in BigScience's roadmap: training a multilingual model with up to 200 billion total parameters. Parameters are the part of an algorithm that's learned from historical training data, and they often -- but not always -- correspond to sophistication.
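
As a rough illustration of what a parameter count measures, the sketch below tallies the learned weights and biases in a tiny, hypothetical fully connected network, then compares the 104-billion-parameter experiment with GPT-3. The 175 billion figure for GPT-3 is the commonly reported one and does not come from this article; the layer sizes are made up for the example.

```python
def dense_layer_parameters(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def total_parameters(layer_sizes):
    """Sum the parameters of consecutive fully connected layers."""
    return sum(
        dense_layer_parameters(a, b)
        for a, b in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical toy network: 512 inputs -> 2,048 hidden units -> 512 outputs.
toy_network = [512, 2048, 512]
toy_count = total_parameters(toy_network)

# Assumption: GPT-3's commonly reported size, not stated in the article.
GPT3_PARAMETERS = 175_000_000_000
# From the article: the size of BigScience's next experimental model.
BIGSCIENCE_EXPERIMENT_PARAMETERS = 104_000_000_000

ratio = BIGSCIENCE_EXPERIMENT_PARAMETERS / GPT3_PARAMETERS
print(f"Toy network parameters: {toy_count:,}")
print(f"104B model relative to GPT-3: {ratio:.0%}")
```

Even this small toy network has roughly two million parameters. Under the 175 billion assumption, the 104-billion-parameter model comes out at about 59% of GPT-3's size, consistent with "a little more than half."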

"Our modeling group has focused on drafting and validating an architecture and training setup that will allow us to get the best out of our final GPU budget," Julien Launay, a research scientist at AI chip startup LightOn and lead of architecture at BigScience, told VentureBeat. "We need to make sure the final architecture is proven, scalable, efficient, and suitable for multilingual training."

Max Ryabinin, a research scientist at Yandex who's contributing to BigScience's model design work, says that one of the major engineering challenges is ensuring the stability of BigScience's large-scale language model training experiments. Although it's possible to obtain smaller models without "significant issues," at the over-10-billion-parameter scale, the process becomes much less predictable, he said.

"Unfortunately, right now this question is not covered in detail by most research papers: even those that describe largest neural networks to date often omit the information about how to tackle such instabilities, and without this knowledge, it becomes much harder to reproduce the results of these large models," Ryabinin told VentureBeat. "Hence, we decided to run a series of preliminary experiments on a smaller 100 billion scale to encounter as many instabilities as possible before the main run, to compare different known methods for mitigating these instabilities, and to openly document our findings for the benefit of the broader machine learning community."

Working groups

Meanwhile, newly formed BigScience working groups will investigate and develop frameworks addressing the impact of LLMs on privacy, including informed consent. One team is exploring the legal challenges that might come up during the design of LLMs and working on an ethical charter for BigScience. Another is looking into creating LLM datasets, models, and software tools for proving theorems in mathematics.

"From a legal perspective, the ethical and legal challenges stemming from potential misuses of LLMs have pushed us to design a specific licensing framework for the artifacts we develop. Consequently, we are currently working on an open license integrating a set of use-based restrictions we identified as potentially harmful for individuals," Carlos Muñoz Ferrandis, a researcher at the Max Planck Institute for Innovation and a member of BigScience, told VentureBeat via email. "There is a balance to be struck between, on the one hand, our aim of maximizing open access to natural language processing-related artifacts, and on the other, the potential harmful uses of the latter due to licensing frameworks not taking into account the capabilities of LLMs. With regards to data and its governance, we are also taking into account legal challenges such as the negotiation with specific data providers for the use of their datasets, or, critical legal considerations to take into account when crawling data from the web -- e.g., personal information; legal uncertainty around copyright-related exceptions."

According to Margaret Mitchell, who heads up data governance efforts at Hugging Face, Hugging Face's 2022 plans -- some of which will support BigScience -- include an increased focus on tooling for AI workflows, developing libraries for LLM training and evaluation, and standardizing "data cards" and "model cards" that provide information about LLMs' capabilities. Mitchell previously cofounded and led Google's ethical AI team before the company controversially dismissed her for what it claims were code of conduct violations.

"[We're developing] tooling for data development that removes the barrier of needing to directly code, opening the door for people from non-engineering backgrounds who have expertise in critical areas for AI right now, such as social science, to develop directly ... This makes it a lot easier to, for example, identify problematic biases before they propagate in the machine learning lifecycle," Mitchell told VentureBeat via email. "[We're also] developing libraries for training and evaluation that enable developers to incorporate state of the art advances in creating 'fair' models ... or selecting instances that meet diversity criteria."

BigScience plans to conclude its work in May, when the project's members are scheduled to present at a workshop during the Association for Computational Linguistics 2022 conference in Dublin. By then, the goal is to have a very large LLM that at least meets -- and ideally exceeds -- the performance of other top-performing LLMs.

"Going forward, we plan to grow the [Hugging Face] team substantially and will also be looking to hire research interns, residents, and fellows across the world," Kiela said.

Impact

In the enterprise, BigScience's work could spur a new wave of AI-powered products from organizations that didn't previously have the means to leverage LLMs. Language models have become a key tool in industries like health care and financial services, where they're used to process patents, derive insights from scientific papers, recommend news articles, and more. But increasingly, smaller organizations have been left out of the cutting-edge advancements.

In a 2021 survey by John Snow Labs and Gradient Flow, companies cited accuracy as the most important requirement when evaluating a language model, followed by production readiness and scalability. Costs, maintenance, and data sharing were pegged as outstanding challenges.

With any luck, BigScience, too, will solve some of the biggest and most troubling issues with LLMs today, like their tendency to -- even when "detoxified" -- spout falsehoods and exhibit biases against religions, sexes, races, and people with disabilities. In a recent paper, scientists at Cornell wrote that "propaganda-as-a-service" may be on the horizon if large language models are abused.

For all their potential to harm, LLMs still struggle with the basics, often breaking semantic rules and endlessly repeating themselves. For example, models frequently change the subject of a conversation without a segue or answer questions with contradictions. LLMs also poorly understand nuanced subjects like morality, history, and law. And they inadvertently expose personal information in the public datasets on which they were trained.

"With the [Hugging Face] research team, we want to find the right balance between bottom-up research (as in Meta's research division) and top-down research (as in DeepMind and OpenAI). In the former case, you get unnecessary friction, competition, and resource scarcity; in the latter case, you inhibit researchers’ freedom and creativity," Kiela continued. "Our people come from established places like Google, Meta, and academia, so we’re at a perfect crossroads to try to create a new type of environment for facilitating groundbreaking research, building on what has and importantly what we think has not been working at these older labs."