How Hugging Face and ServiceNow tackle code-generating LLM challenges

A little over a year ago, using large language models (LLMs) to generate software code was a cutting-edge scientific experiment that had yet to prove its worth. Code generation has since become one of the most successful applications of LLMs, but significant pain points remain. BigCode, a project launched recently by Hugging Face and ServiceNow, aims to address some of the biggest ones.

Today, many developers are using LLM-powered tools such as GitHub Copilot to improve productivity, stay in the flow and make their work more enjoyable. However, as LLM-powered coding matures, we’re also beginning to discover the challenges it must overcome, including licensing, transparency, security and control.

The Stack, a dataset of source code recently released by the BigCode project, addresses some of these pain points. It also highlights some of the known barriers that remain to be resolved as artificial intelligence (AI)-powered code generation continues to move into the mainstream.

LLMs and code licensing

“The recent introduction of code LLMs has shown that they can make developers more productive and make software engineering accessible to people with less technical backgrounds,” Leandro von Werra, machine learning engineer at Hugging Face, told VentureBeat.

These language models can serve a variety of tasks. Programmers are using tools such as Copilot and Codex to write entire classes and functions from textual descriptions. This can be very useful for automating mundane parts of programming, such as setting up web servers, pulling information from databases or even writing Python code for a neural network and its training loop. According to von Werra, in the future, software engineers will be able to use LLMs to maintain legacy code written in an unfamiliar programming language.

However, the growing use of LLMs in coding has raised several concerns, including licensing issues. Models like Copilot generate code based on patterns they have learned from their training examples, some of which might be subject to restrictive licenses.

“Questions have been raised as to whether these AI models respect current open-source licenses—both for model training and generation—and what the social impact of this technology is on the open-source software community,” von Werra said.

Even when open-source licenses legally permit the use of a code repository, these licenses were developed before modern deep learning and the collection of large datasets for training models. Therefore, the developers might not have intended for their code to be used in training language models.

“The issues of consent and intent around the use of peoples’ code to train deep neural networks are not addressed in current open-source licenses; the community still has to develop norms around how to responsibly develop this technology, respecting the wishes of developers for modern use of their content,” von Werra said.

Hugging Face and ServiceNow release collaborative project

The BigCode project is a collaboration between Hugging Face and ServiceNow, announced in September. The Stack, which was released on October 27, comprises 3 TB of “permissively licensed source code” obtained from GitHub, assembled for training large language models for code generation.

Permissive licenses are those that place the fewest restrictions on copying, modifying and redistributing code; they include the MIT and Apache 2.0 licenses. They do not include “copyleft” licenses such as GPL, which require that the same rights be preserved in code derived from the original repository. Whether models trained on copyleft-licensed code count as derivative works is still disputed.

Limiting the dataset to permissively licensed code is meant to ensure that it can be used across a wide range of applications.

“The goal of The Stack is to enable researchers from academia and industries to collaborate on research and development of large language models for code applications by releasing a dataset that can be shared, investigated and used to pretrain new systems,” von Werra said.

BigCode is also taking measures to give developers more control over their code. Developers can explicitly opt out of having their repositories included in The Stack and used to train LLMs, regardless of the license they initially chose.

“In order to honor these opt-out requests, developers that wish to opt-out can submit a request and, once validated, their code will be removed from future versions of The Stack,” von Werra said.

Promoting openness in code LLMs

One of the challenges facing researchers working on code LLMs is the lack of openness and transparency around the development of these systems. The developers of models such as AlphaCode, CodeParrot and CodeGen have described only the high-level data collection process and have not released the training data.

“It is difficult for other researchers to fully reproduce these models and understand what kind of pretraining data leads to high-performing code LLMs,” von Werra said. “By releasing an open large-scale code dataset, we hope to make training of code LLMs more reproducible.”

In addition to providing an unprecedented 3 TB of curated source code, the BigCode team has provided a detailed breakdown of how the code was obtained and filtered. The dataset was gathered over several months. The team downloaded 137.36 million publicly available GitHub repositories. It then filtered the dataset to exclude repositories that did not have permissive licenses. Finally, it went through a deduplication process to remove files that were exact or near duplicates of others.
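The filter-then-deduplicate steps described above can be illustrated with a minimal Python sketch. It is not the BigCode team's pipeline: the `SourceFile` record, the two-entry license allow-list, the whitespace normalization, the token-shingle Jaccard measure and the 0.85 threshold are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass

# Assumed allow-list for illustration; the article names only MIT and Apache 2.0.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0"}


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical record for one file pulled from a repository."""
    repo: str
    path: str
    license_id: str  # assumed to hold an SPDX-style license identifier
    content: str


def content_key(text):
    """Hash the content after collapsing whitespace runs (an assumed normalization)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Return the set of k-token windows used to compare two files."""
    tokens = text.split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_and_deduplicate(files, near_dup_threshold=0.85):
    """Keep permissively licensed files, dropping exact and near duplicates."""
    kept = []
    seen_hashes = set()
    kept_shingles = []
    for f in files:
        # Step 1: drop anything without a permissive license.
        if f.license_id not in PERMISSIVE_LICENSES:
            continue
        # Step 2: drop exact duplicates (after whitespace normalization).
        key = content_key(f.content)
        if key in seen_hashes:
            continue
        # Step 3: drop near duplicates of any file already kept.
        sh = shingles(f.content)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue
        seen_hashes.add(key)
        kept_shingles.append(sh)
        kept.append(f)
    return kept
```

The pairwise comparison in step 3 grows quadratically with the number of files, so it is only workable for small samples. At the scale of millions of repositories, an approximate technique such as MinHash with locality-sensitive hashing would be the more realistic choice.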

“An open dataset benefits from external scrutiny, with BigCode providing a way for other researchers and developers to report issues directly to the team managing the dataset,” von Werra said.

Hugging Face and ServiceNow tackle remaining challenges

Licensing is not the only challenge that code LLMs face. The engineers of the models and the curators of the datasets must also deal with other problems, such as sensitive information in the training data, including usernames, passwords and security tokens.

Another concern is insecure code. Since LLMs are trained on source code curated from public sources, there is worry that the training dataset might include insecure code. Alternatively, malicious actors can poison the training data by intentionally spreading insecure code in open repositories. The LLM will then learn the insecure coding patterns and replicate them in response to developer prompts.

The open-source format of The Stack will allow security researchers to scrutinize the dataset for insecure code. Additionally, the BigCode team has implemented update mechanisms that take advantage of new information, such as the disclosure of vulnerabilities, and evolving best practices to limit the spread of malicious code in The Stack. The team is also developing methods to filter personally identifiable information (PII).
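To make the idea of filtering sensitive information concrete, here is a small Python sketch of pattern-based redaction. It does not represent the methods the BigCode team is developing. The two regular expressions, the keyword list and the `<REDACTED>` and `<EMAIL>` placeholders are assumptions for illustration, and patterns this simple will both miss real secrets and flag harmless strings.

```python
import re

# Illustrative patterns only (assumptions, not the BigCode team's filters).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

ASSIGNED_SECRET_RE = re.compile(
    r"""
    (\w*(?:password|passwd|secret|token|api_key)\w*)  # name that looks sensitive
    (\s*[:=]\s*)                                      # assignment or mapping separator
    (["'])                                            # opening quote
    (.+?)                                             # the literal value
    \3                                                # matching closing quote
    """,
    re.IGNORECASE | re.VERBOSE,
)


def redact(text):
    """Replace quoted values assigned to sensitive-looking names, then email addresses."""
    text = ASSIGNED_SECRET_RE.sub(r"\1\2\3<REDACTED>\3", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    return text


if __name__ == "__main__":
    sample = 'db_password = "hunter2"\nauthor = "jane@example.com"\n'
    print(redact(sample))
```

Replacing a value with a placeholder, rather than deleting the whole line, keeps the surrounding code syntactically intact. That is a design choice in this sketch, not something the article attributes to The Stack.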

Moreover, the team is working on a special license dedicated to code LLMs named OpenRAIL (Responsible AI License).

“The OpenRAIL license is an open-source license similar to Apache 2.0, but also includes provisions to prohibit certain use cases that could, for example, exclude the generation of malware,” von Werra said. “In addition, we are also working on a tool to search generated code inside The Stack for correct license attribution.”

The future of code LLMs

LLMs will be able to both extend the abilities of professional software engineers and enable non-technical people to build new software. But that will only happen if the community can establish a new set of sustainable rules and best practices around licensing and attribution, von Werra warned. He also believes that automation does not mean that human skills will become less relevant in coding.

“There will need to be a lot more internal governance in place at organizations that adopt the technology,” von Werra said. “The role of the human-in-the-loop in the AI value chain will become more important to ensure that generated code is fit for purpose and compliant with corporate policy and broader AI regulation.”