The landscape for generative AI for code generation got a bit more crowded today with the launch of the new StarCoder large language model (LLM).

StarCoder is part of the BigCode Project, a joint effort of ServiceNow and Hugging Face. BigCode was originally announced in September 2022 as an effort to build out an open community around code generation tools for AI. The StarCoder LLM is a 15 billion parameter model that has been trained on source code that was permissively licensed and available on GitHub.

The model has been trained on more than 80 programming languages, although it has a particular strength with the popular Python programming language that is widely used for data science and machine learning (ML).

Market heating up

The effort to build an open generative AI code generation tool brings new competition to OpenAI’s Codex, which powers the GitHub Copilot service, as well as to efforts from other vendors, including Amazon’s CodeWhisperer tool. Both the OpenAI and Amazon tools are proprietary, whereas StarCoder is being made available under an Open Responsible AI License (OpenRAIL).

“There are powerful code models out there, but they are all closed source, nobody knows exactly how to train them,” Leandro von Werra, ML engineer at Hugging Face and co‑lead of BigCode, told VentureBeat. 

Von Werra added that the idea behind BigCode and StarCoder is to build powerful code generation models in the open. While the effort is led by Hugging Face and ServiceNow, he emphasized that an active community of approximately 600 people is contributing to the project’s success.

BigCode is the spiritual successor of BigScience

The BigCode effort isn’t the first time that Hugging Face has helped to build a community to open up AI development.

Von Werra called BigCode the ‘spiritual successor’ of the BigScience effort, which got started in 2021. In 2022, the BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) was released, providing a multi-language text generation model intended to be an open alternative to OpenAI’s GPT-3.

BigCode has taken a few iterative steps on the path toward the release of StarCoder. In October 2022, the project announced “The Stack,” a collection of permissively licensed code gathered from GitHub as a training data set for LLM code generation. In December 2022, BigCode released its first ‘gift’ with SantaCoder, a precursor model to StarCoder trained on a smaller subset of data and limited to the Python, Java and JavaScript programming languages.

With StarCoder, the project is providing a fully featured code generation tool that spans more than 80 programming languages. Harm de Vries, lead of the LLM lab at ServiceNow Research and co‑lead of BigCode, explained to VentureBeat that StarCoder can be used in a variety of scenarios. For example, he demonstrated how StarCoder can be used as a coding assistant, providing direction on how to modify existing code or create new code.

The StarCoder LLM can run on its own as a text-to-code generation tool, and it can also be integrated via a plugin into popular development tools, including Microsoft VS Code. Von Werra noted that StarCoder can also understand and make code changes. For example, a user can enter a text prompt such as ‘I want to fix the bug in this function’ and the LLM will do just that.
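For readers who want a rough sense of what prompting a code model looks like in practice, here is a minimal sketch in Python. It is not taken from the BigCode project. It assumes the Hugging Face transformers library is installed and that the model is published on the Hugging Face Hub under the identifier bigcode/starcoder. The prompt layout and the helper names build_prompt and generate are hypothetical, chosen only for illustration.

```python
def build_prompt(instruction: str, code: str) -> str:
    """Combine a plain-language request with the code it refers to.

    The comment-style layout is an assumption made for this sketch,
    not a format documented by the BigCode project.
    """
    return f"# Request: {instruction}\n{code.rstrip()}\n# Revised version:\n"


def generate(prompt: str, model_id: str = "bigcode/starcoder",
             max_new_tokens: int = 64) -> str:
    """Send the prompt to a text-generation pipeline and return its output.

    The model identifier is an assumption; a 15 billion parameter model
    also needs substantial memory, and access may require accepting the
    model's license first.
    """
    # Imported lazily so build_prompt works without the library installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]


if __name__ == "__main__":
    buggy = "def add(a, b):\n    return a - b\n"
    # Only the prompt is printed here; calling generate() downloads the model.
    print(build_prompt("I want to fix the bug in this function", buggy))
```

Printing the prompt costs nothing. Calling generate() with it would download the model weights, so it is best tried on a machine with enough memory or with a smaller code model substituted for the default identifier.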

Why explainable AI needs an open license

A critical aspect of StarCoder and the BigCode effort in general is that the technologies are all available under an open license.

A key challenge for organizations deploying AI today is the need for explainable AI, where it is possible to understand how and why a model made certain choices and decisions. A related challenge is the need to ensure that AI is used responsibly and doesn’t cause harm to people via toxic content or malware. To help solve those thorny issues, BigCode is using OpenRAIL licenses and, for StarCoder in particular, the Code Open RAIL‑M license.

“We know these models are very powerful and we want to make sure that they’re used for good use cases and not for use cases which will have bad implications,” said De Vries.

The Code Open RAIL‑M license allows users to see the code inside the model, with restrictions intended to prevent it from being misused, for example to generate ransomware or mount a social engineering attack.

“It’s completely open like an open source license,” said De Vries. “It just comes with the restrictions that make sure we stick to our responsible AI principles.”

