AI Weekly: Researchers attempt an open source alternative to GitHub's Copilot


In June, OpenAI teamed up with GitHub to launch Copilot, a service that provides suggestions for whole lines of code inside development environments like Microsoft's Visual Studio Code. Powered by an AI model called Codex -- which OpenAI later exposed through an API -- Copilot can translate natural language into code across more than a dozen programming languages, interpreting commands in plain English and executing them.

Now, a community effort is underway to create an open source, freely available alternative to Copilot and OpenAI's Codex model. Dubbed GPT Code Clippy, its contributors hope to create an AI pair programmer that allows researchers to study large AI models trained on code to better understand their abilities -- and limitations.

Open source models

Codex is trained on billions of lines of public code and works with a broad set of frameworks and languages, adapting to the edits developers make to match their coding styles. Similarly, GPT Code Clippy learned from hundreds of millions of examples of codebases to generate code similar to how a human programmer might.

The GPT Code Clippy project contributors used GPT-Neo as the base of their AI models. Developed by grassroots research collective EleutherAI, GPT-Neo is what’s known as a Transformer model. This means it weighs the influence of different parts of input data rather than treating all the input data the same. Transformers don’t need to process the beginning of a sentence before the end. Instead, they identify the context that confers meaning on a word in the sentence, enabling them to process input data in parallel.
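To make the "weighing" idea concrete, here is a minimal sketch of scaled dot-product self-attention, the mechanism at the core of Transformer models. It is an illustration only, not GPT-Neo's actual code: the function names and the toy two-dimensional "token" vectors are made up, and the sketch assumes the inputs serve directly as queries, keys, and values, whereas a real Transformer first passes them through learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Return one context-mixed vector per input position.

    Assumption for illustration: each input vector acts as its own
    query, key, and value (no learned projection matrices).
    """
    dim = len(vectors[0])
    outputs = []
    # Each position is computed independently of the others, which is
    # why the work can be done in parallel rather than left to right.
    for query in vectors:
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)  # how much each position matters
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs

# Three toy "token" vectors standing in for words in a sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = self_attention(tokens)
```

Each output vector is a weighted blend of every input vector, with larger weights going to the inputs most similar to the position being computed.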

GPT-Neo was "pretrained" on The Pile, an 835GB collection of 22 smaller datasets including academic sources (e.g., arXiv, PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub), and more. Through fine-tuning, the GPT Code Clippy contributors enhanced its code understanding capabilities by exposing their models to repositories on GitHub that met certain search criteria (e.g., had more than 10 GitHub stars and two commits), filtered for duplicate files.
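The kind of filtering described above might look something like the following sketch. It is not the contributors' actual pipeline: the repository records, field names, and helper functions are hypothetical, the commit threshold is read as "at least two," and duplicates are assumed to mean files with byte-identical contents.

```python
import hashlib

# Thresholds borrowed from the example criteria in the text; reading
# "two commits" as "at least two" is an assumption.
MIN_STARS = 10
MIN_COMMITS = 2

def meets_criteria(repo):
    """Keep repos with more than MIN_STARS stars and at least MIN_COMMITS commits."""
    return repo["stars"] > MIN_STARS and repo["commits"] >= MIN_COMMITS

def deduplicate(files):
    """Drop files whose contents exactly match an earlier file (by SHA-256)."""
    seen = set()
    unique = []
    for path, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, text))
    return unique

# Hypothetical repository metadata standing in for scraped GitHub data.
repos = [
    {"name": "a/popular", "stars": 250, "commits": 40,
     "files": [("main.py", "print('hi')\n")]},
    {"name": "b/fork", "stars": 12, "commits": 3,
     "files": [("copy.py", "print('hi')\n"), ("util.py", "x = 1\n")]},
    {"name": "c/tiny", "stars": 3, "commits": 1,
     "files": [("solo.py", "y = 2\n")]},
]

kept = [repo for repo in repos if meets_criteria(repo)]
corpus = deduplicate([f for repo in kept for f in repo["files"]])
```

In this toy run, the low-star repository is excluded and the copied file is dropped, leaving two unique files for training. Exact hashing only catches identical copies; near-duplicates would need a fuzzier comparison.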

"We used Hugging Face's Transformers library ... to fine-tune our model[s] on various code datasets including one of our own, which we scraped from GitHub," the contributors explain on the GPT Code Clippy project page. "We decided to fine-tune rather than train from scratch since in OpenAI's GPT-Codex paper, they report that training from scratch and fine-tuning the model [result in equivalent] performance. However, fine-tuning allowed the model[s] to converge faster than training from scratch. Therefore, all of the versions of our models are fine-tuned."

The GPT Code Clippy contributors have trained several models to date using third-generation tensor processing units (TPUs), Google's custom AI accelerator chip available through Google Cloud. While it's early days, they've created a plugin for Visual Studio Code, and plan to expand the capabilities of GPT Code Clippy to other languages -- particularly underrepresented ones.

"Our ultimate aim is to not only develop an open-source version of Github's Copilot, but one which is of comparable performance and ease of use," the contributors wrote. "[We hope to eventually] devise ways to update version and updates to programming languages."

Promise and setbacks

AI-powered coding models aren't just valuable in writing code, but also when it comes to lower-hanging fruit like upgrading existing code. Migrating an existing codebase to a modern or more efficient language like Java or C++, for example, requires expertise in both the source and target languages -- and it's often costly. The Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java.

But there are many potential pitfalls, such as bias and undesirable code suggestions. In a recent paper, the Salesforce researchers behind CodeT5, a Codex-like system that can understand and generate code, acknowledge that the datasets used to train CodeT5 could encode stereotypes related to race and gender from the text comments -- or even from the source code. Moreover, they say, CodeT5 could contain sensitive information like personal addresses and identification numbers. And it might produce vulnerable code that negatively affects software.

OpenAI similarly found that Codex could suggest compromised packages, invoke functions insecurely, and produce programming solutions that appear correct but don’t actually perform the intended task. The model can also be prompted to generate racist and harmful outputs as code, like the words "terrorist" and "violent" when writing code comments with the prompt "Islam."

The GPT Code Clippy team hasn't said how it might mitigate bias that might be present in its open source models, but the challenges are clear. While the models could, for example, eventually reduce Q&A sessions and repetitive code review feedback, they could cause harm if not carefully audited -- particularly in light of research showing that coding models fall short of human accuracy.

For AI coverage, send news tips to Kyle Wiggers -- and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer