PolyCoder is an open source AI code-generator that researchers claim trumps Codex

Code generation AI -- AI systems that can write in different programming languages given a prompt -- promises to cut development costs while allowing coders to focus on creative, less repetitive tasks. But while research labs like OpenAI and Alphabet-backed DeepMind have developed powerful code-generating AI, many of the most capable systems aren't available in open source. For example, the training data for OpenAI's Codex, which powers GitHub's Copilot feature, hasn't been made publicly available, preventing researchers from fine-tuning the AI model or studying aspects of it such as interpretability.

To remedy this, researchers at Carnegie Mellon University -- Frank Xu, Uri Alon, Graham Neubig, and Vincent Hellendoorn -- developed PolyCoder, a model based on OpenAI's GPT-2 language model that was trained on a database of 249GB of code across 12 programming languages. While PolyCoder doesn't match the performance of top code generators in every task, the researchers claim that PolyCoder is able to write in C with greater accuracy than all known models, including Codex.

"When GitHub’s Copilot came out last summer, it became clear that these very large language models of code can be very useful for helping developers and increasing their productivity. But no models even close to that scale were publicly available," the researchers told VentureBeat via email. "So [PolyCoder] started with Vincent just trying to see what the biggest model was that could be trained on our lab server, which ended up being 2.7 billion parameters ... and that model was a league ahead of other code-oriented models that were publicly available at the time."

In machine learning, parameters are the parts of a model that are learned from historical training data. Generally speaking, the correlation between the number of parameters and a model's sophistication has held up remarkably well.
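
To make the idea of a parameter count concrete, here is a minimal sketch in Python that tallies the learned values (weights and biases) in a toy fully connected network. The function name and the layer sizes are made up for illustration; they are assumptions and do not describe PolyCoder's GPT-2-based architecture.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a toy fully connected network.

    layer_sizes is a hypothetical list of layer widths, e.g. [4, 8, 2].
    """
    total = 0
    # Each pair of adjacent layers contributes a weight matrix and a bias vector.
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input/output connection
        total += n_out         # one bias per output unit
    return total


# Illustrative sizes only: 4 inputs, 8 hidden units, 2 outputs.
print(count_parameters([4, 8, 2]))  # prints 58
```

A model with 2.7 billion parameters is the same idea at a vastly larger scale: every one of those values is adjusted during training.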

Investigating code generation

A growing number of organizations are exploring code-generating AI. During its Build developer conference in May 2021, Microsoft detailed a new feature in Power Apps that taps OpenAI’s GPT-3 language model to assist people in choosing formulas. Intel’s ControlFlag can autonomously detect errors in code. And Facebook’s TransCoder converts code from one programming language into another.

DeepMind more recently announced AlphaCode, which the lab claims is among the first code generation systems competitive with human programmers. In programming competitions hosted on Codeforces, a platform for programming contests, DeepMind says that AlphaCode achieved an average ranking within the top 54.3% across recent contests with more than 5,000 participants.

But the Carnegie Mellon researchers note that "nearly no one" outside of well-resourced companies can train models anywhere near the size of AlphaCode or Codex. A 2020 study from startup AI21 Labs pegged the cost of training a text-generating model with 1.5 billion parameters -- about half the size of PolyCoder -- at between $80,000 and $1.6 million. Copilot has 12 billion parameters.

"Large tech companies aren’t publicly releasing their models, which is really holding back scientific research and democratization of such large language models of code," the researchers said. "To some extent, we hope that our open-sourcing efforts will convince others to do the same. But the bigger picture is that the community should be able to train these models themselves. Our model pushed the limit of what you can train on a single server -- anything bigger requires a cluster of servers, which dramatically increases the cost."

Setbacks in code generation

In developing PolyCoder, the researchers also studied and compared the performance of different code-generating AI systems, including Codex (through an API). Interestingly, they found that models trained mostly on English text and only a little source code turned out to be very good at generating code -- perhaps because they got code-related insights from resources like the developer Q&A website Stack Overflow that were included in the 249GB database.

"A promising approach to building strong code-generating models seems to be to train on diverse sources of programming knowledge, including code in a broad mix of programming languages, but also text from around the web related to code," the researchers said.

The researchers express concern that models like PolyCoder could be prompted to generate buggy programs, including ones with hard-to-detect security vulnerabilities. In the future, they fear that adversaries could "hide" malicious behavior in code-generating models that only shows up given the right prompt, like a keyword (e.g., a company or product name), or upload vulnerable code likely to be picked up by legitimate code-generating models.

They suggest open-sourcing Codex-sized models as one way to combat this, which could enable security researchers to look for failure modes in these models. As a side benefit, open-sourcing would allow developers to personalize the models or "teach" them new programming languages through a process known as fine-tuning, which is less cost-intensive than training the models from scratch.

"While industry currently has much more computational resources, there is still a lot of room for innovation from academia and the research community, including building smaller and faster personalized models that don’t rely on an internet connection, useful applications such as detecting and repairing bugs, automatic code reviewing, and more. Those are tasks where the research community has built promising prototypes that could really benefit from the power of these kinds of very large language models," the researchers said. "Decentralized training, where multiple groups team up to train a large model jointly, could make a big difference here. Research grants and collaborations between companies and academia could also help."
