Researchers open-source benchmarks measuring quality of AI-generated code

The applications of computer programming are vast in scope. And as computers become ubiquitous, the demand for quality code draws an ever-growing number of aspiring programmers to the profession. After years of study to become proficient at coding, experts learn to convert abstract ideas into concrete, executable programs. But what if AI could do the same?

In recent years, large-scale AI language models have shown promise in generalizing to tasks including writing code, implying that humans' work may one day be supplemented by AI systems. But while some studies show that language models can translate code and fix compilation issues, there's been little work on rigorously testing the coding ability of models given general coding problems.

That's why a team of researchers at the University of California at Berkeley, Cornell, the University of Chicago, and the University of Illinois at Urbana-Champaign created APPS, a benchmark for code generation from natural language specifications. Unlike prior work on code generation, which mostly focuses on code translation and pseudocode-to-code, the researchers tested models on their ability to take specifications and write code that meets these specifications.

Their work comes on the heels of the release of IBM's Project CodeNet, one of the largest open source datasets for benchmarking AI for code. But CodeNet centers on the problems of code translation, code similarity, and code constraints. APPS is broader in scope, evaluating models not only on their ability to understand coding syntax but on their ability to comprehend task descriptions and create algorithms to solve these tasks.

"APPS enables robust evaluation of models along several dimensions, providing a precise and comprehensive view of code generation ability," the coauthors wrote in a paper detailing their work. "If a model were to perform well on APPS, this would indicate an ability to flexibly use data structures and programming techniques, as well as an ability to correctly interpret diverse task specifications, follow instructions, and understand human intent."

APPS contains 10,000 programming problems in Python ranging in difficulty from introductory to coding competition challenges, as well as a bank of over 130,000 test cases and more than 230,000 human-written solutions for evaluation. The test cases were chosen to create a gold-standard metric for model performance, including correct functionality across edge cases. And most were taken from open access coding websites including Codeforces and Kattis.

The introductory problems in APPS, which include counting the number of appearances of a substring and determining whether a string is a palindrome, can be solved by programmers with one to two years of experience without requiring complicated algorithms. The intermediate, interview-level problems are more difficult and sit at the level of questions asked in typical technical interviews. As for the competition-level problems, they're even more challenging and representative of those in high school and collegiate programming competitions like the USA Computing Olympiad (USACO).
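To give a sense of the introductory tier, here is a minimal Python sketch of the two tasks mentioned above. The function names and the choice to count overlapping matches are assumptions made for illustration; they are not taken from the benchmark's own problem statements or solutions.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count how many times pattern appears in text.

    Assumption: overlapping matches are counted, so "aa" appears
    three times in "aaaa". An empty pattern is treated as zero matches.
    """
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one character later so overlapping matches are found.
        start = text.find(pattern, start + 1)
    return count


def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards."""
    return text == text[::-1]


if __name__ == "__main__":
    print(count_occurrences("banana", "ana"))
    print(is_palindrome("level"))
```

Both tasks need only basic string handling, which is consistent with the article's description of this tier as solvable without advanced algorithmic knowledge.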

Results

The researchers tested several types of models on APPS, including OpenAI's GPT-2, GPT-3, and an open source version of GPT-3 called GPT-Neo. In experiments, they discovered that the models could learn to generate code that solves easier problems but not without syntax errors. Approximately 59% of GPT-3's solutions for introductory problems had syntax errors, while GPT-Neo averaged 3%. Moreover, the best-performing model, GPT-Neo, attained only 10.15% accuracy measured as the average share of test cases passed and 1.12% strict accuracy (passing every test case, including edge cases) across introductory-, interview-, and competition-level problems, indicating that there's substantial room for improvement.
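The gap between the two accuracy figures is easier to see with a small worked example. The Python sketch below assumes that the looser metric is the fraction of test cases passed, averaged over problems, and that strict accuracy is the share of problems whose generated program passes every test case. The function names and the sample pass/fail data are hypothetical and are not drawn from the researchers' evaluation code or results.

```python
from typing import List


def average_test_cases_passed(results: List[List[bool]]) -> float:
    """Mean, over problems, of the fraction of test cases passed.

    Assumes every problem has at least one test case.
    """
    return sum(sum(r) / len(r) for r in results) / len(results)


def strict_accuracy(results: List[List[bool]]) -> float:
    """Fraction of problems for which every test case passed."""
    return sum(all(r) for r in results) / len(results)


# Hypothetical pass/fail outcomes: one inner list per problem.
sample_results = [
    [True, True, True],           # solved outright
    [True, False, False, False],  # passes the easy case, fails the rest
    [False, False],               # fails everything
    [True, True],                 # solved outright
]

if __name__ == "__main__":
    print(average_test_cases_passed(sample_results))  # 0.5625
    print(strict_accuracy(sample_results))            # 0.5
```

A program that handles common inputs but misses an edge case still earns partial credit under the first metric and none under the second, which is why strict accuracy comes out so much lower.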

"These results position code generation as a challenging but tractable testbed for large-scale language models ... Writing code to meet specifications in natural language is an economically valuable task with widespread social implications should it be solved, in that it could eventually facilitate malicious code generation and one day result in job automation. As large-scale language models have the potential to make significant progress on code generation, it is essential that we begin to track advancements on this task," the researchers wrote.

Several efforts are underway to create viable AI-powered coding tools, including Intel's ControlFlag, which can autonomously detect errors in code. Codota is developing a platform that suggests and autocompletes scripts in Python, C, HTML, Java, Scala, Kotlin, and JavaScript. Ponicode taps AI to check the accuracy of code, and DeepCode offers a machine learning-powered system for whole-app code reviews (as does Amazon). Perhaps one of the most impressive projects to date is TransCoder, an AI system Facebook researchers developed that converts code from one programming language into another. Another contender is a model from OpenAI that was trained on GitHub repositories to generate entire functions from English-language comments.

According to a study from the University of Cambridge's Judge Business School, programmers spend 50.1% of their work time not programming; half of the rest of their time is spent debugging. And the total estimated cost of debugging is $312 billion per year. AI-powered code suggestion and review tools, then, promise to cut development costs substantially while enabling coders to focus on more creative, less repetitive tasks.