Large open source projects on GitHub have intimidatingly long lists of problems that require addressing. To make it easier to spot the most pressing ones, GitHub recently introduced the “good first issues” feature, which matches contributors with issues that are likely to fit their interests. The initial version, which launched in May 2019, surfaced recommendations based on labels applied to issues by project maintainers. But an updated release shipped last month incorporates an AI algorithm that GitHub claims surfaces issues in about 70% of repositories recommended to users.
GitHub notes that it’s the first deep-learning-enabled product to launch on GitHub.com.
According to GitHub senior machine learning engineer Tiferet Gazit, GitHub last year conducted an analysis and manual curation to create a list of 300 label names used by popular open source repositories. (All were synonyms for either “good first issue” or “documentation,” like “beginner friendly,” “easy bug fix,” and “low-hanging-fruit.”) But relying on these meant that only about 40% of the recommended repositories had issues that could be surfaced. Plus, it left project maintainers with the burden of triaging and labeling issues themselves.
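The label-based approach can be illustrated with a minimal Python sketch. It is not GitHub's code: the `CURATED_LABELS` mapping is a hypothetical five-entry excerpt standing in for the roughly 300-entry curated list, and the normalization rule (lowercasing, treating hyphens and underscores as spaces) is an assumption about how label variants might be matched.

```python
import re

# Hypothetical excerpt of a curated label list (the real list has about 300 names).
# Each normalized label maps to the category it is a synonym for.
CURATED_LABELS = {
    "good first issue": "good first issue",
    "beginner friendly": "good first issue",
    "easy bug fix": "good first issue",
    "low hanging fruit": "good first issue",
    "documentation": "documentation",
}


def normalize(label: str) -> str:
    """Lowercase a label and collapse hyphens, underscores, and whitespace to single spaces."""
    return re.sub(r"[-_\s]+", " ", label.strip().lower())


def match_category(labels):
    """Return the curated category matched by an issue's labels, or None.

    "good first issue" is preferred over "documentation" when both match.
    """
    categories = {CURATED_LABELS.get(normalize(label)) for label in labels}
    if "good first issue" in categories:
        return "good first issue"
    if "documentation" in categories:
        return "documentation"
    return None
```

An issue whose maintainers never applied one of the curated labels returns `None` here, which is the coverage gap the article describes.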
By contrast, the new AI recommender system is largely automatic. But building it required crafting an annotated training set of hundreds of thousands of samples.
GitHub began with issues that had any of the roughly 300 labels in the curated list, which it supplemented with a few sets of issues that were also likely to be beginner-friendly. (These included issues closed by a user who had never previously contributed to the repository, as well as issues closed by changes that touched only a few lines of code in a single file.) After detecting and removing near-duplicate issues, GitHub split the data into several training, validation, and test sets, separated by repository to prevent data leakage from similar content. It trained the AI system using only preprocessed and denoised issue titles and bodies, so that the model can detect good issues as soon as they’re opened.
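Separating data sets across repositories means every issue from a given repository lands in the same split, so the model is never evaluated on a project it saw during training. A minimal Python sketch of one way to do this follows. The hash-based assignment, the 80/10/10 proportions, and the `repo` field name are all assumptions made for illustration, not details GitHub has published.

```python
import hashlib


def assign_split(repo: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a repository name to 'train', 'validation', or 'test'.

    Hashing the repository name (an assumed strategy) guarantees that all
    issues from one repository receive the same split.
    """
    digest = hashlib.sha256(repo.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_by_repository(issues):
    """Group issue dicts (each with a hypothetical 'repo' key) into the three splits."""
    splits = {"train": [], "validation": [], "test": []}
    for issue in issues:
        splits[assign_split(issue["repo"])].append(issue)
    return splits
```

Splitting at random per issue would instead let near-identical issues from one project appear on both sides of the train/test boundary, which is the leakage this kind of split avoids.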
In production, each issue for which the AI algorithm predicts a probability above the required threshold is slated for recommendation, with a confidence score equal to its predicted probability. Open issues from non-archived public repositories that have at least one of the labels from the curated label list are given a confidence score based on the relevance of their labels, with synonyms of “good first issue” awarded higher confidence than synonyms of “documentation.” At the repository level, all detected issues are ranked primarily based on their confidence score (though label-based detections are generally given higher confidence than ML-based detections), along with a penalty on issue age.
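The scoring and ranking logic described above can be sketched in Python as follows. Every number in it is a placeholder: GitHub has not published its probability threshold, label confidence values, or the form of its age penalty, so `ML_THRESHOLD`, `LABEL_CONFIDENCE`, and `AGE_PENALTY_PER_DAY` are hypothetical, as is the `Issue` structure and the linear shape of the penalty.

```python
from dataclasses import dataclass
from typing import List, Optional

# All constants below are hypothetical placeholders, not GitHub's values.
ML_THRESHOLD = 0.5
LABEL_CONFIDENCE = {"good first issue": 1.0, "documentation": 0.9}
AGE_PENALTY_PER_DAY = 0.001


@dataclass
class Issue:
    title: str
    age_days: float
    label_category: Optional[str] = None    # set when a curated label matched
    ml_probability: Optional[float] = None  # set when the model scored the issue


def confidence(issue: Issue) -> Optional[float]:
    """Return a confidence score, or None if the issue should not be recommended."""
    # Label-based detection: confidence depends on which category the label matched.
    if issue.label_category in LABEL_CONFIDENCE:
        return LABEL_CONFIDENCE[issue.label_category]
    # ML-based detection: confidence equals the predicted probability above a threshold.
    if issue.ml_probability is not None and issue.ml_probability > ML_THRESHOLD:
        return issue.ml_probability
    return None


def rank_issues(issues: List[Issue]) -> List[Issue]:
    """Rank detected issues by confidence minus an assumed linear age penalty."""
    scored = []
    for issue in issues:
        score = confidence(issue)
        if score is None:
            continue
        scored.append((score - AGE_PENALTY_PER_DAY * issue.age_days, issue))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [issue for _, issue in scored]
```

With these placeholder values, a labeled issue usually outranks a model-detected one, but a very high predicted probability can still beat a "documentation" label, which is one reading of "generally given higher confidence."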
Data acquisition, training, and inference pipelines run daily, according to Gazit, using scheduled workflows to ensure the results remain “fresh” and “relevant.” In the future, GitHub intends to add better signals to its repository recommendations and a mechanism for maintainers and triagers to approve or remove AI-based recommendations in their repositories. And it plans to extend issue recommendations to offer personalized suggestions on next issues to tackle for anyone who has already made contributions to a project.