GitHub today announced that it’s releasing activity data for 2.8 million open source code repositories and making it available for people to analyze with the Google BigQuery cloud-based data warehousing tool.
The data set is free to explore. (With BigQuery you get to process up to one terabyte each month free of charge.)
This new 3TB data set includes information on “more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions,” Arfon Smith, program manager for open source data at GitHub, wrote in a blog post.
To help people get going, Smith has put together some starter queries, and Felipe Hoffa, a Google developer advocate who focuses on BigQuery, has written up tips for working with the data sets in a Medium post.
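For readers who want a feel for what such a query might look like, here is a minimal sketch in Python, not one of Smith's or Hoffa's actual queries. It assumes the data is exposed as the public table `bigquery-public-data.github_repos.sample_contents` with `id` and `content` columns (check the names in the BigQuery console), that the `google-cloud-bigquery` package is installed, and that Google Cloud credentials are configured. The helper names `share_of_free_tier` and `estimate_scan_bytes` are made up for illustration. The dry run asks BigQuery how many bytes a regular-expression search would scan without running it, so the figure can be compared with the monthly free allowance.

```python
# One terabyte per month of free processing, taken here as 10**12 bytes.
FREE_TIER_BYTES = 10**12

# Assumed table and column names; verify them before relying on this query.
SQL = """
SELECT id
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE REGEXP_CONTAINS(content, @pattern)
LIMIT 10
"""


def share_of_free_tier(bytes_scanned: int) -> float:
    """Return the scanned bytes as a fraction of the monthly free allowance."""
    return bytes_scanned / FREE_TIER_BYTES


def estimate_scan_bytes(pattern: str) -> int:
    """Dry-run the regex search and return the bytes it would process.

    Needs network access and Google Cloud credentials; nothing is billed
    for a dry run.
    """
    # Imported lazily so the rest of the module works without the package.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(
        dry_run=True,
        use_query_cache=False,
        # Passing the regex as a parameter avoids quoting problems in SQL.
        query_parameters=[
            bigquery.ScalarQueryParameter("pattern", "STRING", pattern)
        ],
    )
    job = client.query(SQL, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    scanned = estimate_scan_bytes(r"TODO|FIXME")
    print(f"Would scan {scanned:,} bytes, "
          f"{share_of_free_tier(scanned):.1%} of the free monthly terabyte")
```

By the same arithmetic, a single query that read all 3TB of the data set would use about three months' worth of the free allowance, which is why estimating a query's size first is worthwhile.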
The data set could be useful to anyone who wants to get a sense of trends in open source software use on GitHub, and it’s simpler than tinkering with the GitHub application programming interface (API). To be sure, GitHub, with more than 15 million users, isn’t the only place where open source software lives on the Internet (see also GitLab), but it is a very popular one, perhaps the most popular.
GitHub will update the data set every week, a spokesperson told VentureBeat in an email.