GitHub today announced that it’s releasing activity data for 2.8 million open source code repositories and making it available for people to analyze with the Google BigQuery cloud-based data warehousing tool.
The data set is free to explore. (With BigQuery you get to process up to one terabyte each month free of charge.)
This new 3TB data set includes information on “more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions,” Arfon Smith, program manager for open source data at GitHub, wrote in a blog post.
To get people started, Smith has put together some starter queries. Felipe Hoffa, a Google developer advocate who focuses on BigQuery, has put together some tips for working with the data sets in a Medium post.
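For a sense of what a first query might look like, here is a minimal Python sketch. It is not one of Smith's or Hoffa's examples. It assumes the data set is exposed as `bigquery-public-data.github_repos`, that it has a `licenses` table with a `license` column, that the `google-cloud-bigquery` client library is installed, and that credentials are configured. The helper names are hypothetical. The dry run asks BigQuery how many bytes a query would scan without running it, which is useful when the free allowance is one terabyte a month.

```python
# Monthly free allowance cited in the article (decimal terabyte assumed).
FREE_TIER_BYTES = 10**12

# Assumed table name; not stated in the article.
LICENSES_TABLE = "bigquery-public-data.github_repos.licenses"


def build_license_query(limit=10):
    """Build a standard SQL query counting repositories per license."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT license, COUNT(*) AS repo_count\n"
        f"FROM `{LICENSES_TABLE}`\n"
        "GROUP BY license\n"
        "ORDER BY repo_count DESC\n"
        f"LIMIT {limit}"
    )


def fraction_of_free_tier(bytes_processed):
    """Return the share of the monthly free allowance a scan would use."""
    return bytes_processed / FREE_TIER_BYTES


def estimate_bytes(sql):
    """Dry-run the query and return the bytes it would scan.

    Needs network access and Google Cloud credentials, so the import
    is kept local to this function.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

Calling `fraction_of_free_tier(estimate_bytes(build_license_query()))` would show how much of the monthly allowance the query consumes. By the same arithmetic, a single query that scanned the full 3TB data set would use three times the free quota.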
The data set could be useful to anyone who wants to get a sense of trends in open source software use on GitHub, and it’s simpler than tinkering with the GitHub application programming interface (API). To be sure, GitHub, with more than 15 million users, isn’t the only place where open source software lives on the Internet (see also GitLab), but it is a very popular one, perhaps the most popular.
GitHub will update the data set every week, a spokesperson told VentureBeat in an email.