GitHub wants AI to help developers code

GitHub is used by more than 30 million developers around the world and hosts repositories for some of the biggest ML-driven open source projects on the planet, but it is perhaps less well known for creating AI-driven tools to help those developers do their jobs. That's starting to change.

VentureBeat sat down with GitHub senior data scientist Omoju Miller to talk about how one of the biggest homes for developers online is performing applied machine learning research to create more AI-driven services.

At the GitHub Universe conference Tuesday, a number of major upgrades were made to GitHub and GitHub Enterprise services for businesses. Miller also spoke during the keynote address about Experiments, a new GitHub initiative to explore the use of AI and machine learning meant for developers.

The first Experiments prototype, named Semantic Code Search, launched last month.

This interview was edited for brevity and clarity.

VentureBeat: Is Experiments entirely AI focused or more of a place for experiments happening internally at GitHub to be shared with the community?

Miller: For right now it's probably mostly AI focused, as something that's happening within the platform.

Our first experiment is Semantic Code Search.

There are other prototypes we're going to be bringing to the platform. We haven't decided yet what things we want to work on. I mean, we're working on several, but which ones do we want to bring next? It's going to be a series of maybe like two, three, four of them a year. It's just published, applied research, like: this is what we're doing right now.

VentureBeat: GitHub is a unique organization with a lot of knowledge about tools for the developer community and their needs. What are some things you expect AI products coming out of GitHub to be able to provide to developers? What are unique services only GitHub can make?

Miller: Since we have a lot of open source, we have lots of code, there are so many things we can learn about how to write code more efficiently that we can bring back to the developer.

Another thing that we can do is to allow people to use each other's code better.

Right now a lot of the things we write are English-facing, so the documentation you see is in English, and developers are all over the world -- 80 percent of our users are from outside the United States. If we can use AI to help translate some of our documentation, it creates more accessibility to different kinds of code. So it's easy for me to consume code written in Python, but if all the documentation is written in Cantonese, then translating Cantonese to English means I can really use that code.

VentureBeat: Because it's the same [programming] language.

Miller: It's the same language; however, what's the intent? What are the limitations? Like, if it's something new you've never seen before, you can read the code, but it's a lot faster if you read the documentation to know all the things you can use it for. And even as you're reading the code, sometimes you're like: Why did they do this? There's a comment to the code, but the comment is written in a foreign language. Just translating those comments makes it a lot easier. That's something that GitHub is uniquely positioned to do.

VentureBeat: Well, Semantic Search is the first one. Can you tell me a little bit more about that? I know you went into it onstage.

Miller: Our semantic search is actually entirely open sourced at experiments.github.com. It's a sequence-to-sequence model that translates from natural language to code using mostly docstrings, but it's basically an embedding space where we map natural language to code. But the entire thing is actually available, and you can go through it line by line by line.
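The embedding idea Miller describes can be illustrated with a toy example. The sketch below is not GitHub's implementation: the three-dimensional vectors are made up by hand, and the names CODE_EMBEDDINGS, cosine_similarity and search are hypothetical. It only shows the general technique of ranking code snippets by how close their vectors sit to a query vector in a shared space.

```python
import math

# Hypothetical, hand-made 3-dimensional embeddings. A real system would learn
# these vectors with a trained model; the numbers here are invented.
CODE_EMBEDDINGS = {
    "def read_json(path): ...": [0.9, 0.1, 0.0],
    "def send_email(to, body): ...": [0.0, 0.9, 0.2],
    "def resize_image(img, w, h): ...": [0.1, 0.0, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, top_k=1):
    """Return the top_k snippets whose embeddings are closest to the query."""
    ranked = sorted(
        CODE_EMBEDDINGS.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [snippet for snippet, _ in ranked[:top_k]]


# Pretend this vector is the embedding of the query "load a JSON file".
query = [0.8, 0.2, 0.1]
print(search(query))
```

In a real system both the query and the code would be embedded by a trained model; here the query vector is simply chosen to sit near the JSON-reading snippet.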

VentureBeat: It sounds like you want to spend some time listening to the signals and feedback you're receiving or feedback from community for some of these experiments. What else can you tell me about the vision for how AI will be used on GitHub?

Miller: So there's a reason why machine learning is embedded in the platform team. It's because we see GitHub as a platform, and we want to bring these AI-enabled capabilities to that platform because we interact on so many levels. We interact on code, we interact on issues, we interact in pull requests, we interact with projects, there's diffs and all these things -- all that data is what we want to bring to you, and so we want to create this search experience that goes on multiple levels, because then you can bring something to the capabilities of the platform.

They could just do similarity [search], like "Can you find me a piece of code that is similar to this piece of code?" For example, I write Python, and perhaps there is a Java library somewhere that I need to engage with but I don't know Java very well, so instead of me going to sit down and learn Java, if I can just be like "Here's Python code, can you" -- using our API; this is ultimately in the future, we haven't put this on the platform -- "find me similar code that does the same thing in this language?" Those are the kinds of things, because once we have that whole graph mapped out, these are the kinds of things that you can do.

You don't even necessarily have to do translation from language to language. We could just find similarity: "Oh, this is how you do the same thing in Python and Java and Ruby and this." That's just one example.

Basically what we're doing is bringing primitives and serving the primitives very much like the same approach of Actions: What's the primitive, and then it's up to users to do whatever they want with it. I can't even imagine all the things that people are going to build with it, but I can just think of a few use cases that I would automatically just use. My immediate one would just be translation.

VentureBeat: I'm starting to think of some popular AI services rolling out elsewhere, and for some reason the Gmail experience where it completes your sentences comes to mind. Obviously there's a lot that can go into writing a single line of code, but some instances seem like they could be predicted. Could you see a point where in GitHub there would be some sort of predictive elements, deeper tie into code?

Miller: Yes, absolutely. At a sentence-to-sentence level like line-to-line, yes absolutely. Like there's some things we do that are just so repetitive, and so we understand that primitive. There's no reason why you need to literally finish this. It's a for loop. Once you start typing the for loop, we know it's a for loop. If you just hit tab and the rest of the for loop is there, then you fill in the part that you need.
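The tab-to-complete behavior Miller describes can be sketched in a few lines. This is only an illustration under stated assumptions: the SNIPPETS table and the complete function are hypothetical, and a fixed lookup table stands in for whatever learned model a real predictive tool would use.

```python
# Hypothetical snippet table keyed by the keyword that opens a statement.
# A learned model would predict completions instead of using a fixed table.
SNIPPETS = {
    "for": "for item in items:\n    pass",
    "if": "if condition:\n    pass",
    "def": "def name(args):\n    pass",
}


def complete(typed):
    """Return a template for the keyword that starts the typed text, or None."""
    stripped = typed.strip()
    if not stripped:
        return None
    keyword = stripped.split()[0]
    return SNIPPETS.get(keyword)


# Simulate the developer typing "for " and pressing tab.
print(complete("for "))
```

The developer would then replace the placeholder names and the pass statement with the part that is specific to their code.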

VentureBeat: How is AI used on GitHub today? What services are available for developers on GitHub, either for researchers or people who are building things?

Miller: Well, one of the very first major AI ships was topics. So in GitHub today we give you automatic suggestions to tag your repositories with topics, so if you build a repository you can tag it with things like data science, machine learning, Ruby, or something like that.

VentureBeat: Predictive suggestions for tags to be placed on a repository, yeah.

Miller: And that helps with discoverability because [there are] so many repositories on the platform, discovering them based on the things they do is hard. So if we can get our users to help us with that problem by tagging their repos, then it makes discoverability slightly easier. Another one that we worked on was security vulnerability alerts; so, understanding security vulnerabilities in Python, in Ruby, a part of that requires machine learning. Like, "Oh, this Ruby gem has a vulnerability alert that has been fixed, and this one," so that kind of thing, we use ML for that.

VentureBeat: To recognize if there's an issue with the code?

Miller: Not necessarily. Since we have all this data, we can see there are CVEs that are published, and then we can do certain kinds of predictions: "Oh, this looks like code that may have a potential security alert."

That is not production ready -- that is a prototype that we're playing with right now, so that one is not even anywhere near showtime -- but that's the kind of direction we're going with that.

One that we've launched publicly which we did last year was the discovery dashboard, which is a recommendation engine based on follows data as well as based on your page views, so we can serve you interesting repositories, interesting projects, hopefully at a time when you would like to do stuff.
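A recommendation engine that mixes two signals can be sketched as a weighted score. The example below is a guess at the general shape, not a description of GitHub's discovery dashboard: the recommend function, the activity and page_views structures, and both weights are assumptions made for illustration.

```python
from collections import Counter


def recommend(followed, activity, page_views, top_k=2):
    """Rank repositories by a weighted mix of two signals.

    followed:   users the reader follows
    activity:   hypothetical map of user -> repositories that user works on
    page_views: hypothetical map of repository -> times the reader viewed it
    """
    scores = Counter()
    for user in followed:
        for repo in activity.get(user, []):
            scores[repo] += 2  # assumed weight for the follows signal
    for repo, views in page_views.items():
        scores[repo] += views  # assumed weight of 1 per page view
    return [repo for repo, _ in scores.most_common(top_k)]


# Made-up data: two followed users and a short browsing history.
followed = ["ada", "lin"]
activity = {"ada": ["repo/a", "repo/b"], "lin": ["repo/b"]}
page_views = {"repo/a": 1, "repo/c": 1}
print(recommend(followed, activity, page_views))
```

With this made-up data, repo/b ranks first because both followed users work on it, and repo/a comes second because it picks up both a follow signal and a page view.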

So those are just examples, but our roadmap has a whole lot more coming, and the kinds of things we're working on require like a year or two years or three years of scale to production, because at our scale, we can build a lot of things very, very fast, but we have to make sure it's robust, and scaling our infrastructure requires time.

VentureBeat: Are there any specific fields of AI that GitHub wants to get deeper into? I look at a lot of computer vision stuff but I don't really associate...

Miller: We don't really use computer vision because our dataset isn't images. Our dataset is text, and we do representational learning, learning representations of data, and our data is natural language and programming language for machine learning on code. That's what we're doing. We study how humans speak, how humans acquire computation, how humans work with programming languages to achieve computation, and everything is text.

VentureBeat: Are there any other projects out there you would say helped inspire this initiative, or somebody else who has done it right?

Miller: This area that we work on is at the cutting edge and it's niche, so not that many. The community is quite small because there's not that many places in the world that would have that level of scale of code that would be able to do that kind of machine learning or even have a need for it, and so therefore the community is rather small and it's still somewhat nascent. So we're all at the beginning of what that's going to look like.
