Google today is open-sourcing SyntaxNet, a piece of natural-language understanding (NLU) software that you can use to automatically parse sentences, as part of its TensorFlow open source machine learning library. The release includes code for training new models, as well as a pre-trained model for parsing English-language text.
The parser, which goes by the name Parsey McParseface, can automatically figure out whether a word is a noun, a verb, or an adjective, just like your third-grade English teacher. Google says it is the most accurate parser in the world, beating out the company’s own previous technology, and it’s nearly as accurate as human linguists. So this is a big deal in the world of natural-language research.
But this is also important for Google itself.
“The way we evaluate technologies internally is actually pretty different. We care much less about benchmarks and much more about how they impact performance of downstream systems. Our goal is to improve user experiences,” Google Research product manager Dave Orr told VentureBeat in an interview at Google headquarters in Mountain View, California, earlier this week.
Like TensorFlow itself, SyntaxNet is primarily written in C++. Now that it’s available outside Google’s walls, the code stands to be improved by people outside the company, which could help Google find new talent as well as bring about improvements to its products. Generally speaking, language parsing is relevant for product reviews (app reviews, restaurant reviews, shopping reviews) as well as Internet searches and the Google Now On Tap feature that’s part of Android Marshmallow.
“It’s really important because language is sometimes subtle, and it’s not necessarily straightforward to understand what people are saying, and some things are very contextual,” Google Research team lead Tania Bedrax-Weiss told VentureBeat. “When you’re talking about a crash, it matters whether it’s a car crash, or in an app, or somebody’s just tired and they say, ‘I’m crashing.’ All of these contextual meanings for the word ‘crash’ are very subtle and require quite a bit of understanding. We can actually start training on this data and doing really interesting things.”
Compared with more traditional machine learning methods, a type of artificial intelligence called deep learning has proven to be more useful for language understanding at Google, Orr said. The approach generally involves training artificial neural nets on lots of data (like, say, Google searches) and then getting them to make inferences about new data. Google has employed deep learning for image recognition and speech recognition, and now the technique is clearly showing gains in the world of language understanding. Indeed, neural nets are key in SyntaxNet, which has carried the codename “neurosis.”
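To make the train-then-infer idea concrete, here is a minimal sketch in Python. It is not SyntaxNet's code or architecture: it is a single-neuron perceptron, about the simplest relative of a neural net, and the task, the training examples, and every function name are hypothetical choices made for illustration. It learns to guess whether the word "crash" is a noun or a verb from the words on either side of it, then makes predictions about word combinations it never saw during training.

```python
# Toy illustration only: a single-neuron perceptron, NOT SyntaxNet's model.
# Hypothetical task: is "crash" a NOUN (+1) or a VERB (-1), judging only by
# the word before it and the word after it?

# Hand-made training examples (assumed for illustration, not real Google data).
# Each entry is ((previous word, next word), label).
TRAINING_DATA = [
    (("the", "was"), 1),        # "the crash was ..."
    (("a", "happened"), 1),     # "a crash happened"
    (("car", "injured"), 1),    # "car crash injured ..."
    (("will", "soon"), -1),     # "will crash soon"
    (("apps", "often"), -1),    # "apps crash often"
    (("might", "again"), -1),   # "might crash again"
]


def features(prev_word, next_word):
    """Turn the context into named features, plus a constant bias feature."""
    return ["prev=" + prev_word, "next=" + next_word, "bias"]


def score(weights, feats):
    """Sum the learned weights of the active features."""
    return sum(weights.get(f, 0.0) for f in feats)


def train(examples, epochs=10):
    """Perceptron rule: adjust weights only when a prediction is wrong."""
    weights = {}
    for _ in range(epochs):
        for (prev_word, next_word), label in examples:
            feats = features(prev_word, next_word)
            predicted = 1 if score(weights, feats) > 0 else -1
            if predicted != label:
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + label
    return weights


def predict(weights, prev_word, next_word):
    """Inference on new data: apply the learned weights to a context."""
    feats = features(prev_word, next_word)
    return "NOUN" if score(weights, feats) > 0 else "VERB"


if __name__ == "__main__":
    learned = train(TRAINING_DATA)
    # Neither of these exact word pairs appears in the training data.
    print(predict(learned, "the", "happened"))  # NOUN
    print(predict(learned, "will", "often"))    # VERB
```

A real system trains far larger networks on vastly more text and predicts full sentence structure rather than a single label, but the two phases are the same: fit weights to labeled examples, then apply them to unseen input.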
For more on SyntaxNet, check out the library on GitHub as well as the corresponding Google Research blog post.