Yahoo today announced that it has released, under an open source license, the source code for Anthelion, its web crawler designed for parsing structured data from HTML pages.
Web crawling is at the very core of Yahoo, even though the company has many other applications, including Yahoo Mail, Yahoo Finance, Yahoo Messenger, Flickr, and Tumblr. For Yahoo to share code in an area as competitive as web search is significant.
Of course, it’s coming at a time when Yahoo is planning a “reverse spin-off” of certain core business assets, but not its stake in Alibaba. Plus, Yahoo chief executive Marissa Mayer just had twins, so there’s that. (You could argue that that’s irrelevant to web crawling technology, and I would probably concede that it is, too, but I can’t bring myself to write about Yahoo’s good news without mentioning the company’s latest headlines.) Anyway.
Last year, at the Conference on Information and Knowledge Management in Shanghai, Yahoo detailed Anthelion in a paper.
“To the best of our knowledge, we are first to introduce the idea of a crawler focusing on semantic data, embedded in HTML pages using markup languages as microdata, microformats or RDFa,” wrote authors Peter Mika and Roi Blanco of Yahoo Labs and Robert Meusel of Germany’s University of Mannheim.
Microdata and RDFa are syntaxes for embedding structured data in the markup of an HTML page. Both can be used with the schema.org vocabulary for structured data, a project that the Google, Yahoo, and Bing search engines all work on.
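To make that concrete, here is a minimal sketch of what microdata looks like in a page and how a program might read it. It uses only Python's standard library, and the sample markup, the class name `MicrodataSketch`, and the function `extract_microdata` are all made up for illustration. It ignores nested items and is not taken from Anthelion.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: microdata attributes with a schema.org type.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Example Movie</h1>
  <span itemprop="actor">Example Actor</span>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collects itemtype values and flat (itemprop, text) pairs; no nesting."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = []
        self._pending_prop = None  # itemprop waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            self._pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._pending_prop and text:
            self.properties.append((self._pending_prop, text))
            self._pending_prop = None


def extract_microdata(html):
    """Return (item types, property pairs) found in the given HTML string."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.item_types, parser.properties


if __name__ == "__main__":
    print(extract_microdata(SAMPLE_HTML))
```

A real extractor would also have to handle nested items, attribute-valued properties such as `href` or `content`, and the RDFa and microformats syntaxes, all of which this sketch leaves out.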
The authors of the paper showed how an implementation of the crawling technology can offer a higher number of relevant results for certain search queries.
Now the code is available under an Apache license on GitHub, as a plugin for the longstanding Apache Nutch open source web crawler.
“Anthelion can be targeted to crawl for specific pages; for example, those including markup describing movies with at least two different attributes such as the title of and actors in a movie,” Mika, Blanco, Meusel, and Yahoo Research intern Petar Ristoski wrote today in a Tumblr post on the news.
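The targeting condition in that quote can be sketched as a simple predicate. This is only an illustration of the idea, not Anthelion's code: the function name `is_target_page`, its parameters, and the dictionary layout for extracted items are all assumptions made for this example.

```python
def is_target_page(items, wanted_type="http://schema.org/Movie", min_attributes=2):
    """Return True if any item of the wanted type has enough distinct,
    non-empty attributes.

    `items` is assumed to be a list of dicts shaped like
    {"type": <type URL>, "properties": {<name>: <value>, ...}}.
    """
    for item in items:
        if item.get("type") != wanted_type:
            continue
        # Count only attributes that actually carry a value.
        filled = {name for name, value in item.get("properties", {}).items() if value}
        if len(filled) >= min_attributes:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical extraction results for two pages.
    rich_page = [{"type": "http://schema.org/Movie",
                  "properties": {"name": "Example Movie", "actor": "Example Actor"}}]
    thin_page = [{"type": "http://schema.org/Movie",
                  "properties": {"name": "Example Movie"}}]
    print(is_target_page(rich_page), is_target_page(thin_page))
```

Under these assumptions, a page that describes a movie with both a title and an actor passes the check, while a page with only a title does not.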