Internet search startup Diffbot launched its API today for visually scanning, parsing and extracting information from web pages. Diffbot detects what type of layout a page has, then uses common visual cues either to monitor the page for content changes or to extract specific information for developers to use.
The Palo Alto-based company was founded in 2008 by two former Stanford students, CEO Mike Tung and CTO Leith Abdulla, with seed funding from Stanford incubator StartX. Tung originally created Diffbot to monitor the websites for his various Stanford classes and tip him off to any new announcements, posted lectures or assignments via text message.
According to Diffbot’s creators, all web pages fall into one of 30 page-type categories. Once Diffbot determines which category a page falls into, it can infer the types of information on that page. For example, front pages of news sites typically have the same elements: headlines, images, tags, advertisements and article summaries.
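To make the category idea concrete, here is a minimal sketch in Python of a lookup from page type to the elements one would expect to find. It is an illustration only, not Diffbot's implementation: the names `EXPECTED_ELEMENTS` and `expected_elements` are hypothetical, and only the front-page entry comes from the description above. The article entry is an assumed example.

```python
# Hypothetical lookup: page-type category -> elements expected on that page.
# Only the "frontpage" entry is taken from the article text; the "article"
# entry is an assumption added for illustration.
EXPECTED_ELEMENTS = {
    "frontpage": [
        "headlines",
        "images",
        "tags",
        "advertisements",
        "article summaries",
    ],
    "article": ["title", "author", "date", "body text"],  # assumed
}


def expected_elements(page_type: str) -> list[str]:
    """Return the elements to look for once a page's category is known.

    Unknown categories yield an empty list rather than raising an error.
    """
    return EXPECTED_ELEMENTS.get(page_type.lower(), [])


print(expected_elements("Frontpage"))
```

The point of the sketch is that classification comes first: once the category is known, the extractor knows which elements to go looking for.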
“Diffbot understands visually what all of these different elements of the page are and can be used by developers to connect that content to direct action,” Tung told VentureBeat.
Currently Diffbot has hundreds of developers using the beta API, and some intriguing products have already been created using the tools. AOL’s free Editions iPad magazine app uses Diffbot to analyze the front pages of news sites and pull out important new or breaking information. Hacker News Radio tapped Diffbot to pull content from hacker news sites and turn it into spoken reports. The city of São Paulo in Brazil uses Diffbot to track changes on the local government website and turn them into an automated Twitter feed.
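The change-tracking use case can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Diffbot's method: it compares two snapshots of already-extracted text items by hash, whereas Diffbot is described as working from visual cues. The function names and sample data are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash an item after collapsing whitespace, so that purely cosmetic
    spacing changes do not count as new content."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fresh_items(previous: list[str], current: list[str]) -> list[str]:
    """Return items in the current snapshot that were absent from the
    previous one, preserving their order on the page."""
    seen = {fingerprint(item) for item in previous}
    return [item for item in current if fingerprint(item) not in seen]


# Hypothetical snapshots of a government site's announcements.
old_snapshot = ["Road closures this weekend", "Budget hearing scheduled"]
new_snapshot = [
    "Road closures  this weekend",  # same item, extra space
    "New recycling schedule",
    "Budget hearing scheduled",
]

print(fresh_items(old_snapshot, new_snapshot))  # ['New recycling schedule']
```

Each item returned by `fresh_items` could then be handed to whatever publishes the update, such as a feed generator or a posting script.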
Diffbot’s Follow API creates an RSS-style feed of fresh content. The On-Demand API currently looks at just two major page types, Frontpage and Article, but Diffbot plans on releasing more in the future, and that’s when things could get interesting.
“Once we have released the API for all 30 page types, we hope to enable a new type of mobile application — one where the user can take actions directly on web data, instead of reading a bunch of blue links,” said Tung to VentureBeat. “Something like SIRI, but for the entire web, and not just a set of handpicked APIs.”