Google is finally debuting its new search indexing infrastructure, cutely dubbed Caffeine, in hopes of keeping up with rapidly changing, interconnected and increasingly real-time web content.
The new system aggregates more pages than ever before and has halved the amount of time it takes to index them, meaning even newer, more up-to-date content for users, the company says. It has even created a Caffeine logo representing the fast-flying flurry of information that its traditional PageRank system is no longer sufficient to parse.
In addition to content being published faster than ever before, there are more content streams to be concerned about. The number of indexed images, videos, news articles, tweets, and social network status updates is exploding — and Google seeks to separate and make them individually searchable for user convenience. This presents a multi-faceted challenge.
The trend toward real-time information has also presented a problem for Google to solve. Twitter and Facebook especially are raising people's expectations of how quickly the web moves, making it mandatory for Google to offer some real-time results that are still easy to find and sort.
Google's old indexing system, which made it famous and blew Yahoo and Ask.com out of the water when it launched a decade ago, took a layered approach. Top-tier content (determined based on linking) would be refreshed more often than lower-tier content. But each of these refreshes required Google to review all web content, and to discover and rank new pages. This obviously required a lot of time and computing power.
Caffeine breaks this task into bite-size pieces. Instead of analyzing vast swaths of the internet with every update, it continuously looks at much smaller portions, re-indexing content along the way. This means that recently published pages are added much sooner than they used to be. It's a bit like how they paint the Golden Gate Bridge: a little at a time from one end to the other, and when they're done they start from the beginning again.
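For the technically curious, the difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration of incremental indexing in general, not Google's actual design: the `IncrementalIndex` class, its method names, and the example URLs are all made up for this sketch. The point is that updating one page touches only that page's entries, rather than rebuilding the whole index.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index updated one page at a time (hypothetical, for illustration)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs containing it
        self.page_terms = {}              # URL -> terms currently indexed for it

    def update_page(self, url, text):
        """Re-index a single page without touching any other page."""
        new_terms = set(text.lower().split())
        old_terms = self.page_terms.get(url, set())
        # Drop terms the page no longer contains.
        for term in old_terms - new_terms:
            self.postings[term].discard(url)
            if not self.postings[term]:
                del self.postings[term]
        # Add terms that are new to this page.
        for term in new_terms - old_terms:
            self.postings[term].add(url)
        self.page_terms[url] = new_terms

    def search(self, term):
        """Return the URLs currently indexed under a single term."""
        return sorted(self.postings.get(term.lower(), set()))


index = IncrementalIndex()
index.update_page("a.example/1", "caffeine speeds up indexing")
index.update_page("b.example/2", "old batch indexing was slow")
hits_before = index.search("indexing")

# Only the changed page is re-indexed; the first page is left alone.
index.update_page("b.example/2", "page rewritten about coffee")
hits_after = index.search("indexing")
```

In this sketch a freshly published or edited page becomes searchable as soon as `update_page` runs for it, which is the general idea behind adding new pages sooner instead of waiting for a full refresh.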
The new indexing system, the biggest change to the search engine's methodology in four years, has also gotten a power boost. It's now capable of adding hundreds of thousands of pages to the Google index per second, and hundreds of thousands of gigabytes of information per day. All told, Caffeine runs on 100 million gigabytes of data storage.
Google has been talking about an indexing revamp for a while and says that Caffeine has been in testing since August of last year. Initially, it was supposed to go live soon after New Year’s, but this never came to fruition. It’s uncertain what held it back until now.
Notably, the search giant couldn’t help but take a small stab at its frenemy, Apple, boasting that Caffeine’s storage capacity is equivalent to 625,000 of the world’s largest iPods, which would extend 40 miles if laid end-to-end. Sort of a sad attempt to make Apple sound analog (especially considering yesterday’s iPhone 4 news orgy), but it sounds like Caffeine is on the right track.
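As a quick sanity check on that comparison, the arithmetic can be worked out from the figures quoted above (100 million gigabytes, 625,000 iPods, 40 miles). The variable names below are just for this sketch, and the only outside number is the standard 63,360 inches per mile.

```python
TOTAL_STORAGE_GB = 100_000_000  # Caffeine's storage, as quoted in the article
IPOD_COUNT = 625_000            # number of iPods in Google's comparison
LINE_LENGTH_MILES = 40          # claimed end-to-end length
INCHES_PER_MILE = 63_360        # standard conversion factor

# Capacity each iPod would need to hold an equal share of the data.
gb_per_ipod = TOTAL_STORAGE_GB / IPOD_COUNT

# Length each iPod would need to be for the row to stretch 40 miles.
inches_per_ipod = LINE_LENGTH_MILES * INCHES_PER_MILE / IPOD_COUNT

print(f"{gb_per_ipod:.0f} GB per iPod, {inches_per_ipod:.2f} inches each")
```

That works out to 160 GB and a little over 4 inches per device, which is at least consistent with a large-capacity iPod, so the numbers hang together.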