Google gave an inside peek into how web search works today, revealing some fascinating numbers in the process.
Search starts, of course, with crawling and indexing, and Google says that the web now has 30 trillion unique pages. That's up an astonishing 30-fold in five years: Google reported in 2008 that the web had just one trillion pages.
Google says that it stores information about those 30 trillion pages in the Google Index, which is now at 100 million gigabytes. That's about 100,000 terabytes (100 petabytes), and you'd need over three million 32GB USB thumb drives to store all that data.
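The storage figures are easy to sanity-check yourself. A quick back-of-the-envelope calculation (the only inputs are the numbers quoted above):

```python
# Back-of-the-envelope check of the index-size figures quoted in the article.
INDEX_GB = 100_000_000        # 100 million gigabytes, per Google
DRIVE_GB = 32                 # one 32GB USB thumb drive

index_tb = INDEX_GB / 1_000   # gigabytes -> terabytes
drives = INDEX_GB / DRIVE_GB  # how many thumb drives to hold it all

print(f"{index_tb:,.0f} TB")      # 100,000 TB (i.e., 100 petabytes)
print(f"{drives:,.0f} drives")    # 3,125,000 thumb drives
```

That works out to roughly 3.1 million thumb drives, which is where the "over three million" figure comes from.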
When you search, Google tries to figure out not just what you’re typing into the box, but what you mean. So algorithms for spelling, autocompletion, synonyms, and query understanding jump into action. When Google thinks it knows what you want, it pulls results from those 30 trillion pages and 100 million gigabytes, but it doesn’t just give you what it finds.
First, a ranking procedure uses over 200 closely guarded secret factors that look at the freshness of the results, quality of the website, age of the domain, safety and appropriateness of the content, and user context like location, prior searches, Google+ history and connections, and much more.
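The general idea of combining many signals into one ranking score can be sketched as a weighted sum. To be clear, Google's actual factors and weights are secret; the three signals and weights below are invented purely to show the mechanism:

```python
# Illustrative weighted-scoring sketch of multi-signal ranking.
# These signals and weights are hypothetical, not Google's.
WEIGHTS = {"freshness": 0.3, "site_quality": 0.5, "domain_age": 0.2}

def score(page: dict) -> float:
    """Combine per-signal scores (each 0.0 to 1.0) into one ranking score."""
    return sum(WEIGHTS[s] * page.get(s, 0.0) for s in WEIGHTS)

pages = [
    {"url": "a.example", "freshness": 0.9, "site_quality": 0.4, "domain_age": 0.2},
    {"url": "b.example", "freshness": 0.5, "site_quality": 0.9, "domain_age": 0.8},
]
ranked = sorted(pages, key=score, reverse=True)
print([p["url"] for p in ranked])   # b.example ranks first (0.76 vs 0.51)
```

Here the fresher page still loses to the higher-quality one because quality carries the largest weight, which is the kind of trade-off a multi-factor ranker makes.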
Then, in just over an eighth of a second, Google delivers the results to your computer, tablet, or phone.
To test how well its searches are actually performing, Google also uses real, live humans: search evaluators. Forty thousand times a year, Google's search testers check results, see what's working, and provide suggestions on how to improve.
And what about web spam?
Web spam is useless pages crafted to rank well on Google, draw your attention and clicks, and then monetize those clicks by sending you off somewhere else. Google said that it notifies sites when it considers them spam, or when they have been hacked, at a rate of 40,000 to 60,000 notifications per month.