Editor’s note: This story is part of our Microsoft-sponsored series on cutting-edge innovation.
When you search a traditional search engine, like Google, Yahoo or Bing, you are getting a picture of the web as it was yesterday, last week or last month. Nowadays that’s called static search. The search engine crawls the web by following links between websites, indexes those pages, creates a graph of the web, and then lets you search that graph. The order of the results is determined, among other things, by the authority of each page: primarily the number of links the page has received over its life (PageRank).
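As a rough illustration of that batch approach, the sketch below counts inbound links over a crawled snapshot and orders pages by that count. It is a deliberate simplification: real PageRank also weights each link by the rank of the page it comes from, and the page names and the `links` snapshot are invented for the example.

```python
from collections import Counter

# Hypothetical crawl snapshot: each page maps to the pages it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}


def rank_by_inbound_links(link_graph):
    """Order pages by how many distinct pages link to them, most first."""
    inbound = Counter()
    for source, targets in link_graph.items():
        inbound.setdefault(source, 0)  # keep pages that nobody links to
        for target in set(targets):
            if target != source:  # ignore self-links
                inbound[target] += 1
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(inbound, key=lambda page: (-inbound[page], page))


static_order = rank_by_inbound_links(links)
```

Because the counts only change when the snapshot is rebuilt, a page published after the last crawl cannot appear in `static_order` at all, no matter how relevant it is.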
Then there’s realtime search, which strives to deliver the web as it is right now. Realtime search — from providers such as Scoopler, Collecta, and my own company, OneRiot — requires a completely different approach to indexing and ordering the web. You can’t take the batch process approach used in static search, because by the time you’ve crawled and indexed the web, your index is already stale. Even if you used the brute force approach of indexing the whole web every second of the day (and that would be boiling a very big ocean), you’d still be stuck with the problem that older content would rank higher than new content. Yet, often newer content is an improvement on older content. That’s kind of the point of creating new content.
Realtime search is a new way of indexing the Web
When you’re a realtime search provider, millions of signals arrive every hour of the day telling you what to index on the web. These signals come in realtime. There are two major types: explicit signals and implicit signals.
Explicit signals occur when someone tweets a link, saves it in Delicious, votes for it on YouTube, or Diggs the link. You know who shared the link, and usually there is a small piece of content (like a tweet) associated with the share. The identity of the sharer is very important here, because over time you learn who the more influential people on the web are today.
Implicit signals are actual click data: how many times a page was viewed. There are many sources of click data, including ISP data, URL shortener data, and toolbar data. These are powerful signals that tell us, in realtime, how heavily a page is used relative to other pages on the web.
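One way to picture the two signal types is as two small record shapes, shown in the sketch below. The field names and the `signals_by_url` helper are assumptions made for illustration and do not describe any particular engine's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExplicitSignal:
    """Someone deliberately shared a link (a tweet, a bookmark, a vote)."""
    url: str
    sharer: str            # identity matters: it feeds influence estimates
    source: str            # e.g. "twitter", "delicious", "digg"
    timestamp: float       # seconds since the epoch
    text: Optional[str] = None  # the short content attached to the share


@dataclass(frozen=True)
class ImplicitSignal:
    """Anonymous usage data: the page was viewed."""
    url: str
    source: str            # e.g. "isp", "url_shortener", "toolbar"
    timestamp: float
    views: int = 1


def signals_by_url(signals):
    """Group a mixed stream of signals by the page they point at."""
    grouped = {}
    for signal in signals:
        grouped.setdefault(signal.url, []).append(signal)
    return grouped
```

Grouping by URL is the natural first step, since both kinds of signal are ultimately evidence about the same thing: which pages people care about right now.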
Every search engine is slightly different in its approach, but the volume is incredible no matter how you do it. At OneRiot, for example, we process over 30,000 shares per minute during peak load, and between 30 and 40 million per day. If you could imagine Google processing its toolbar data in realtime, or Comcast processing its ISP data in realtime, you’d get closer to billions of signals per day. Combine them all together, and then you really start having fun! And because we’re talking about a realtime search engine, you need to turn these signals into search results within seconds. Making a page searchable requires fetching it, determining the relevant content on the page, screening for porn, making sure the page is not spam, and extracting any images or videos. Within seconds of the first signal arriving, every shared page becomes searchable.
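The processing steps listed above can be sketched as a small pipeline. Everything here is a stand-in: the keyword sets are placeholder filters rather than real adult-content or spam classifiers, the regular expressions are a crude substitute for proper HTML parsing, and `fetch` is passed in as a hypothetical callable so the example runs without network access.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Placeholder filters; a production system would use trained classifiers.
BLOCKED_TERMS = {"xxx"}
SPAM_TERMS = {"buy now", "free money"}


@dataclass
class IndexedPage:
    url: str
    text: str
    media: List[str] = field(default_factory=list)


def extract_text(html: str) -> str:
    """Strip tags and collapse whitespace (a rough content extractor)."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def extract_media(html: str) -> List[str]:
    """Collect the src attribute of <img> and <video> tags."""
    return re.findall(r'<(?:img|video)\b[^>]*\bsrc="([^"]+)"', html)


def process_signal(url: str, fetch: Callable[[str], str]) -> Optional[IndexedPage]:
    """Turn one shared URL into an indexable page, or None if rejected."""
    html = fetch(url)
    text = extract_text(html)
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return None  # failed the adult-content screen
    if any(term in lowered for term in SPAM_TERMS):
        return None  # failed the spam screen
    return IndexedPage(url=url, text=text, media=extract_media(html))
```

At the volumes quoted above, each stage has to be cheap and able to run in parallel, which is one reason to keep the stages as independent functions.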
Realtime search is a new way of ordering the Web
But realtime search isn’t just about searching the newest content; it’s about getting to the most relevant content. Once a page is indexed, the order of the page needs to remain fluid. A page that is the top result one minute can become irrelevant the next minute. A perfect example of this was the day of Michael Jackson’s death. Realtime search engines were able to elevate the TMZ article to the top of the results within minutes of it being posted. They do this by watching the acceleration of the page’s shares and changing the order of results in realtime. The minute after he died, the most relevant result had fundamentally changed. A realtime search engine re-orders millions of pages every time you click the search button. No batch processing, no old results, all realtime.
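A minimal sketch of ordering by share acceleration follows. It assumes acceleration can be approximated as the number of shares in the most recent time window minus the number in the window before it; the 60-second window, the function names, and the sample data are illustrative choices, not a description of how OneRiot or any other engine actually scores pages.

```python
def share_acceleration(timestamps, now, window=60.0):
    """Shares in the latest window minus shares in the window before it."""
    recent = sum(1 for t in timestamps if now - window < t <= now)
    previous = sum(1 for t in timestamps if now - 2 * window < t <= now - window)
    return recent - previous


def rank_by_acceleration(shares_by_url, now, window=60.0):
    """Order URLs so the fastest-accelerating page comes first."""
    return sorted(
        shares_by_url,
        key=lambda url: (-share_acceleration(shares_by_url[url], now, window), url),
    )


# Hypothetical share timestamps, in seconds.
shares = {
    "old-profile": [890, 900, 910, 950, 960],       # sharing is slowing down
    "breaking-story": [930, 970, 980, 990, 995],    # sharing is speeding up
}
ranking = rank_by_acceleration(shares, now=1000)
```

Both pages have five shares in total, so a ranking based on cumulative counts could not separate them. Ranking by the change in share rate puts the breaking story first, and the order can flip again as soon as the rates change.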
In the screenshots pictured, you’ll see the difference. The query is ‘Google Wave’. In the realtime results you get how-to videos, the latest reports on Google Wave, and a very popular spoof on Google Wave. In the traditional static search results you get a very authoritative view into Google Wave, but it’s closer to what you’d get if you were looking in a library, and it doesn’t reflect what’s actually happening on the web.
Realtime search: the future
While realtime search is focused mostly on the social web right now, over the long term it will expand to include the rest of the web as well. From basic research to long tail searches, you will always be better off seeing a current picture of the web. In order to fill this need, realtime search will have to do everything it already does, but do it with billions of web pages instead of millions.
The funny thing, though, is that less than a year ago we saw under 1% of the shares we see today. Twitter is exploding in growth and has become the dominant explicit signal. The URL shorteners are capturing amazing data on click activity, and growing by leaps and bounds. What will happen when Facebook releases an API for its share activity? We could see 10 times the share activity appear overnight. The day that Google processes its toolbar data in realtime will fundamentally alter its search results. And most exciting to think about is what the next Twitter will be.
The “Twitter-like” product that enables academics and universities to embrace realtime sharing will power realtime search and discovery of research papers around the world. Do a search for “Machine Learning” on Google and you’ll get papers that range up to 15 years old in your results. Have there really been no advancements in Machine Learning since 1997? Or is it simply that those papers have built up such authority over the past decade that all new research gets ignored by static search algorithms? Realtime search will tap into the community’s knowledge sharing and reorder results to give you an up-to-date, socially relevant set of results.
This is just one example of realtime search’s potential. Data is becoming more and more public. The Twitter of tomorrow is going to take us even further down the path of open sharing. As this data becomes available and realtime technologies get better, there will be a day when you will be able to search the entire web in realtime. And I’m betting that day is much closer than any of us think.
Kimbal Musk is the CEO of OneRiot, a realtime search engine that finds the freshest news, videos, blogs, and websites that people are sharing online right now. He is an entrepreneur who has helped found, advise and invest in several software and technology companies. In 1995, Kimbal started Zip2, an early internet content management company, with his brother Elon Musk. The company was sold in 1999 to Compaq for $307 million in cash, one of the largest transactions of its kind in the internet industry. After selling Zip2, Musk was an early investor in PayPal, and in October 2002 PayPal was acquired by eBay for US$1.5 billion in stock. Kimbal is also chef-owner of a restaurant, “The Kitchen,” in Boulder, Colorado, which has been named one of “America’s Top Restaurants” by Food & Wine, Zagat, Gourmet, and the James Beard Foundation. Kimbal is a graduate of Queen’s University in Canada, with a degree in Business Communication, and of the French Culinary Institute in New York City. Kimbal currently sits on the board of directors for SpaceX Corp. and Tesla Motors. You can contact him at firstname.lastname@example.org and follow him on Twitter @kimbal.
[Image at top of story from Ron Niebrugge’s photo blog.]