Open data is the future of web discovery

Twitter cofounders have talked about the importance of discovery in interviews and at conferences over the last several months. This week a new design for Twitter.com went live featuring top tweets and a search box to find more of what you want, but Twitter and many other web companies could improve discovery much more by incorporating other players' data.

Also, a year and a half ago, Google vice president Marissa Mayer said that social search is part of the future of search. Now, the question is what data can help make social search and discovery advance faster.

Think about the data that represents everything you do online, including web visits, searches, ads clicked, purchases, time spent, location, etc. Web products like the Google browser toolbar return data to Google about the websites you visit. Browsers like Chrome, Firefox and Internet Explorer can get even more data about what you do. For this piece, I'm referring to all this toolbar, browser, search and email data as "toolbar data" for short.

What you typically discover on Twitter and Facebook is limited to your connections and what you search. More, better data is needed to learn about what you're missing. You might have a lot of interests – sports, music, technology, books, movies, TV, food, travel, etc. – and things happen around you and around the web related to them that you probably want to know about. Surprise concert by your favorite band tomorrow night? New travel website? Cutting edge phone being released? We don't even know how much we're missing until we see it.

With toolbar data, developers could build services or apps that show what's hot now, or this week, month or year, for anything, broken down by age, location and more. One app might focus on the most popular content about travel to Asia based on unique visitors to specific web pages and the number of links shared by email or social networks. Another app might cover the most engaging communities online based on growth in time on particular parts of each website compared to peers. The data could look at user session activity across sites and specific content on web pages. In contrast, Google Hot Trends only reports on search terms, and free data from analytics services like Compete typically reports only at the website level unless you pay for web-page-level reports. Entrepreneurs could use the toolbar data to identify unmet needs, then build products and services to meet them. Without a more complete picture of the data, it's hard for entrepreneurs to know what users really want.

Twitter cofounder Biz Stone recently said that ranking the authority of tweets is needed to surface the important tweets. Already the Twitter feed can quickly become overrun with fresh content, causing you to miss tweets you may find interesting. Users not on the site all day could benefit from a summary of the best tweets before seeing the real-time stream while on the site. Surprisingly, I haven't seen a developer using Twitter APIs who has solved this problem, unless you count Twitter search companies that filter tweets in search results. But some third-party developers for Twitter tell me they could benefit from more data about what content each user cares about, such as the number of impressions and clicks on links and which users made them, as well as time spent on different pages and the locations of the users. At least part of that data is held by Twitter, Twitter clients and URL shorteners.

The data people want access to

With users sharing more and more on Twitter and Facebook, there are billions of statuses/tweets projected to be set each year. To try to capture this opportunity, dozens of Twitter search start-ups have sprung up over the last year. Big companies like Yahoo and Microsoft have released Application Programming Interfaces (APIs) that let developers access their web index and rerank results. But these services can be limiting. Google APIs, for example, do not permit developers to rerank results yet.

The hundreds of millions of people using Google prove that web content that's not all real-time is, of course, still very useful. The next step after continuing to improve real-time search is for Twitter search providers to figure out how to do relevance ranking over different amounts of time. Beyond searching Twitter data, there might be an opportunity to search the 4 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared on Facebook each month, with the degree of access depending on how users change their privacy settings using its upcoming privacy transition tool and whether Facebook makes a search API available as Twitter does.

The availability of tweets/status messages to find using real-time search is limited by when and how much people choose to share. Sometimes there's a delay between when an event happens and when people start to share about it online. Before they share, though, they're searching for or clicking on the content they want to share. Google benefits from seeing what people are searching and browsing around the web in part thanks to the Google Toolbar, Google Chrome, and other data, too, like Gmail and search history. For now, at least, Google has a data advantage and so might be better than most for telling you what's new around you. That said, this data gap isn't insurmountable for Facebook and Twitter. Either could eventually acquire toolbar-style reach by building large enough businesses around search or creating services that make users want to share their full data. They could also partner with a company like Yahoo or Microsoft to get the data.

Then the data disadvantage could go away and actually flip to Facebook's advantage, in particular because of Facebook's massive and growing user base, network effects and all its data not indexable by Google. Google creating a social graph as popular as Facebook's is actually less likely than Facebook acquiring toolbar or equivalent data. If Facebook eventually acquires the data, makes the data available to developers and helps commoditize search (assuming that data can be used to replicate and/or improve upon Google's advances), then where would that leave Google?

About the Google data advantage, someone close to Facebook tells me: "The framework I usually use for a discussion like this is 'explicit' vs. 'implicit.' One real danger here is that a lot of the data you will get from implicit browsing is searching and looking without success. While you can help narrow and get a sense of someone’s interests, you can’t necessarily predict the next best matches. The reason that Tweets/Facebook (and even PageRank) are interesting is that they try to actually take an explicit affirmative action such as a share or hyperlink and create useful data and relationships on top of the core data – and especially from real people such as your friends or influencers."

I'm told Google has a significantly higher number of toolbar users than Yahoo and Microsoft, but Microsoft Internet Explorer still has the largest browser install base. The number of active Google toolbar users is a closely held secret, and I'm told a considerable number of users turn off the tracking feature, but that the number of people with tracking on is large enough to get a picture of what's happening on most sites. Of course the exact numbers have a big impact on how much data these companies are actually collecting from users. Google made Google Hot Trends available a while ago to show trending searches for any given day, and clicking a search term shows web and news results. But it’s relatively uninformative compared to all the data Google knows about what’s happening on the web from constantly crawling the web and from services like Chrome and the toolbar. Compared to a Facebook or Twitter feed of content like news stories and status messages, a list of links to popular search terms not necessarily related to your interests just isn’t that interesting. Google Trends for Websites at least shows websites also visited, but it doesn't show trends for specific web pages.

Already Google indexes the web in near real-time, and some former Googlers say the company could easily have a real-time view into most of the web already. Imagine if Google made it easier to discover all the data it already knows about specific web pages or content. Think of all the different kinds of content out there, and get ready for a long list of possibilities. In terms of content, there are trending web pages and websites, products, news, photos, tweets, status messages, comments, blog posts, books, games, searches or any other web content. You could then potentially see trends for each piece of content, organized by types of data like unique visitors, type-in, upstream and downstream traffic, referrals, purchases, link sharing, clicks, demographics, user profiles, location and proximity to you, audience also searches for/likes/visits, traffic frequency, daily/weekly/monthly unique users/page views/visits, business activity, time spent, etc.

Imagine if you had control over filtering the content you discover using all the types listed above such as trending purchases or time on site. Think of it as data you see on comScore, Quantcast, Compete, etc. but a level deeper because it's about individual pages updated in real-time with the sorts of filtering options above. Google could share what's happening around you without waiting for users to set more status messages or type searches into the search box. The result could be a feed of what’s happening now or what’s happened ranked over any stretch with most anything online for most anyone around the world. This map of human activity could encompass most of the activity on Twitter and give a clearer picture about what’s done all around the web rather than just the subset of topics that tend to be most represented on Twitter.

How does this apply to everyday life?

What might the pairing of real-time with more traditional search data look like? One area is discovering trends. Maybe you like tech, so you look up the top ten most popular web pages (news, start-ups, Twitter accounts, etc.) in the category technology or consumer internet looked at or shared by people in Silicon Valley over the last day. Maybe you’re thinking about visiting New York City and want to find the hottest restaurants, so you look up the top trending Yelp restaurant pages over the last couple of months that were visited, shared or commented on by people living there. Maybe you have a favorite blog or Twitter account you follow, but you want to discover content that people like you are checking out, so you look at the top 10 trending web pages viewed over the last few weeks by people who also frequently visit the blog or Twitter account. Maybe you want to find something to do for fun, so you look at games that have the highest growth in time on site over the last few months. Developers could think of any number of combinations to make available to users, in forms that are easy to use or that offer customization options.
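To make the first of those lookups concrete, here is a minimal sketch of a "top pages by unique visitors" query. It assumes a hypothetical feed of page-view events that already carries a category, a visitor region and an anonymous visitor id; none of these field names come from a real toolbar API, and a real service would likely weigh shares and comments as well.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def top_pages(events, category, region, since, limit=10):
    """Rank pages by unique visitors for one category and region since a cutoff."""
    visitors = defaultdict(set)
    for e in events:
        if e["category"] == category and e["region"] == region and e["time"] >= since:
            # A set counts each visitor once, however often they return.
            visitors[e["url"]].add(e["visitor"])
    # Most unique visitors first; ties broken alphabetically by URL.
    ranked = sorted(visitors.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(url, len(v)) for url, v in ranked[:limit]]


# Made-up sample events (hypothetical field names and URLs).
now = datetime(2009, 8, 1, 12, 0)
events = [
    {"url": "a.example/story", "category": "technology", "region": "Silicon Valley",
     "visitor": "u1", "time": now - timedelta(hours=2)},
    {"url": "a.example/story", "category": "technology", "region": "Silicon Valley",
     "visitor": "u2", "time": now - timedelta(hours=3)},
    {"url": "a.example/story", "category": "technology", "region": "Silicon Valley",
     "visitor": "u1", "time": now - timedelta(hours=1)},  # repeat visit
    {"url": "b.example/launch", "category": "technology", "region": "Silicon Valley",
     "visitor": "u3", "time": now - timedelta(hours=5)},
    {"url": "b.example/launch", "category": "technology", "region": "New York",
     "visitor": "u4", "time": now - timedelta(hours=1)},  # other region
    {"url": "c.example/old", "category": "technology", "region": "Silicon Valley",
     "visitor": "u5", "time": now - timedelta(days=3)},  # too old
]

last_day = top_pages(events, "technology", "Silicon Valley", now - timedelta(days=1))
```

Swapping the filter (region, category, time window) or the ranking key (time on site, shares) would give the other lookups described above.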

An engineer who has been focusing on search tells me that academics often want search log files because they want all the information to derive their own statistics. This requires rigor and sophisticated techniques the researchers like to use, but the downside is that this data can be noisy. For developers, he said the search provider could aggregate the data into summaries and make those available instead of the raw data. This is not only more usable and efficient, but can also protect users' privacy better than raw data, because the data is presented as summaries rather than as web history tied to an individual, which can still expose identity. If interesting data that’s easily accessible is what’s driving all the developer activity around the Twitter API, then how much more developer activity might there be around toolbar data, which covers significantly more activity around the web than tweets, themselves only a subset of what people care to talk about publicly?
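As a rough illustration of the summaries-instead-of-raw-logs idea, the sketch below collapses hypothetical (user, URL) rows into per-URL unique-visitor counts and suppresses any page seen by fewer than a minimum number of users. The threshold of 3 and the data layout are assumptions made for the example, and thresholding alone is not a guarantee of anonymity.

```python
from collections import defaultdict

MIN_VISITORS = 3  # assumed cutoff; a real provider would choose its own


def summarize(raw_log, min_visitors=MIN_VISITORS):
    """Collapse (user, url) rows into per-URL unique-visitor counts.

    URLs seen by fewer than min_visitors users are left out, so the
    summary never describes a group small enough to point at one person.
    """
    users_by_url = defaultdict(set)
    for user, url in raw_log:
        users_by_url[url].add(user)
    return {
        url: len(users)
        for url, users in users_by_url.items()
        if len(users) >= min_visitors
    }


# Made-up raw log rows (hypothetical users and URLs).
raw_log = [
    ("u1", "news.example/a"),
    ("u2", "news.example/a"),
    ("u3", "news.example/a"),
    ("u1", "news.example/a"),  # repeat visit, counted once
    ("u4", "rare.example/x"),  # only one visitor, suppressed
]

summary = summarize(raw_log)
```

Developers would receive only `summary`, never `raw_log`, which is the trade the engineer describes: less detail in exchange for less exposure.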

What do people in the industry think about the potential of a toolbar data API?

Fred Wilson, the VC at Union Square Ventures who has backed companies including Twitter and comScore, said this about a possible toolbar data API: "It would be great. It's not likely to come from the analytics companies because they sell their data. Quantcast seemed to be headed in an advertising direction which would have made this approach workable for them. But lately, it seems they are looking more and more like Compete and comScore."

Othman Laraki, a former Googler who worked on Google Toolbar and is now cofounder and President of TownMe.com, said: "It's definitely an interesting area and something that could prove to be an incredibly valuable resource for people developing new services. Particularly, now that both search engines as well as other services are increasingly focused on surfacing real-time information, offering an API that makes it possible to shorten the feedback loop could be a game-changer. Whereas it has become easier to analyze traffic after the fact (i.e. when a service addresses a need, one can more easily understand why), what is still difficult is discovering the untapped opportunities (i.e. when consumers have a need that is not yet satisfied). Moreover, opening up the lower-level data would likely bring a great deal of creativity to the game. An example that comes to mind is Twitter's API. Twitter's openness has enabled numerous interesting applications that in turn have made the data more valuable in the first place."

Satya Patel, a former Googler and now a principal at Battery Ventures, said: "I just don’t see any company that has toolbar data making that data accessible to others. It’s too valuable and too much of a privacy concern. If you think about it, the data that Google has from the Google Toolbar and Google Analytics is incredible and basically can create a real-time map of the web. There are all kinds of interesting applications for this data but I just don’t see any benefit to Google and others to opening up this data. Twitter is different because it has already conceded some of this data, to Bit.ly for example. There are many sources of web usage data so it will be interesting to see how this market evolves and who really creates value for consumers or businesses based on this data."

A former long-time Googler said: "Toolbar data is super sensitive at Google. I think it would be unlikely that Google would ever share this data. Google doesn't want additional monetization at the expense of risking user trust/privacy on toolbar data. Larry is rightfully very concerned about defending the user's privacy (and for Google it is economically advantageous to do so). I think if sharing toolbar data were ever to happen it would be very far in the future and likely only because something about the infrastructure of the Internet changed so dramatically that all of the browsing habits of users became transparent or otherwise generally visible."

Konrad Feldman, chief executive of Quantcast, said: "Many people have toolbars for discovery, such as StumbleUpon. I think the static data is kind of interesting, but of course real-time trending coupled with collaborative filtering is where you really get the ‘aha’ moment for discovery." He also noted the company recently launched a new media program that lets advertisers make a profile of the type of users they want to reach, and then match this target group in real-time based on the 6 billion real-time media consumption events the company observes every day.

Saar Gur, partner at Charles River Ventures, which is a Twitter investor, said: "A number of valuable discovery services (measured by CTR) have been built leveraging cookies (e.g., online advertising, personalization and analytic services like Quantcast). A number of valuable discovery services have been built leveraging social data (e.g., Facebook, Twitter, Yelp, etc.). That being said, for a number of reasons developers have been unable to leverage the much richer data set that is captured by browsers and toolbars. With all the attention around bit.ly or ShareThis/AddThis as Digg killers, think of the services that could be built combining social data with actual page-level consumption data (e.g., time on site, pages within a site that are not cookied). As an example, the initial add-on market on Firefox is very interesting but doesn't really leverage the most interesting data that Firefox can capture."

Martin Green, Chief Operating Officer at Meebo, said: "I think an easy way to think about the ultimate user benefit is to see if social content data APIs can do for web content what Amazon delivers for products. I love shopping at Amazon because it knows my historical interests (purchases) and those of others, and it matches that history with the current inventory and usage trends and associations to make a series of relevant suggestions every time I go to the site or search for a product. I would personally love a content recommendation service for content discovery for the web 'right now' that is built from a combination of knowing my interests and taking advantage of the social filtering process from people I know and people who share my interests, and who’ve seen something relevant to me before I have discovered it."

Vipul Ved Prakash, CEO and founder of Topsy, said: “The types of higher order analysis that can be done using the visitation logs for discovery/recommendations/trends can be very valuable. We've built Topsy to process streams of events about the web - so we'd likely be one of the consumers of this data. The field of anonymizing network traces is an active area of research these days (motivated by the need for security researchers to share logs without compromising privacy) and that work might be applicable here. A good method for anonymizing traces is considered to be prefix-preservation, which has better privacy than one-to-one anonymous mapping (AOL did one-to-one) and more signal than completely anonymous. That said, completely anonymous is still extremely useful and perhaps more practical.”
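For readers unfamiliar with prefix-preservation, here is a minimal sketch of the idea as it is used on IP addresses in network traces: two addresses that share their first k bits map to anonymized addresses that also share exactly their first k bits. The construction below flips each bit based on a keyed hash of the bits before it, using HMAC-SHA256 as the keyed function. That choice, the function names and the sample addresses are assumptions for illustration, not a vetted scheme, and how the technique would carry over to toolbar visitation logs is left open.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(address, key):
    """Map an IPv4 address to another so that shared prefixes stay shared."""
    bits = format(int(ipaddress.IPv4Address(address)), "032b")
    out = []
    for i in range(32):
        # The flip for bit i depends only on the i bits that precede it,
        # so addresses with a common prefix get the same flips on that prefix.
        digest = hmac.new(key, bits[:i].encode("ascii"), hashlib.sha256).digest()
        flip = digest[0] & 1
        out.append(str(int(bits[i]) ^ flip))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


def shared_prefix_bits(a, b):
    """Count the leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


# Illustrative key and addresses only.
key = b"demo key, not for real use"
a, b, c = "192.168.1.10", "192.168.1.77", "10.0.0.1"
anon_a, anon_b, anon_c = (anonymize_ipv4(x, key) for x in (a, b, c))
```

Because each output bit is the input bit XORed with a value fixed by the earlier bits, the mapping is one-to-one for a given key, and an analyst can still group traffic by network without seeing the real addresses.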

Gregg Poulin, General Manager of Compete.com, said: "You are right on with toolbar data (we call it clickstream) being opened up and the benefits of ‘crowdsourcing’ it. You should also think about how Deep Packet Inspection could work here. That (currently) is very ‘black box’ and focused on delivering ads to people showing certain behavior but, I think that same technology could be opened up." The Compete API lets you access data on the website level, but not on the web page level.

Liad Agmon, founder of social search company Delver, said: "It’s extremely valuable, not only for the developer community, but for the general web community. There has been constant debate on how accurate are Alexa and Compete (which also use toolbar stats for their data), and getting access to raw data could be amazing."

Adam Boyden, president of Conduit, which powers community toolbars used by 200,000 publishers and 60 million monthly active users, said: "We think there is a huge trend to have information dynamically updated in real-time while sitting persistently within a user’s browser. More than that, we think you have picked up on an even bigger trend as users want to be able to customize the information they receive and content owners are finding they have to cooperate with other providers to give the best experience possible. In other words companies need to cooperate with each other increasingly to thrive. We developed the Conduit Open marketplace to help with this trend by allowing any content owner to offer their information to be easily added to other content owner’s toolbars and also allow end users to customize part of their toolbars as well." The company launched the open marketplace a few weeks ago, which lets companies offering Conduit toolbars share features and choose to include features created by others. Boyden said that any developer could create a discovery tool, then the owners of the toolbar could choose to make it available to new users, or give existing toolbar users the option to activate it provided they clearly understand and consent to whatever data is shared.

Mark Cramer, chief executive of Surf Canyon, said: “Data is extremely powerful, which is why those who have it are normally reluctant to give it up while those who don’t are excited about all the possible things they could do with the data. We formed Surf Canyon with the goal of building a technology that would re-rank results ‘on the fly’ as people search. Building the underlying search from scratch would have been prohibitive, but we’re able to benefit from residing in the browser as an add-on. We have also repurposed our technology to cull out the more relevant Tweets. We could similarly see integrating data from other sources, such as Compete, Quantcast, etc. and then even exploiting this data when re-ranking. (For example, popularity or freshness could help influence relevancy, in either a positive or negative direction.) The backend is built on Yahoo! Boss and Microsoft SilkRoad.”

Tobias Peggs, GM of OneRiot, said: "OneRiot’s user panel is a key advantage in real-time search. It enables us to deliver broad-based real-time search results by harvesting both explicit social activity on Twitter, Digg and other services in combination with implicit data from over 3 million users who have elected to join our panel. That data helps inform our PulseRank algorithm – PageRank for the real-time web – ensuring that our results reflect what’s relevant right now in relation to your query. OneRiot has a simple search API today that is handling millions of searches a day for partners like Microsoft and Scour who are delivering our real-time search results to their users. Soon we’ll extend the API to offer a deeper view into the data that we have around each piece of content in our index. Opening up this data (while respecting our users’ privacy) will give the developer community lots of opportunity to build some very creative real-time applications. API partners will now be able to get rich social meta-data on content from across the web, including information like “dwell times”, to number visits, to the PulseRank on any individual url, all in real-time."

Cyril Moutran, cofounder and chief executive of Twazzup, said: “Analyzing data stream from toolbar could indeed provide great insights. More generally any user activity stream that can be sliced either by topic, user segment and/or location can provide powerful insights. Communication tools like Twitter capture not just individual user actions, but also the propagation of links and messages through social graphs. Analyzing this propagation (how, how fast, where, who is involved) is the source of great insights.”

Privacy, and other issues

Facebook and Twitter have grown massively by taking what people share in posts and status messages and making it easy to share and consume content. One question is whether these companies and others might figure out discovery before the companies with the data advantage do. It’s unknown whether building a discovery service is a high enough priority, or even on the to-do list, at the companies holding the toolbar data. At the least, Facebook, Twitter and companies with useful data about Twitter – like Twitter clients and URL shorteners – could help developers trying to improve discovery and search by making more data available about what users care about, including what they click. Andrew Cohen of bit.ly tells me the company has made available an API to enable developers to access info about bit.ly links. Given that these links are clicked about 1 billion times per month, it will be interesting to see what people do with its API.

Someone at Google familiar with Google Toolbar told me that in some cases the company has not found much signal in toolbar data and so looked at links being explicitly shared for help. This person also said that link sharing data gathered from the toolbar is somewhat limited. Word is that links shared by Gmail users have been more valuable. Twitter benefits from having a significant number of links being shared.

As Facebook and Twitter users share more content, they'll benefit from fresh content and data to help users find and discover even more content. Imagine if these companies could access toolbar data from Google, Yahoo or Microsoft and map the information to their graphs of connections. You could get recommendations based on what your connections find interesting around the web. Advanced privacy tools would be required to avoid sharing content based on web browsing data with someone who might connect it back to a friend. For example, a story recommended to you might be about a baseball team. If you have a single friend who is a loyal fan of the team, you could guess that the story was recommended to you because they had read the article somewhere. Until such tools exist, or wherever privacy concerns remain, content filters could zoom out from your closest connections so you’re unlikely to link the info you see to what people you know are doing.
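One way to picture the zooming-out filter is the sketch below. It recommends a story on the strength of your friends only when enough of them read it, otherwise widens to friends plus friends of friends, and drops the story if even that group is too small. The minimum of 3 readers, the two circles and all names are assumptions for illustration, and a real privacy tool would need far more than a single threshold.

```python
def recommend(readers_by_story, friends, friends_of_friends, min_readers=3):
    """Suggest stories without pointing at any single connection.

    Each story is attributed to the narrowest circle in which at least
    min_readers people read it; stories that never reach the minimum
    are not recommended at all.
    """
    circles = (
        ("friends", friends),
        ("friends of friends", friends | friends_of_friends),
    )
    suggestions = []
    for story, readers in sorted(readers_by_story.items()):
        for circle_name, circle in circles:
            if len(readers & circle) >= min_readers:
                suggestions.append((story, circle_name))
                break
    return suggestions


# Made-up social graph and reading history (hypothetical names).
friends = {"ann", "bob", "cat"}
friends_of_friends = {"dan", "eve", "fay"}
readers_by_story = {
    "baseball-story": {"ann", "dan", "eve"},  # only one close friend read it
    "concert-news": {"ann", "bob", "cat"},
    "niche-post": {"bob"},  # too few readers in any circle
}

suggestions = recommend(readers_by_story, friends, friends_of_friends)
```

In the baseball example from the text, the story would surface only as something popular in your wider network, so the single loyal fan among your friends is harder to single out.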

Privacy issues include legal challenges that have stopped companies like NebuAd from using ISP data; users' identities being exposed when raw data is shared, as in the case of AOL, because users tend to search for their own names; and mistaken user expectations about the sharing of data, as seen with Facebook Beacon. The legal issue is an open question, but summaries of data through an API could be a big step towards avoiding the privacy problems of sharing raw data. Google manages to use toolbar data for powering services without creating user mistrust, so maybe other companies can do so as well.

Those privacy challenges could make it hard for Twitter and Facebook to get the full benefit of their social graph data. That reduces the competitive advantage Facebook and Twitter hold over Google, Yahoo and Microsoft, which don’t have popular, explicit social graphs. However, social filtering could be a key draw for users, such as looking at what friends, friends of friends or people with similar profiles and interests are reading and sharing. This is already evident with the popularity of Facebook and Twitter. Toolbar data aside, Facebook and Twitter have their own user data that could be shared with developers to help them build better ways for users to discover, consume and share on their platforms.

Questions might be raised about why Facebook and Twitter should get complete access to the toolbar data when it's not feasible to let all developers see personal data about each user, but these companies could agree to uphold the privacy policies put in place by Google, Yahoo and Microsoft. Special agreements to get data access could also apply to Alexa and other companies that together have millions of users with toolbars. Anonymizing data at scale is hard, but trusted partners could get access to the same data these companies hold. Google already is understood to use toolbar data for Google Trends and the Ad Planner services, and hasn't ruled out using toolbar data in search rankings. Facebook, Twitter and other developers could make use of the data as well.

How to get more data

It's not known for sure outside of the companies holding toolbar data exactly how useful it is to search and would be to discovery, but some people I know at data-driven start-ups say they would be thrilled to use the data and that it could be game changing. A long list of companies, including a number of start-ups, Facebook Beacon and many others, have tried or are trying to get this data on their own using toolbars or some other way, but so far their toolbar distribution and/or data access has ended up as, or remains, a fraction of that of the major web players. OneRiot uses a toolbar to get data to influence search results. OneRiot gets data from hundreds of thousands of toolbar users each day, and the total number of URLs visited each day is five times the number of URLs shared on all of Twitter each day. OneRiot plans to make data including anonymous user activity collected from the toolbar available to developers. StumbleUpon, which recently passed 10 billion random website visits, offers an optional toolbar to help discover content around the web. StumbleUpon chief executive Garrett Camp tells me the service has millions of active users and recently passed 600 million ratings set by users across 32 million pages. Word as of last year was that the company was looking at using its data and Yahoo Boss to rerank search results, and in April the founders and new investors bought back the company from eBay. Hopefully Yahoo Boss and other open search initiatives will continue to expand as part of the recent Yahoo-Microsoft deal. But StumbleUpon is the exception as a service that has attracted a lot of downloads. Twitter search and web search developers have to build up user bases to get usage data, which is slowing the advancement of those services.

For now, we need one of Google, Microsoft and Yahoo to make toolbar data available to see if there'd be massive advances by third-party developers. The first company to do this will open a new market, so who will be the first, and why? These big companies could find a way – like Facebook has – to build a vibrant developer community using their data and distribution. Perhaps the data provider can participate in the value created by charging for accessing and using the data. Or maybe Facebook, Twitter (or Twitter developers) will be the first to share click and other usage data. No longer would you have to build a multibillion-dollar company to get access to data that could be used to make significant advances for users. All these companies would need to feel comfortable that users' privacy is protected. Googlers tell me it’s highly unlikely Google would release toolbar data because the data is too valuable, there are privacy concerns and users might be surprised to see how much Google knows about them. Maybe the data is too important to Google so the company must hold it close to maintain its competitive advantage, but perhaps Yahoo and Microsoft would be willing to share data with select partners while tightly protecting user privacy for a chance to increase their competitiveness with Google. Or maybe these companies will try to build discovery services themselves.

In the meantime, those potential advances in search, discovery and more are being stifled. Developers using Twitter, Yahoo and Microsoft search APIs could make better services for users with more data from those companies as well as from Google, Facebook, Mozilla and others, such as data analytics companies. Whatever the chance that these companies will share more data with developers, it's worth figuring out what would be possible if they did.

[Disclosure: Doug Sherrets owns some Facebook shares and he works for Slide.]

Thanks to Ada Chen, Ashvin Kumar, Azra Panjwani, Chris Messina, David King, Eric Eldon, Eugene Shteyn, Gavin Joughin, Greg Linden, Greg Sterling, Ido Green, Itamar Herzberg, Jack Abraham, Jesse Farmer, Jing Chen, Joe Greenstein, Jon Turow, Josh Elman, Julian Gutman, Justin Smith, Kara Swisher, Keith Rabois, Lars Kamp, Loren Brichter, Rishi Mandal, Ryan Elmore, Sachin Rekhi, Scott Banister, Suhail Doshi, Trip Adler, Vik Singh, Yan-David Erlich, people quoted above and others for reading drafts of this post.