Search Wikia, the highly anticipated search engine by Wikia, the for-profit company of Wikipedia co-founder Jimmy Wales, will launch publicly on Monday. It is currently in private testing mode, and we’ll write more upon launch.

The huge success of Wikipedia in mobilizing humans makes this project particularly notable. It’s a fascinating alternative to Google’s computer-focused approach.

We’ve tested Grub, the service’s way of crawling the Internet’s web sites to collect data. Grub is a “distributed search crawler,” so named because it lets people download software that does the crawling from their own computers, thereby letting thousands of people contribute to the process. It is intuitive and easy to use. However, large questions remain about the ability of Search Wikia’s approach to scale to the entire Web.

Wikia, the parent company, already has a live service independent of its Search Wikia efforts. The current site hosts free wikis (areas of the site open for collaborative editing) for communities under an ad-supported model. The resulting coverage usually goes into more detail than the average Wikipedia article. Wikia’s wikis use the same software that powers Wikipedia.

Search Wikia will borrow the ideas and principles that have made Wikipedia so successful: strong community emphasis, transparency, freedom to contribute and free licensing.

Search Wikia’s lofty aspirations of transparency raise some very important questions about the ability of spammers to manipulate search results. All the major search engines guard their ranking algorithms closely in order to prevent such manipulation. It’s clear that Search Wikia will rely on the same kind of community monitoring and self-policing that have made the fully open Wikipedia increasingly popular in spite of the same threats. According to a story by New Scientist, Jeremie Miller, the search project’s technology head honcho, says the search service will integrate wiki-like tools to improve search. The ability to vote on search results is one example of such social tools.

Search Wikia will also rely on a cadre of volunteers to help it crawl the web with the Grub distributed web crawler. The Grub client is a consumer desktop application that harnesses spare CPU cycles on volunteers’ machines and crawls a small portion of the Web. The New Scientist article informs us that the January 7 launch product will have an index of approximately 100 million pages. Given the size and scale of the Web, this is a relatively unimpressive number and quite possibly not big enough to cover one vertical (say, Sports or Health), much less the horizontal universe of queries that any general search engine must be prepared to handle. That being said, widespread usage of Grub by hundreds of thousands of volunteers and an index that actually scales to the Web would be a disruptive development and a new way to think about search.

We tested the Grub crawler client (screenshots below) on a dual-core Lenovo ThinkPad T60 laptop running Windows XP. The download and install process was a snap even though the Windows client is running in TEST mode and is expected to be buggy. We ran Grub over a Comcast cable connection for an hour and found that it crawled pages alphabetically by domain name. We also found that the Grub client revisited previously crawled pages to keep content fresh and updated a page only when required. We don’t know yet how the Grub system decides which URLs to crawl. It would also be interesting to see published estimates from Search Wikia on how many client installations it takes on average to build a crawl of, say, 30 billion URLs.
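Until such estimates are published, a back-of-envelope calculation gives a rough sense of scale. The sketch below is ours, not Search Wikia’s: the function name `clients_needed` and every input figure (pages fetched per second, hours each volunteer leaves the client running, length of a crawl cycle) are hypothetical assumptions chosen for illustration, not measurements of Grub.

```python
import math


def clients_needed(total_urls: int,
                   pages_per_second: float,
                   hours_per_day: float,
                   crawl_days: float) -> int:
    """Estimate how many volunteer clients it takes to fetch total_urls once.

    All rates are assumed averages per client; none come from Grub itself.
    """
    if min(pages_per_second, hours_per_day, crawl_days) <= 0:
        raise ValueError("rates and durations must be positive")
    # Pages a single client can fetch over the whole crawl cycle.
    pages_per_client = pages_per_second * 3600 * hours_per_day * crawl_days
    # Round up: a fraction of a client still means one more installation.
    return math.ceil(total_urls / pages_per_client)


if __name__ == "__main__":
    # Hypothetical inputs: 1 page/s per client, 4 hours a day, 30-day cycle.
    print(clients_needed(30_000_000_000, 1.0, 4, 30))
```

With those made-up inputs, a single pass over 30 billion URLs would take on the order of 70,000 active installations, and that ignores recrawls, duplicate fetches and volunteers who drop out. Change any assumption and the answer moves proportionally, which is why published figures from the company would be useful.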

We also anticipate that Search Wikia will rely on the same type of developer community that created world-class open-source projects like Mozilla Firefox and Linux. An April 2007 article in Fast Company says developers have been enthusiastic about being able to tweak complex search algorithms in an open-source environment. It’s easy to imagine a lot of talented developers wanting to try their hand at a problem that’s technically challenging on several fronts (see Anna Patterson’s article: Why Writing Your Own Search Engine Is Hard).

However, this is where the comparisons to Wikipedia become less believable. Wikipedia’s model of allowing anyone to edit pages around particular topics has been successful in part because everyone considers themselves expert enough to contribute cogently on a few topics. The same cannot be said of search technology. It’s unclear whether this community can deliver on the requirements of a web search engine.

Finally, there’s the question of organic traffic to the service. Wikipedia sites constitute the eighth largest set of properties on the Web according to Internet analytics firm ComScore. Wikipedia is a certified Internet brand, but as of December 2006, Google accounted for 50 percent of its incoming traffic, barring certain caveats (see Rich Skrenta’s post for more). It seems unlikely that Google will send the same volumes of traffic to a competing search service. This also means that Wikia must face the unappetizing task of getting users to change ingrained search behavior and start their Web surfing at a site other than Google.

Note also that another recent company that started life as a human-powered search engine, Mahalo, seems to be relying on Google SEO for distribution and traffic. As of December 29, 2007, Google has indexed 79,600 pages from the domain Mahalo.com, quite possibly as acknowledgment of the difficulties of driving organic traffic to a competing search engine. Over the next few days, we also plan to investigate questions around the company’s business model, its organizational structure for community developers, the set of social features it will launch with, and when it expects to scale to be able to serve a large portion of queries well.

wikisearch2.jpg

wikisearch3.jpg


5 Trackbacks

  1. Wikia on VentureBeat « Life in the Bit Bubble said:

    [...] Here it is if you’d like to follow the link and read it: http://venturebeat.com/2008/01/04/wikia-to-launch-new-search-engine-jan-7/ [...]

  2. VentureBeat » Search Wikia launches: Will it threaten Google? said:

    [...] Jimmy Wales, who has led online encyclopedia Wikipedia to considerable success (see our previous coverage). We spoke with Wales (pictured here) under embargo last week about his [...]

  3. Search Engine Optimization Direct » Blog Archive » Wikia to launch new search engine January 7 said:

    [...] Per and Susanne Koch article is brought to you using rss feeds.Here are some of the top articles on search engine optimization.Search Wikia, the highly anticipated search engine by Wikia Inc., Jimmy Wales’ (Wikipedia’s co-founder) for-profit company, will launch publicly on January 7 and is currently in private testing mode. We tested the service’s distributed … [...]

  4. VentureBeat » Search Wikia gets slammed, here’s our review said:

    [...] covered some of its business and social aspects and an initial look when it first launched, but have since had more time to [...]

  5. Overheard: Why do we need Search Wikia? - Overheard in the tech blogosphere said:

    [...] wrote earlier: We’ve tested Grub, the service’s way of crawling the Internet’s web sites to collect data. [...]

6 Comments

  1. Kevin Burton said:

    The biggest issue with building a crawler is not the bandwidth; it’s scaling and having fast access to local IO.

    With Spinn3r:

    http://spinn3r.com

    We have a distributed crawler but we run it within our own cluster because having 10k clients wouldn’t really buy us anything.

    Of course maybe from Wikia’s perspective this is just a blind HTTP fetch task and they then aggregate it locally within their cluster.

    The comparison to Wikipedia might fall down here. Wikipedia has about 1.5M English pages. The net has billions.

    You can’t just rely on humans for this stuff.

    Kevin

  2. Saumil Mehta said:

    Actually, we should clarify “distributed search crawler” some more. I would assume that anyone that wants to crawl splits the crawl across n machines. I wasn’t aware of your point around local IO being a bottleneck.

    I am unable to find the Grub source code or a whole lot of technical literature on the system but that is going to be corrected very soon, I’m told.

  3. TS said:

    Actually, access to local I/O is usually NOT the main bottleneck in crawling. Assuming of course that one does not write out each retrieved page in a separate random write, and this depends on the crawler architecture that is used. Of course, once you crawl tens of thousands of URLs per second, almost everything becomes a bottleneck, but there are only a few dozen players that have a need for that kind of speed.

    The main bottlenecks in large-scale crawling (in this order) are probably crawl management (i.e., the human/software complexity side of managing and scaling a crawl to millions of hosts) and the bandwidth. The CPU power is not a major issue. Grub basically harvests bandwidth from clients.

    But I would be concerned about the crawl management part of the grub approach - are they using a fairly brute-force approach to recrawling that wastes (other people’s) bandwidth, as opposed to the smarter recrawling strategies used by the major engines? How do they deal, e.g., with requests by sites to immediately cease crawling a domain (due to possible or perceived misbehavior of the crawler or local problems at the site)? And it is not clear how grub is really fitting into the whole wikia approach.

  4. Saumil Mehta said:

    Well, I do understand how Grub fits into Wikia from a high level standpoint. They need to get a substantial crawl going to build a real search engine and they don’t have Microsoft’s and Yahoo’s millions to start from scratch.

    As to your other question, the grub website is fairly light on docs about how they actually do the crawl and doesn’t have the source code posted for people to tinker around. I expect that to change relatively soon.

  5. Chris said:

    How many people search beyond the first 2-5 pages? What is this need to index everything?

  6. Saumil Mehta said:

    Chris,

    The need to index the entire web stems not from the number of results, but from the ability to serve an essentially infinite number of queries. For example, you can index 100 million pages and just serve queries related to Entertainment or Sports. But then you can’t compete with Google anymore!
