Craigslist really, really, really does not want any of its listings showing anywhere else on the web.
First, Craigslist blocked PadMapper and sued the one-person apartment-search startup. Then it changed the license under which users post ads to the site, making it exclusive, all in an attempt to ensure that what gets posted on Craigslist is only available on that site.
Now Craigslist has gone completely thermonuclear by preventing even search engines such as Google from accessing the site. At least, according to 3Taps, a service for accessing Craigslist data:
Above: Craigslist’s robots.txt file
Image Credit: John Koetsier
Because, of course, in order for Google to find what you’re looking for, it first has to have found that data itself. And stored it, and indexed it.
That’s the job of Googlebot, the search engine’s spider, which traverses the web, finding and reporting information.
It was a solution Paul Graham, legendary technologist, found innovative and “ingenious.”
There’s a way to squash spiders, however: robots.txt.
Put this file at the root directory of your web server, add some code to indicate what is and what is not allowed, and good spiders will do what they are told.
In Craigslist’s case, that means: don’t index our listings. And Googlebot, being a good corporate citizen, obeys.
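To make the mechanism concrete, here is a minimal Python sketch of how a well-behaved spider might consult robots.txt before fetching a page, using the standard library's urllib.robotparser. The rules and the example.com URLs below are invented for illustration; they are not Craigslist's actual file, and the helper name is_allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, NOT Craigslist's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler would skip the first URL and fetch the second.
blocked = is_allowed(
    SAMPLE_ROBOTS_TXT, "Googlebot", "https://example.com/private/listing.html"
)
open_page = is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", "https://example.com/index.html")
```

Nothing in the file enforces anything: the check is voluntary, which is why only "good" spiders are affected.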
This really is the thermonuclear option, however, because telling the world’s leading search engine to stop indexing your content is going to result in literally millions fewer listings on Google … and potentially a huge concurrent drop in traffic.
Only a very confident — and very angry — company would do something this bold, this risky.
Craigslist has shown that it does not want any other parties using its listings and is ready to sue seemingly at the drop of a hat. And it’s not without supporters, including here at VentureBeat.
But if the company keeps jumping on rivals, it may have to drop the peace icon that shows up when you visit the site (see the screenshot above).
Because if there’s one thing the little company with big traffic has besides a virtual lock on the free listings market, it’s plenty of fight.
A quick note:
I just checked Craigslist’s top-level robots.txt file and I’m not seeing how listings are being blocked.
The third chunk of text applies to all robots, which would include Googlebot. Areas that are disallowed, however, do not seem to include /apa/, which is Craigslist’s top level category for apartment listings, for instance.
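The kind of check described here can be sketched in a few lines of Python: group the Disallow paths by the User-agent they apply to, then see whether a path such as /apa/ falls under the wildcard group. This is a simplified reading of the format that assumes plain prefix matching and ignores Allow lines and pattern characters inside paths. The sample rules (ExampleBot, /reply, /search) are made up for illustration and are not the contents of Craigslist's file.

```python
# Hypothetical rules for illustration only, NOT Craigslist's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /reply
Disallow: /search
"""


def disallowed_by_agent(robots_txt):
    """Map each User-agent value to the list of paths disallowed for it."""
    groups = {}
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents = []
                in_rules = False
            current_agents.append(value)
            groups.setdefault(value, [])
        elif field == "disallow":
            in_rules = True
            if value:  # an empty Disallow means "nothing is disallowed"
                for agent in current_agents:
                    groups[agent].append(value)
    return groups


def wildcard_blocks(robots_txt, path):
    """Return True if the '*' group disallows path (simple prefix match)."""
    prefixes = disallowed_by_agent(robots_txt).get("*", [])
    return any(path.startswith(prefix) for prefix in prefixes)


groups = disallowed_by_agent(SAMPLE_ROBOTS_TXT)
apa_blocked = wildcard_blocks(SAMPLE_ROBOTS_TXT, "/apa/")
```

With these sample rules, /apa/ is not covered by the wildcard group, which mirrors the observation above about the top-level file.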
However, sites do sometimes have multiple robots.txt files, and this is what 3Taps is currently reporting on its home page:
At approximately noon on Sunday August 5th, Craigslist instructed all general search engines to stop indexing CL postings — effectively blocking 3taps and other 3rd party use of that data from these public domain sources. We are sorry that CL has chosen this course of action and are exploring options to restore service but may be down for an extended period of time unless we or CL change practices. As soon as we know more, we will share it here and on our Twitter account.
I’ve asked a number of search engine optimization experts for their opinions and will update this post when I hear back.
[ update ]
Here’s an explanation from search marketing specialist Graeme McLaughlin:
There are a few ways you can disallow bots.
Explicit and wildcard via robots.txt – Craigslist is using both, and the wildcard will apply to Google.
Meta Robots in the HEAD section of the HTML.
The “*” is a wildcard and means that robots can NOT crawl the directories listed beneath it. For Craigslist to do this, they must be confident that their brand in the free listings niche will continue to drive traffic for them.
They are allowing the geo home page for each city to be indexed. This covers them for users doing branded / navigational searches which are becoming more common with browsers such as Chrome and the new Safari using the URL bar as an omni search box.
The robots.txt file stops engines and bots from crawling the site, but Craigslist is also using on-page meta robots commands on their listings pages. Example:
<meta name="robots" content="NOARCHIVE,NOFOLLOW">
The NOARCHIVE prevents a cached copy of the page being available in the search results, and NOFOLLOW instructs bots not to follow the links on the page.
They are using a combo tactic to isolate the pages they don’t want in the index and, in effect, blocking them from the third-party tools.
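For readers curious how a bot would read those on-page commands, here is a small Python sketch that pulls the directives out of a meta robots tag using the standard library's html.parser. The sample page is a stripped-down stand-in built around the tag quoted above, not a real Craigslist listing, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                token.strip().lower() for token in content.split(",") if token.strip()
            )


def robots_directives(html):
    """Return the set of lower-cased meta robots directives found in html."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


# Stand-in page for illustration; only the meta tag comes from the article.
SAMPLE_PAGE = (
    "<html><head>"
    '<meta name="robots" content="NOARCHIVE,NOFOLLOW">'
    "</head><body></body></html>"
)
directives = robots_directives(SAMPLE_PAGE)
```

A crawler that honors these directives would skip caching the page and would not follow its links. Note that the tag as quoted does not include a noindex directive.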