incrediBILL - 1:48 pm on Jan 2, 2011 (gmt 0)
> By the way, we considered white-listing bots in robots.txt (thus banning all unknown robots). However, we concluded that we would ban many important search engines in countries we know nothing about.
Not really, because with either blacklisting or whitelisting you still have to keep an eye on what's new making requests on your server. The difference is that with whitelisting you get to decide how your data is exported, but with blacklisting it's too late: the data is out the gate, and trying to stop it from being used after the fact is a real problem. By monitoring what's asking for robots.txt you can find the things that would actually honor robots.txt, and add anything useful to your whitelist.
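For reference, a whitelist-style robots.txt is just a handful of named allow records followed by a catch-all disallow. The crawler names below are examples only, not a recommended list:

```
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
```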
Robots.txt by itself is really toothless. What most people don't do is also build a whitelisted .htaccess file, which is extremely important: it's a hard block that stops anything not on the whitelist from simply ignoring robots.txt.
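A minimal sketch of what that hard block can look like with Apache's mod_rewrite. The bot names are placeholders; a real whitelist would carry your own vetted list (and ideally verify crawler IPs, since user agents are trivially forged):

```apache
RewriteEngine On
# Let whitelisted crawlers through untouched (names here are examples only)
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot) [NC]
RewriteRule .* - [L]
# Hard-block anything else that identifies itself as an automated agent
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|scrape) [NC]
RewriteRule .* - [F]
```

Note the ordering matters: the whitelist rule has to fire first with [L], otherwise "Googlebot" would also match the bot/crawl pattern below it.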
Trust me, monitoring new robots.txt requests is a lot less work than monitoring for bad activity that needs to be blocked.
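As a rough sketch of that monitoring, here's how you might pull the distinct user agents that asked for robots.txt out of a combined-format access log. The two sample lines stand in for a real log file:

```shell
# Sample combined-format log lines standing in for a real access.log
cat > access.log <<'EOF'
1.2.3.4 - - [02/Jan/2011:13:48:00 +0000] "GET /robots.txt HTTP/1.1" 200 123 "-" "ExampleBot/1.0"
5.6.7.8 - - [02/Jan/2011:13:49:00 +0000] "GET /index.html HTTP/1.1" 200 456 "-" "Mozilla/5.0"
EOF
# When a line is split on double quotes, field 2 is the request and field 6
# is the user agent; tally UAs that fetched robots.txt
awk -F'"' '$2 ~ /robots\.txt/ {print $6}' access.log | sort | uniq -c | sort -rn
```

Anything new that shows up in that list is a candidate for the whitelist; anything crawling pages without ever appearing in it never even read robots.txt.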
It's doing the work smart vs. doing it the hard way: blacklisting is an infinite time suck, while whitelisting is finite.
Besides, if you're currently getting referral traffic from these search engines you already know which ones to add to your white list.
If you aren't getting any referral traffic, they're just a drain on your resources.
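Checking for that referral traffic is the same kind of log pull, here counting referring hosts from a combined-format log (the sample lines and the host name are stand-ins):

```shell
# Sample combined-format log lines standing in for a real access.log
cat > access.log <<'EOF'
1.2.3.4 - - [02/Jan/2011:14:00:00 +0000] "GET /page.html HTTP/1.1" 200 999 "http://www.example-engine.com/search?q=foo" "Mozilla/5.0"
5.6.7.8 - - [02/Jan/2011:14:01:00 +0000] "GET /page.html HTTP/1.1" 200 999 "-" "Mozilla/5.0"
EOF
# Field 4 of a quote-split line is the referer; strip it to the host and tally
awk -F'"' '$4 != "-" {print $4}' access.log | awk -F'/' '{print $3}' | sort | uniq -c | sort -rn
```

Engines that appear high in that tally earn a whitelist entry; ones that never appear are only costing you bandwidth.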
I'd like to whitelist Blekko, I'm just a NOARCHIVE away from doing it! :)
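For anyone unfamiliar, NOARCHIVE here means the robots meta tag (or the equivalent X-Robots-Tag response header) that tells a crawler not to keep a cached copy of the page:

```html
<meta name="robots" content="noarchive">
```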
> and archive.org offers you NO way of removing back content.
I did everything they documented to stop archive.org from crawling or showing my sites on archive.org and it didn't work.
However, they provide an email address; I wrote to them, and now my content is blocked from searching on archive.org. They were very prompt about it too.
> I'm going to use robots.txt until Blekko changes its stance on this one.
Yep. That's the answer for now.
Not really: blekko doesn't check it very often.
Crawled: 23h ago
Robots: http://www.webmasterworld.com/robots.txt (last fetched: 20d ago)
They crawled WebmasterWorld a day ago but hadn't checked robots.txt in 20 days, and I saw other sites where robots.txt hadn't been checked in months "(last fetched: 79d ago)", so changing robots.txt won't stop them anytime soon.
IMO, robots.txt should be checked every 24h, but that's a whole new thread.