I found the proposals for white-listing interesting: Throttle all bot type activity except bots with benefits. :)
Based upon what I've been reading I'm reduced to asking this about robots.txt:
Why bother?
If bother, how much?
The problem with bad bots is only going to get worse with time.
It is already a whole lot worse than most would realize.
Banning those bad bots is not enough anymore. They just turn around and scrape content off SE-provided caches (MSN preview, Google Cache, the list goes on...). Those crawls do not show up in our logs, so unless one proactively goes looking for those copycats, one would never know...
No wonder many are seriously considering adding no-archive to all their pages. Brett has been doing it for years.
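For anyone wondering what "adding no-archive to all their pages" could look like in practice, here is a rough sketch. It assumes the site is served through a Python WSGI app (my assumption, nobody in this thread said so) and sends the `noarchive` directive as an `X-Robots-Tag` response header instead of editing a `<meta name="robots" content="noarchive">` tag into every page. The name `add_noarchive` is made up for illustration.

```python
def add_noarchive(app):
    """Wrap a WSGI app so every response carries 'X-Robots-Tag: noarchive'.

    Hypothetical helper: the header asks compliant search engines not to
    keep a cached copy. It does nothing against bots that ignore it.
    """
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Append the directive without disturbing the app's own headers.
            headers = list(headers) + [("X-Robots-Tag", "noarchive")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped


def demo_app(environ, start_response):
    # Stand-in application used only to show the wrapper working.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>hello</body></html>"]


noarchive_app = add_noarchive(demo_app)
```

Only well-behaved SEs honor the directive, so this closes the cache route and nothing else.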
Ask, Looksmart
Forgot I have Ask whitelisted too but not Looksmart.
Before I started building the whitelist I looked to see who had been sending me traffic over the last 12 months, and any SE with no meaningful ROI for the crawl got dropped off the list.
It's an easy judgement call if you give up 40,000 pages a month to an SE and only get 3 visitors a month in return. Blocking that crawler is a real no-brainer in my mind compared to allowing other crawlers that may send you hundreds or thousands of visitors a day in return for the crawl.
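The judgement call above can be put into numbers. This is only a sketch: the function names and the cutoff of 1 visitor per 1,000 pages crawled are my own placeholders, not a figure anyone in this thread recommended.

```python
def visitors_per_thousand_pages(pages_crawled: int, visitors_referred: int) -> float:
    """Visitors sent back per 1,000 pages the crawler took."""
    if pages_crawled <= 0:
        raise ValueError("pages_crawled must be positive")
    return 1000.0 * visitors_referred / pages_crawled


def worth_whitelisting(pages_crawled: int, visitors_referred: int,
                       min_ratio: float = 1.0) -> bool:
    """True if the crawler returns at least min_ratio visitors per 1,000 pages.

    min_ratio is a hypothetical threshold; pick one that suits your site.
    """
    return visitors_per_thousand_pages(pages_crawled, visitors_referred) >= min_ratio


# The example from the post: 40,000 pages a month for 3 visitors a month.
low_value = worth_whitelisting(40_000, 3)       # ratio is 0.075
# A crawler that sends 300 visitors a day (about 9,000 a month) for the same crawl.
high_value = worth_whitelisting(40_000, 9_000)  # ratio is 225.0
```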
No wonder many are seriously considering adding no-archive to all their pages
Yup, it's the only way to lock your content down, as the SEs can't even stop bots that hit them through a series of anonymous proxy servers and so appear to be random visitors. That's why I'm also on a vendetta to stop all access by any anonymous proxy servers I can detect: I've noticed crawls ping-ponging between several IPs that aren't even closely related, and they turned out to be anonymous proxies.
Besides, leaving old cache content on SEs exposes your site in other ways that I'd prefer not to get into on this thread; that's a whole different debate.
Let me qualify the phrase "OBEYS robots.txt": the bad bots try to get just enough information to fly under the radar undetected, which is why I whitelist good bots by IP, which stops that problem in its tracks.
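A minimal sketch of whitelisting good bots by IP, for anyone who wants to try it. The bot names and address ranges below are invented (the ranges are the reserved documentation networks), so swap in the ranges the real crawlers publish. The idea is that a claimed bot name only counts when the request also comes from that bot's known network, so a spoofed User-Agent gets nothing.

```python
import ipaddress

# Hypothetical whitelist: User-Agent token -> networks that bot crawls from.
# These are RFC 5737 documentation ranges, not real crawler addresses.
ALLOWED_BOTS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
    "samplecrawler": [ipaddress.ip_network("198.51.100.0/24")],
}


def is_whitelisted_bot(user_agent: str, remote_ip: str) -> bool:
    """Accept a bot only if its name AND its source IP both match the list."""
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # unparseable address, treat as not whitelisted
    ua = user_agent.lower()
    for token, networks in ALLOWED_BOTS.items():
        if token in ua:
            # Name matches, so the address must be inside one of its networks.
            return any(addr in net for net in networks)
    return False  # unknown bot name, bounces off the wall
```

Note this function only answers "is this a whitelisted bot?". Requests with regular browser strings need a separate check.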
Perhaps I've missed a post in this thread, but I can't see how any of this whitelisting stuff can work. Sure, it'll work against bots that use certain User-Agent strings, but what about those which use regular browser strings?
Not all bots are high impact, either; you aren't going to stop a bot with Internet Explorer's User-Agent string that makes only 100 requests in a day.
It's better not to blacklist or whitelist, but actually serve defective content to people you don't want crawling. That way they think they're getting something useful, but they're really not (and it'll take them a lot longer to try other techniques). Fail quietly.
but I can't see how any of this whitelisting stuff can work
I bounce thousands of page requests a day from whitelisting alone, as all the random bot names that come along just bounce off the wall; they get nothing.
but what about those which use regular browser strings ... you aren't going to stop a bot with Internet Explorer's User-Agent string that makes only 100 requests in a day
A server-side script analyzes their behavior and challenges or bans them as well, so I'm stopping them, and people using AlexK's script might be as well.
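To give a feel for the behavior-analysis idea, here is a bare-bones sketch. It is not my script or AlexK's. It looks at one signal only, requests per IP inside a sliding time window, and the class name and thresholds are made up. A slow bot doing 100 requests a day would sail under these numbers, so a real setup needs more signals than this.

```python
from collections import defaultdict, deque


class BehaviorMonitor:
    """Track recent request times per IP and escalate: allow, challenge, ban."""

    def __init__(self, window_seconds: float = 60.0,
                 challenge_after: int = 30, ban_after: int = 120):
        # All three thresholds are hypothetical defaults.
        self.window_seconds = window_seconds
        self.challenge_after = challenge_after
        self.ban_after = ban_after
        self._hits = defaultdict(deque)  # ip -> timestamps inside the window
        self._banned = set()

    def record(self, ip: str, now: float) -> str:
        """Register one request at time `now` and return the action to take."""
        if ip in self._banned:
            return "ban"
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) >= self.ban_after:
            self._banned.add(ip)
            return "ban"
        if len(hits) >= self.challenge_after:
            return "challenge"
        return "allow"
```

Call `record()` once per request with the client IP and a timestamp such as `time.time()`, then serve the page, show a challenge, or refuse, depending on the answer.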
[webmasterworld.com...]
It takes multiple techniques to squish the nonsense, but it can be squished for the most part.