| This 34 message thread spans 2 pages |
|In the era of increasing numbers of bad bots, is robots.txt irrelevant?|
Why bother? Why not go the other way and white-list with controls?
| 9:29 pm on Feb 18, 2006 (gmt 0)|
I just read through a number of posts in an attempt to understand what, if anything, can be done about bad bots.
I found the proposals for white-listing interesting: Throttle all bot type activity except bots with benefits. :)
Based upon what I've been reading I'm reduced to asking this about robots.txt:
If we bother at all, how much?
| 9:14 pm on Feb 20, 2006 (gmt 0)|
|The problem with bad bots is only going to get worse with time. |
It is already a whole lot worse than most would realize.
Banning those bad bots is not enough anymore. They just turn around and scrape content off SE-provided caches (MSN preview, Google Cache, the list goes on...). Those crawls do not show up in our logs, so unless one goes looking for those copycats proactively, one would never know...
No wonder many are seriously considering adding no-archive to all their pages. Brett has been doing it for years.
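For anyone who hasn't used it, the no-archive directive being discussed is a per-page meta tag that the major engines honor:

```html
<!-- Tells crawlers they may index the page but must not keep a public cached copy -->
<meta name="robots" content="noarchive">
```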
| 9:45 pm on Feb 20, 2006 (gmt 0)|
Forgot to mention I have Ask whitelisted too, but not Looksmart.
Before I started building the whitelist I looked to see who had been sending me traffic over the last 12 months and any SE with no meaningful ROI for the crawl got dropped off the list.
It's an easy judgement call if you give up 40,000 pages a month to a SE and only get 3 visitors a month in return. Blocking that crawler is a real no-brainer in my mind compared to allowing other crawlers that may send you hundreds or thousands of visitors a day in return for the crawl.
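That judgement call boils down to a simple ratio check. A minimal sketch of the idea (the threshold is made up for illustration; pick one that matches your own sense of a worthwhile crawl):

```python
def crawl_roi(pages_crawled_per_month, referral_visits_per_month):
    """Visitors received per page served to the crawler."""
    if pages_crawled_per_month == 0:
        return float("inf")  # no crawl cost at all
    return referral_visits_per_month / pages_crawled_per_month

def worth_whitelisting(pages_crawled, referrals, min_roi=0.001):
    # e.g. demand at least 1 visitor back per 1,000 pages crawled
    return crawl_roi(pages_crawled, referrals) >= min_roi

# The 40,000-pages-for-3-visitors case above:
worth_whitelisting(40000, 3)   # drop it from the whitelist
worth_whitelisting(5000, 600)  # crawl pays for itself
```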
|No wonder many are seriously considering adding no-archive to all their pages |
Yup, it's the only way to lock your content down, as the SEs can't even stop bots that hit them from a series of anonymous proxy servers so the requests appear random. That's why I'm also on a vendetta to stop all access by anonymous proxy servers I can detect; I've noticed crawls ping-pong'ing between several IPs that weren't even closely related, and it turned out to be anonymous proxies.
Besides, leaving old cached content on SEs exposes your site in other ways that I'd prefer not to get into on this thread; that's a whole different debate.
| 8:11 pm on Feb 21, 2006 (gmt 0)|
|Let me qualify the word OBEYS robots.txt as the bad bots try to get just enough information to fly under the radar undetected which is why I whitelist good bots by IPs which stops that problem in its tracks. |
Perhaps I've missed a post in this thread, but I can't see how any of this whitelisting stuff can work. Sure, it'll work against bots that use certain User-Agent strings, but what about those which use regular browser strings?
Not all bots are high impact, either; you aren't going to stop a bot with Internet Explorer's User-Agent string that makes only 100 requests in a day.
It's better not to blacklist or whitelist, but to actually serve defective content to people you don't want crawling. That way they think they're getting something useful, but they're really not (and it'll take them a lot longer to try other techniques). Fail quietly.
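A "fail quietly" responder might look something like this sketch. The scraper-detection signal is assumed to come from elsewhere (behavioral checks, proxy lists, etc.), and the degradation here is deliberately crude:

```python
import random

def degrade(html, seed):
    """Serve something that looks like a page but is quietly useless:
    shuffle the words so scraped copies carry no readable content."""
    rng = random.Random(seed)  # same seed -> same junk, so it looks consistent
    words = html.split()
    rng.shuffle(words)
    return " ".join(words)

def respond(request_ip, looks_like_scraper, real_page):
    # No 403, no CAPTCHA - a visible block just tells the scraper to try
    # another disguise. Defective content keeps it harvesting garbage.
    if looks_like_scraper:
        return 200, degrade(real_page, seed=request_ip)
    return 200, real_page
```

Keying the shuffle on the requesting IP means repeat visits get identical (still useless) output, which makes the sabotage harder to notice.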
| 9:21 pm on Feb 21, 2006 (gmt 0)|
|but I can't see how any of this whitelisting stuff can work |
I bounce thousands of page requests a day from whitelisting alone; all the random bot names that come along just bounce off the wall, and they get nothing.
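Whitelisting "good bots by IPs" in practice usually means forward-confirmed reverse DNS rather than trusting the User-Agent string. A hedged sketch of that check (the suffix list is an assumption; use whichever engines actually earn a spot on your whitelist):

```python
import socket

# Hostname suffixes for crawlers worth whitelisting (illustrative list)
TRUSTED_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def has_trusted_suffix(hostname):
    """Pure check: does the hostname end in a trusted crawler domain?"""
    return hostname.endswith(TRUSTED_SUFFIXES)

def is_whitelisted_crawler(ip):
    """Forward-confirmed reverse DNS: look up the IP's hostname, check the
    domain, then resolve that name forward and confirm it maps back to the
    same IP. A bot that merely fakes a crawler User-Agent fails this."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not has_trusted_suffix(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```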
|but what about those which use regular browser strings ... you aren't going to stop a bot with Internet Explorer's User-Agent string that makes only 100 requests in a day |
A server-side script analyzes their behavior and challenges or bans them as well, so I'm stopping them, and people using AlexK's script might be as well.
It takes multiple techniques to squish the nonsense, but it can be squished for the most part.
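The behavioral side of that usually reduces to a per-IP sliding window over recent requests. A minimal sketch with made-up thresholds (tune them to your own traffic; this is not AlexK's script, just the general shape of the technique):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
CHALLENGE_AT = 30    # requests per window: show a challenge (e.g. CAPTCHA)
BAN_AT = 100         # requests per window: ban outright

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def check_ip(ip, now=None):
    """Return 'ok', 'challenge', or 'ban' for this request."""
    now = time.time() if now is None else now
    window = hits[ip]
    window.append(now)
    # Drop requests that have aged out of the window
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) >= BAN_AT:
        return "ban"
    if len(window) >= CHALLENGE_AT:
        return "challenge"
    return "ok"
```

Because it keys on behavior rather than the User-Agent, it catches the 100-requests-a-day bot hiding behind an Internet Explorer string just as well as a noisy scraper.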