Forum Moderators: open
Re IRLbot on 128.194.135.81: I've been in contact with the operators at Texas A&M.
They confirm that the maximum rate of this bot is 50 requests/second to a single IP address - so 4.3 million requests/day. They don't intend to change this behaviour.
I now block this bot via robots.txt and IP range blocking.
best, a.
[edited by: volatilegx at 6:06 pm (utc) on July 13, 2006]
[edit reason] removed specifics and call to action [/edit]
38.119.52.66 - - [06/Jun/2006:02:05:42 -0700] "GET / HTTP/1.1" 403 815 "-"
"IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
IP address: 38.119.52.66
Reverse DNS: [No reverse DNS entry per auth1.dns.cogentco.com.]
Reverse DNS authenticity: [Unknown]
(Good old cogentco / Performance Systems International Inc. Again)
All of the legit IRLbot runs coming out of .cs.tamu.edu ask for robots.txt. (I don't know what it might do beyond that, because I block it by UA.)
The reason I was feeling irked was that I'd just spoken by email to one of the bot's owners, and he'd instructed me rather patronisingly that we should upgrade our hardware to service the extra traffic he generates.
He says (and this is a PhD'd computer scientist, mind you) that his bot is merely simulating normal user behaviour - which is a joke.
nancyb, you're right, I've blocked by robots.txt and haven't bothered with an IP block. The reason I considered it is that this robot will fetch robots.txt at a rate of up to 50 requests/second, and we host many sites hence many robots.txt files.
best, a.
I've also observed that students are given little or no instruction on the 'ethics and ethos' of the Web -- Such things as complying with robots.txt, avoiding copyright infringement, rate-limiting requests, and other subjects are not introduced, apparently. Either that, or the knowledge level of the professors or the skill level of the students is too low to bother with coding such niceties. Their 'minor code simplification' becomes our Denial Of Service Attack.
Instead, an 'attitude of entitlement' is imbued: "It's the Web -- We can take what we want, any way we want!" I can't say that this attitude is limited to this little part of the academic world, though -- it's endemic in all aspects of modern life, unfortunately.
The only competent response I've ever gotten from contacting a school was from the CS department head at the University of Sydney (and thanks).
Note to current students: Search for "process queueing."
Plonk! 403-Bye
Jim
According to their website it supports Crawl-delay, which I don't see as terribly abusive as long as they honor it:
The crawler is by default rate-limited to one HTML page per website per minute; however, this metric may dynamically change during the crawl depending on the size and popularity of each site. Robots.txt can be used to override this behavior by specifying the minimum delay between visits (in seconds):

User-agent: IRLbot
Crawl-delay: 100
[irl.cs.tamu.edu...]
When they hit my site a couple of days ago they were slightly slower than the default speed the website claims:
07/14/2006 03:47:38 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 03:49:51 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 04:22:14 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 05:38:13 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 05:41:35 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 07:59:49 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 08:07:35 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
Doesn't matter, I'm not letting them in as I see no value in this silly little bot.
[edited by: incrediBILL at 4:07 pm (utc) on July 14, 2006]
User-agent: IRLbot
Crawl-delay: 100
This is nice, but I don't have the time to research every small silly/rude/annoying/whatever bot.
Finally, after years of exercising some sort of ridiculous tolerance, I have changed my robots.txt to allow only a few hand-selected bots that seem useful (bots feeding public search engines, like Google, Yahoo, MSN, Ask) and disallow all others.
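A minimal sketch of that kind of whitelist robots.txt (these are the agent tokens as I understand them; check each engine's own documentation):

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: Teoma
Disallow:

User-agent: *
Disallow: /

An empty Disallow line means nothing is disallowed for that bot, and the final wildcard record shuts out everyone else.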
I don't know what all the other bots are really for, but I see no reason for them to take my content at my bandwidth's expense.
And for those abusive bots not complying with robots.txt, there are some additional .htaccess rules and a bot trap.
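The exact rules aren't shown here, but a minimal .htaccess sketch along those lines, assuming Apache with mod_rewrite (the second UA name is a placeholder), would be:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (IRLbot|SomeBadBot) [NC]
RewriteRule .* - [F]

The [F] flag answers with a 403 Forbidden, and [NC] makes the match case-insensitive.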
Case closed, and my sites' logs look much cleaner now.
Kind regards,
R.
supports Crawl-Delay
Only per-domain-name, not per-IP-address. So if you're hosting lots of domain names then it'll still crawl at a maximum rate of 50 hits/second (across all the different sites), even if you do set Crawl-Delay. That's my understanding.
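To put numbers on that: a server hosting 3,000 domains, each being crawled at the polite default of one page per minute, still sees 3,000 / 60 = 50 requests per second landing on the same IP address.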
pass judgement without understanding the nature of the beast
Well, I have exchanged emails with one of the bot's authors (totally unhelpful), and the network admins at Texas A&M (helpful). So I'm not saying I've seen the bot's source or anything like that, but I have got a decent picture of what they're up to.
'attitude of entitlement'
That's certainly the author's attitude. I've felt tempted to send him an invoice for the extra resources he's told me we ought to buy in order to cater for him.
All the best, a.
So if you're hosting lots of domain names then it'll still crawl at a maximum rate of 50 hits/second
That may be true, but if you had 100 domains and each one set a crawl-delay of 100 for this crawler, then even though it can ask at up to 50 hits/second, it could only fetch 100 pages per 100 seconds at most, which would throttle it somewhat.
I'd probably set it to crawl delay 3600 or higher just for giggles ;)
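In robots.txt terms, that's one page per domain per hour:

User-agent: IRLbot
Crawl-delay: 3600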
They really need to rethink 50 hits/second though, as that can bring some dynamic sites, like an ecommerce server, to their knees. Assuming it takes about 1 second to respond to each request, what with complex backend database processing and such, the server could barely keep up with 50 requests arriving every second. Now add lots of visitors and other search engines making casual requests at the same time, and it starts to lose the battle: pending requests build up a backlog faster than the machine can process them out.
Basically, the box would appear frozen to visitors and search engines at that time until the queue clears.
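Rough numbers: requests arriving at 50/second against a box clearing about 1/second means the backlog grows by 49 requests every second; after just one minute of that, nearly 3,000 requests are queued.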
I know this happens because that's what a certain Chinese crawler kept doing to my site, making 100s of requests a second. That's why I now throttle clients that make rapid requests and just give them a timeout instead of letting them chew up the CPU crunching out dynamic pages.
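For anyone wanting to do something similar off the shelf, one common Apache module for this is mod_evasive, which temporarily blocks clients that exceed a request threshold. A sketch with purely illustrative thresholds:

<IfModule mod_evasive20.c>
DOSHashTableSize 3097
DOSPageCount 5
DOSPageInterval 1
DOSSiteCount 50
DOSSiteInterval 1
DOSBlockingPeriod 60
</IfModule>

That would block, for 60 seconds, any client requesting the same page more than 5 times a second or more than 50 objects a second site-wide.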
I added --
<!--#exec cmd="sleep 3"-->
-- atop every daily-spidered .html page (about 140 or so).
C'mon. It's just three seconds.
Visitors clicking along (most of the pages are next-back linked, and thus all too easily scraped) don't notice the delay because the pages are study-oriented. Bots literally cannot run through them at too-high speeds. And if I spot a new offline-downloader or whatever doing its thing, I still have time to kick 'em before they're done.
I know a lot of you use throttling scripts based on a user's IP, etc., and I'd love to do the same. I looked into a score of similar programs and mod_bwshare [webmasterworld.com] would solve a boatload of my bandwidth problems, but alas, I can't get it to install. So for now, "sleep 3" works for me. FWIW.
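One caveat for anyone copying the sleep trick: the server has to be parsing those .html files for SSI, and exec has to be allowed. In Apache that means something like this in .htaccess (note that Options +IncludesNOEXEC would disable the exec part):

Options +Includes
AddHandler server-parsed .html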
This bot doesn't take your content; it's an Internet topology project that only looks at link networks.
BTW, when contacting suspected owners of bots, I've found it helps to write an emotionally neutral e-mail. Don't antagonize someone you want help from. I'm not accusing anyone of anything; it's just a hopefully helpful piece of advice.