Abusive IRLbot

Forum Moderators: open

Message Too Old, No Replies

Abusive IRLbot

Abusive behaviour is intentional

andye

8:12 am on Jul 13, 2006 (gmt 0)

Hi all,

Re IRLbot on 128.194.135.81: I've been in contact with the operators at Texas A & M.

They confirm that the maximum rate of this bot is 50 requests/second to a single IP address - so 4.3 million requests/day. They don't intend to change this behaviour.

I now block this bot via robots.txt and IP range blocking.

best, a.

[edited by: volatilegx at 6:06 pm (utc) on July 13, 2006]
[edit reason]
[1][edit reason] removed specifics and call to action [/edit] [/edit][/1]

nancyb

7:13 pm on Jul 13, 2006 (gmt 0)

This bot respects robots.txt so there isn't a need to block by IP also.

Pfui

8:47 pm on Jul 13, 2006 (gmt 0)

Apparently it can be spoofed and/or configured to not ask for robots.txt --

38.119.52.66 - - [06/Jun/2006:02:05:42 -0700] "GET / HTTP/1.1" 403 815 "-"
"IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]

IP address: 38.119.52.66
Reverse DNS: [No reverse DNS entry per auth1.dns.cogentco.com.]
Reverse DNS authenticity: [Unknown]

(Good old cogentco / Performance Systems International Inc. Again)

All of the legit IRLbot runs out of .cs.tamu.edu ask for robots.txt. (Don't know what it might do beyond that because I block it by UA).

andye

8:57 am on Jul 14, 2006 (gmt 0)

Many thanks to volatilegx for the edit, I was feeling rather irked when I posted, so didn't give enough thought to the ToS.

The reason I was feeling irked was that I'd just spoken by email to one of the bot's owners, and he'd instructed me rather patronisingly that we should upgrade our hardware to service the extra traffic he generates.

He says (and this is a PhD'd computer scientist, mind you) that his bot is merely simulating normal user behaviour - which is a joke.

nancyb, you're right, I've blocked by robots.txt and haven't bothered with an IP block. The reason I considered it is that this robot will fetch robots.txt at a rate of up to 50 requests/second, and we host many sites hence many robots.txt files.

best, a.

jdMorgan

2:27 pm on Jul 14, 2006 (gmt 0)

PhD computer scientists in academia apparently know little of the 'real world' of the Web, where the majority of sites are on massively-shared named-based virtual servers, rather than on their own dedicated server -- or on the university's publically-funded departmental computing center down the hall.

I've also observed that students are given little or no instruction on the 'ethics and ethos' of the Web -- Such things as complying with robots.txt, avoiding copyright infringement, rate-limiting requests, and other subjects are not introduced, apparently. Either that, or the knowledge level of the professors or the skill level of the students is too low to bother with coding such niceties. Their 'minor code simplification' becomes our Denial Of Service Attack.

Instead, an 'attitude of entitlement' is imbued: "It's the Web -- We can take what we want, any way we want!" I can't say that this attitude is limited to this little part of the academic world, though -- it's endemic in all aspects of modern life, unfortunately.

The only competent response I've ever gotten from contacting a school was from the CS department head at the University of Sydney (and thanks).

Note to current students: Search for "process queueing."

Plonk! 403-Bye

Jim

incrediBILL

4:05 pm on Jul 14, 2006 (gmt 0)

Um, guys, you may want to put down the pitch forks and torches.

According to their website it supports Crawl-Delay, which I don't see as terribly abusive if they honor that:

The crawler is by default rate-limited to one HTML page per website per minute; however, this metric may dynamically change during the crawl depending on the size and popularity of each site. Robots.txt can be used to override this behavior by specifying the minimum delay between visits (in seconds):
User-agent: IRLbot
Crawl-delay: 100

[irl.cs.tamu.edu...]

When they hit my site a couple of days ago they were slightly slower than the default speed the website claims:

07/14/2006 03:47:38 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 03:49:51 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 04:22:14 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 05:38:13 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 05:41:35 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 07:59:49 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]
07/14/2006 08:07:35 128.194.135.81 "IRLbot/2.0 (compatible; MSIE 6.0; [irl.cs.tamu.edu...]

Doesn't matter, I'm not letting them in as I see no value in this silly little bot.

[edited by: incrediBILL at 4:07 pm (utc) on July 14, 2006]

Romeo

4:37 pm on Jul 14, 2006 (gmt 0)

User-agent: IRLbot
Crawl-delay: 100

This is nice, but I don't have the time to research every small silly/rude/annoying/whatever bot.

Finally, after years of exercising some sort of ridiculous tolerance, I have changed my robots.txt to allow a few handselected bots only, which seem useful (bots which are feeding public search engines, like google, yahoo, msn, ask), and dis-allow all others.
Don't know what all the other bots really are for, but I see no reason for them to take my content at my bandwidth's expenses.
And for those abusive bots not complying with robots.txt, there are some additional .htaccess rules and a bot trap.
Case closed, and my sites' logs look much cleaner now.

Kind regards,
R.

Staffa

8:09 pm on Jul 14, 2006 (gmt 0)

Don't know what all the other bots really are for, but I see no reason for them to take my content at my bandwidth's expenses.
And for those abusive bots not complying with robots.txt, there are some additional .htaccess rules and a bot trap.
Case closed, and my sites' logs look much cleaner now.

Amen to that!

incrediBILL

10:27 pm on Jul 14, 2006 (gmt 0)

This bot doesn't take your content, it's an internet topology project that only looks at link networks.

This is certainlay a bunch of opinionated people as you pass judgement without understanding the nature of the beast.

WHO AM I TO TALK .... I have EVERYTHING blocked.... BWAHAHAHAHA!

Pfui

2:56 am on Jul 15, 2006 (gmt 0)

OT... Bill, dear. Time for your meds. Come along now. : )

andye

4:01 pm on Jul 15, 2006 (gmt 0)

supports Crawl-Delay

Only per-domain-name, not per-IP-address. So if you're hosting lots of domain names then it'll still crawl at a maximum rate of 50 hits/second (across all the different sites), even if you do set Crawl-Delay. That's my understanding.

pass judgement without understanding the nature of the beast

Well, I have exchanged emails with one of the bot's authors (totally unhelpful), and the network admins at Texas A & M (helpful). So I'm not saying I've seen the bot's source or anything like that, but I have got a decent picture of what they're up to.

'attitude of entitlement'

That's certainly the author's attitude. I've felt tempted to send him an invoice for the extra resources he's told me we ought to buy in order to cater for him.

All the best, a.

incrediBILL

6:18 pm on Jul 15, 2006 (gmt 0)

So if you're hosting lots of domain names then it'll still crawl at a maximum rate of 50 hits/second

That may be true, but if you had 100 domains and this crawler each is set to crawl-delay 100 in each domain, then even asking for 50 hits/second it could only ask for 100 pages in 100 seconds tops which would throttle it somewhat.

I'd probably set it to crawl delay 3600 or higher just for giggles ;)

They really need to rethink 50 hits/second though as that can bring some dynamic sites, like an ecommerce server, to their knees. Assuming it takes 1 second to respond to the request with complex backend database processing and such, it would barely process 50 requests in a second. Now assuming lots of visitors and other search engines also making casual requests at the same time, it would start to lose the battle and requests would build up a backlog until the machine was 'hung' until it could process out all of those pending requests.

Basically, the box would appear frozen to visitors and search engines at that time until the queue clears.

I know this happens as that's what a certain Chinese crawler kept doing to my site making 100s of requests a second. That's why I now throttle sites from making rapid requests and just give them a timeout instead of letting them chew up the CPU crunching out dynamic pages.

Pfui

8:49 pm on Jul 15, 2006 (gmt 0)

I found Crawl-delay so infrequently respected, by Google for example, that I got tired of tweaking the setting to no avail and got radical...

I added --

<!--#exec cmd="sleep 3"-->

-- atop every daily-spidered .html page (about 140 or so).

C'mon. It's just three seconds.

Visitors clicking along (most of the pages are next-back linked, and thus all too easily scraped) don't notice the delay because the pages are study-oriented. Bots literally cannot run through them at too-high speeds. And if I spot a new offline-downloader or whatever doing its thing, I still have time to kick 'em before they're done.

I know a lot of you use throttling scripts based on a user's IP, etc., and I'd love to do the same. I looked into a score of similar programs and mod_bwshare [webmasterworld.com] would solve a boatload of my bandwidth problems, but alas, I can't get it to install. So for now, "sleep 3" works for me. FWIW.

helleborine

2:28 pm on Jul 28, 2006 (gmt 0)

This bot is crawling my forum like there's no tomorrow. It's all over the place, all the time.

Makes it look like I've got tons of visitors!

helleborine

12:49 am on Aug 4, 2006 (gmt 0)

And it's crawling it daily. That's a bot with nothing better to do.

Ocean10000

7:28 pm on Aug 4, 2006 (gmt 0)

Yes this bot is a pain. I started doing 301 Redirects back to its listed website. Let them start feeling the effect of there own bot.

smells so good

5:20 am on Aug 5, 2006 (gmt 0)

This bot doesn't take your content, it's an internet topology project that only looks at link networks.

It may not collect my dynamic page content, but I have a timer on each page, as well as a completion flag for the page load. My stats are telling me that every page is getting loaded. In other words, it appears that the bot is issuing a GET request instead of a HEAD request. My logs should confirm that, I haven't looked yet. First I need to update robots.txt

volatilegx

2:00 pm on Aug 5, 2006 (gmt 0)

It would have to issue a GET request and read the entire page in order to parse it for links...

jmccormac

3:05 pm on Aug 5, 2006 (gmt 0)

Crawl-delay: Infinity + 1

Most of these operations are a waste of bandwidth and a considerable drain on resources for a large site. They are only slightly above the scraper scum category.

Regards...jmcc

GaryK

3:26 pm on Aug 5, 2006 (gmt 0)

IMO Texas A&M has a well-deserved reputation for setting loose all sorts of badly behaved bots. It got so bad a few months ago that I added TAMU-NET to my list of banned IP Addresses.

BTW, when contacting suspected owners of bots I've found it helps to write an emotionally neutral e-mail. Don't antagonize someone that you want help from. I'm not accusing anyone of anything. It's just a hopefully helpful piece of advice.

Abusive IRLbot

Abusive behaviour is intentional

andye

nancyb

Pfui

andye

jdMorgan

incrediBILL

Romeo

Staffa

incrediBILL

Pfui

andye

incrediBILL

Pfui

helleborine

helleborine

Ocean10000

smells so good

volatilegx

jmccormac

GaryK

Join The Conversation

Moderators and Top Contributors

Hot Threads This Week