
BecomeBot - nasty little critter, very aggressive

Crawling too quickly.

         

andye

11:04 am on May 24, 2005 (gmt 0)

10+ Year Member



Hi all,

Just a heads-up that BecomeBot, IP range 64.124.85.0 to 64.124.85.255, is very aggressive in crawling multiple sites hosted on the same IP address.

We've been seeing runs of requests that I think are too fast, i.e. over ten requests per second sustained for a while.

Typical logfile entry looks like this:

64.124.85.80 - - [24/May/2005:11:52:05 +0100] "GET /robots.txt HTTP/1.1" 200 123 "-" "Mozilla/5.0 (compatible; BecomeBot/1.86; MSIE 6.0 compatible; +http://www.become.com/site_owners.html)"

It does seem to honour robots.txt.

Best wishes, Andy.

wilderness

5:06 pm on May 24, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This thing is a pest to everybody.

[webmasterworld.com...]

I've had it denied from the start and yet it still continues to visit and eat 403's.
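For anyone else wanting to shut it out at the server rather than trusting robots.txt, something along these lines should cover the IP range Andy quoted (a sketch assuming Apache with mod_access; adapt to your own setup):

```
# .htaccess - refuse BecomeBot's published range, 64.124.85.0-255
order allow,deny
allow from all
deny from 64.124.85.
```

It'll still show up in the logs eating 403s, but those cost almost nothing to serve.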

Don

andye

5:36 pm on May 24, 2005 (gmt 0)

10+ Year Member



oh whoops - sorry for the dupe thread. My bad.

guitaristinus

12:03 am on May 25, 2005 (gmt 0)

10+ Year Member



I've been letting it crawl my sites. I've got bandwidth to spare at the moment. Their site, become.com, may bring in some traffic.

Regarding its aggressiveness, the following is a quote from become.com/site_owners.html:

You can control the rate at which your site is crawled by using the Crawl-Delay feature. The Crawl-Delay feature allows you to specify the number of seconds between visits to your site. Note that it may take quite a long time to crawl a site if there are many pages and the Crawl-Delay is set high. You could specify an interval of 30 seconds between requests with an entry like this:

User-agent: BecomeBot
Crawl-Delay: 30
Disallow: /cgi-bin
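Worth noting what that warning about crawl time means in practice. At a 30-second delay, even a modest site takes days to cover - e.g. for a hypothetical 10,000-page site:

```python
pages = 10_000            # hypothetical site size, not from their docs
delay = 30                # seconds between requests, per the robots.txt above
seconds = pages * delay
days = seconds / 86_400   # 86,400 seconds in a day
print(round(days, 2))     # roughly three and a half days for one full crawl
```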

andye

8:42 am on May 25, 2005 (gmt 0)

10+ Year Member



Hi guitaristinus - on this issue of aggressiveness: I noticed their delay between requests was configurable - I haven't tried it, but I'm guessing that it wouldn't work for multiple sites hosted on the same IP address.

My reasoning is: we were receiving requests in rapid succession, but not for the same site - e.g. multiple requests within a single second, each for a different site. Since that interval is far shorter than any sensible default delay for a bot, I'm guessing they've got a specific bug: they're measuring the delay between requests per hostname, rather than per IP address.
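To make that concrete, here's a toy sketch (hypothetical hostnames, obviously not their actual code) of the difference between keying the crawl delay on hostname versus on IP address:

```python
from collections import defaultdict

class CrawlScheduler:
    """Toy rate limiter illustrating the suspected bug: enforcing the
    delay per hostname lets a crawler hammer a server that hosts many
    virtual hosts on a single IP address."""

    def __init__(self, delay, key_by_ip):
        self.delay = delay            # seconds required between requests per key
        self.key_by_ip = key_by_ip    # True = polite (key on server IP)
        self.last_fetch = defaultdict(lambda: float("-inf"))

    def may_fetch(self, host, ip, now):
        key = ip if self.key_by_ip else host
        if now - self.last_fetch[key] >= self.delay:
            self.last_fetch[key] = now
            return True
        return False

# Ten vhosts on one IP, all requested at the same instant (t=0):
hosts = [f"site{i}.example" for i in range(10)]
per_host = CrawlScheduler(delay=30, key_by_ip=False)
per_ip = CrawlScheduler(delay=30, key_by_ip=True)

buggy = sum(per_host.may_fetch(h, "64.124.85.80", now=0) for h in hosts)
polite = sum(per_ip.may_fetch(h, "64.124.85.80", now=0) for h in hosts)
print(buggy, polite)
```

Keyed per hostname, all ten vhosts get hit in the same instant; keyed per IP, only one request gets through and the rest wait.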

And the reason I haven't tested this to see what happens is:
1) I think it's up to them to set reasonable default values - I'm not willing to spend time working around their mistakes.
2) Their bot was causing performance degradation for our sites, so I was keen to just block it off as quickly as possible.

Best wishes, Andy.