Forum Moderators: open
Just a heads-up that BecomeBot, IP range 64.124.85.0 to 64.124.85.255, is very aggressive in crawling multiple sites hosted on the same IP address.
We've been seeing bursts of requests that I think are too fast, i.e. over ten requests per second sustained for a while.
Typical logfile entry looks like this:
64.124.85.80 - - [24/May/2005:11:52:05 +0100] "GET /robots.txt HTTP/1.1" 200 123 "-" "Mozilla/5.0 (compatible; BecomeBot/1.86; MSIE 6.0 compatible; +http://www.become.com/site_owners.html)"
It does seem to honour robots.txt.
Best wishes, Andy.
[webmasterworld.com...]
I've had it denied from the start and yet it still continues to visit and eat 403's.
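For anyone wanting to do the same, a deny like Don describes can be set up in Apache with an .htaccess entry along these lines (just a sketch, not a claim about how Don's block is configured; the /24 covers the 64.124.85.0 to 64.124.85.255 range mentioned above):

```apache
# Return 403 Forbidden to the whole BecomeBot range
Order Allow,Deny
Allow from all
Deny from 64.124.85.0/24
```

As Don notes, the bot will still make requests and simply receive 403s.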
Don
Regarding its aggressiveness, the following is a quote from become.com/site_owners.html:
You can control the rate at which your site is crawled by using the Crawl-Delay feature. The Crawl-Delay feature allows you to specify the number of seconds between visits to your site. Note that it may take quite a long time to crawl a site if there are many pages and the Crawl-Delay is set high. You could specify an interval of 30 seconds between requests with an entry like this:

User-agent: BecomeBot
Crawl-Delay: 30
Disallow: /cgi-bin
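As an aside, a well-behaved crawler written in Python could honour an entry like that via the standard library's urllib.robotparser, which understands Crawl-Delay (since Python 3.6); a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt entry quoted above.
robots_txt = """\
User-agent: BecomeBot
Crawl-Delay: 30
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler would sleep this many seconds between requests...
delay = parser.crawl_delay("BecomeBot")        # 30

# ...and skip anything under /cgi-bin entirely.
ok = parser.can_fetch("BecomeBot", "/cgi-bin/test.pl")   # False
```

(In a real crawler you'd fetch the live file with parser.set_url(...) and parser.read() rather than parsing a string.)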
My reasoning is: we were receiving requests in rapid series, but not for the same site. E.g. we'd get multiple requests in one second, each for a different site hosted on the same IP address. As this interval is far shorter than any sensible default delay for a bot, I'm guessing they've got a particular bug: they're counting the delay between requests per hostname, rather than per IP address.
And the reason I haven't tested this to see what happens is:
1) I think it's up to them to set reasonable default values - I'm not willing to spend time working around their mistakes.
2) Their bot was causing performance degradation for our sites, so I was keen to just block it off as quickly as possible.
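To make the suspected bug concrete, here's a toy Python sketch (hypothetical names and numbers, not BecomeBot's actual code) of a crawler throttle keyed per hostname versus per server IP. With several virtual hosts on one IP, the per-hostname version waves the whole rapid series through:

```python
class Throttle:
    """Remembers the last request time per key and reports whether a
    new request would respect a minimum delay between requests."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self.last_seen = {}

    def allowed(self, key, now):
        last = self.last_seen.get(key)
        self.last_seen[key] = now
        return last is None or (now - last) >= self.min_delay

# Three vhosts on one server IP, hit within half a second (made-up data).
hits = [("site-a.example", "203.0.113.7", 0.0),
        ("site-b.example", "203.0.113.7", 0.2),
        ("site-c.example", "203.0.113.7", 0.4)]

buggy = Throttle(min_delay=10)   # keyed by hostname
fixed = Throttle(min_delay=10)   # keyed by server IP

buggy_results = [buggy.allowed(host, t) for host, ip, t in hits]
fixed_results = [fixed.allowed(ip, t) for host, ip, t in hits]

print(buggy_results)   # [True, True, True]  - every vhost looks "fresh"
print(fixed_results)   # [True, False, False] - the shared IP is protected
```

That matches the symptom in the logs: each individual site sees a polite interval, but the server as a whole gets hammered.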
Best wishes, Andy.