A bit of investigating turned up the fact that MSNbot had been hitting the site (a dynamic site based on a PostgreSQL database with a PHP front end) very aggressively for the past several days; that bot alone was accounting for 90% of the site's traffic. Then, late one night, it basically went crazy and started querying the database over and over, racking up over 9 gigs of bandwidth in a single hour before it calmed down.
I've now banned it in the robots.txt file, and it turns out the $2000 charge was based on averaging the traffic so far this month out over the whole month, so the estimate of what they'll have to pay is dropping daily now that the spike's past, but there's no way they won't be looking at at least some extra charges this month.
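In case it helps anyone else, the ban itself is just the standard robots.txt block (the bot identifies itself as "msnbot"):

User-agent: msnbot
Disallow: /

Once the site actually launches I'll probably swap the Disallow for a Crawl-delay line (e.g. Crawl-delay: 10), which msnbot is supposed to honor, so it can still index the site without hammering it.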
And the really bizarre thing is that the site it did this to hasn't even launched yet - it hasn't been submitted to any search engines and isn't linked to from anywhere that I know of, so I don't know how search engine bots are finding their way in (other bots have been indexing it too, but none of them have gone nuts the way the MSNbot did).
A friend pointed me to a thread here from last year in which several people mentioned having had problems with the MSNbot being over-aggressive in crawling both static and dynamic sites, but that thread is locked and there don't seem to have been any other discussions on the topic here lately that I can find.
Are other people here still having this problem? Does anyone know if there's a way to prevent what happened to that site and still get indexed normally? Getting spidered by search engine bots is all very well and good, but it shouldn't amount to a DoS attack or run you up hundreds of dollars in extra charges...
Does anyone know if there's a way to prevent what happened to that site and still get indexed normally?
I was hit by several aggressive bots last year (including Googlebot and, to a lesser extent, MSN) on a new site with over a million dynamic pages.
I installed mod_throttle. This is for an Apache server (I assume there's something similar for the Microsoft server).
This allows you to set limits on how many pages any IP may request over a given time. After a bit of experimenting I set it to "no more than 2700 document requests in 45 minutes". All well-behaved bots and humans are under that limit.
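If you can't install Apache modules on your host, you can do roughly the same thing in the PHP front end itself. This is just a rough sketch, not what I actually run - the limit, the window, the temp-file location and the 503 response are all placeholder choices, and it ignores race conditions between simultaneous requests - but it shows the idea of counting requests per IP over a time window:

<?php
// Per-IP throttle sketch. Hypothetical numbers: 2700 requests per 45 minutes.
$limit  = 2700;
$window = 45 * 60; // seconds
$ip     = $_SERVER['REMOTE_ADDR'];
$file   = sys_get_temp_dir() . '/throttle_' . md5($ip);

$now  = time();
$data = @file_get_contents($file);
list($start, $count) = $data ? explode(':', $data) : array($now, 0);

if ($now - (int)$start > $window) { // window expired, start a new one
    $start = $now;
    $count = 0;
}
$count = (int)$count + 1;

if ($count > $limit) { // over the limit: refuse the request
    header('HTTP/1.1 503 Service Unavailable');
    header('Retry-After: ' . $window);
    exit('Too many requests');
}

file_put_contents($file, $start . ':' . $count);
// ...normal page generation continues below...
?>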
As a side-effect, it also helps slow people attempting to copy your entire site, if the site is very large.
The site still gets indexed fine - just coming up on a million different pages indexed on Google from that site after 6 months (but only a few thousand in MSN; they have breadth but little depth).