MSNbot aggressively over-indexing dynamic sites?

9 gigs of bandwidth in an hour!

         

Miss_Lynx

11:41 pm on Apr 10, 2005 (gmt 0)



A client of mine recently got a message from their web host saying they'd gone way over their monthly bandwidth allotment, only 1 week into the month... to the tune of more than $2000 in overusage charges!

A bit of investigating turned up that the MSNbot had been hitting the site very aggressively for the past several days, to the point that this one bot was accounting for 90% of the site's traffic. The site is dynamic, with a PHP front end on a PostgreSQL database. Then, late one night, the bot basically went crazy and started querying the database over and over, racking up over 9 gigs of bandwidth in a single hour before it calmed down.
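
For anyone wanting to run the same kind of check on their own logs, here is a minimal sketch of how a per-bot share of bandwidth could be tallied. It assumes the server writes the Apache combined log format; the two sample lines, the IPs, and the helper names bytes_by_agent and share are made up for illustration and are not from the client's logs.

```python
import re
from collections import Counter

# Combined log format ends with: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"')


def bytes_by_agent(lines):
    """Sum response bytes per user-agent string (hypothetical helper)."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        size = 0 if match.group(2) == "-" else int(match.group(2))
        totals[match.group(3)] += size
    return totals


def share(totals, needle):
    """Fraction of all bytes served to agents whose name contains `needle`."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    hit = sum(n for agent, n in totals.items() if needle.lower() in agent.lower())
    return hit / total


# Illustrative sample lines only, not real log data.
sample = [
    '10.0.0.1 - - [10/Apr/2005:23:00:01 +0000] "GET /item.php?id=1 HTTP/1.1" '
    '200 9000 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '10.0.0.2 - - [10/Apr/2005:23:00:02 +0000] "GET /index.php HTTP/1.1" '
    '200 1000 "-" "Mozilla/5.0"',
]
totals = bytes_by_agent(sample)
msn_share = share(totals, "msnbot")
print(f"msnbot share of bytes: {msn_share:.0%}")
```

On a real log you would pass the open file object instead of the sample list.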

I've now banned it in the robots.txt file. It turns out the $2000 charge was an estimate, based on projecting the traffic so far this month across the whole month, so the amount they'll have to pay is dropping daily now that the spike has passed. Still, there's no way they won't be looking at some kind of extra charges this month.
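
For reference, a ban like that can be sanity-checked before uploading. The sketch below uses Python's standard urllib.robotparser; the robots.txt text is my guess at what such a ban looks like (not the actual file from the site), and the URL path is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: block msnbot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical dynamic URL on the site.
msn_allowed = parser.can_fetch("msnbot", "/catalog.php?id=42")
other_allowed = parser.can_fetch("Googlebot", "/catalog.php?id=42")
print("msnbot allowed:", msn_allowed)
print("other bots allowed:", other_allowed)
```

Note that this only shows what a compliant crawler would conclude from the file; it blocks that bot from indexing entirely rather than slowing it down.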

And the really bizarre thing is that the site it did this to hasn't even launched yet - it hasn't been submitted to any search engines and isn't linked to from anywhere that I know of, so I don't know how search engine bots are finding their way in (other bots have been indexing it too, but none of them have gone nuts the way the MSNbot did).

A friend pointed me to a thread here from last year in which several people mentioned having had problems with the MSNbot being over-aggressive in crawling both static and dynamic sites, but that thread is locked and there don't seem to have been any other discussions on the topic here lately that I can find.

Are other people here still having this problem? Does anyone know if there's a way to prevent what happened to that site and still get indexed normally? Getting spidered by search engine bots is all very well and good, but it shouldn't amount to a DoS attack, or run you up hundreds of dollars in extra charges...

blacknight

1:33 am on Apr 11, 2005 (gmt 0)

10+ Year Member



A number of our clients have had issues with the MSN Bot and dynamic sites where it would keep running queries and following dynamic links ad infinitum.

robho

8:54 pm on Apr 17, 2005 (gmt 0)

10+ Year Member



"Does anyone know if there's a way to prevent what happened to that site and still get indexed normally?"

I was hit by several aggressive bots last year (including the Googlebot, and to a lesser extent MSN), on a new site which has over a million dynamic pages.

I installed mod_throttle. This is for an Apache server (I assume there's something similar for the Microsoft server).

This allows you to set limits on how many pages any IP may request over a given time. After a bit of experimenting I set it to "no more than 2700 document requests in 45 minutes". All well-behaved bots and humans are under that limit.
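
To make the idea concrete, here is a rough sketch of a per-IP sliding-window limit with the same numbers. This is not how mod_throttle is implemented or configured; the class name SlidingWindowLimiter and its methods are invented for illustration, and it only shows the general technique of counting requests per address over a time window.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per IP within any `window_seconds` span."""

    def __init__(self, max_requests=2700, window_seconds=45 * 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now):
        """Return True and record the hit if `ip` is under its limit at `now`."""
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Tiny limits so the behaviour is easy to see: 3 requests per 10 seconds.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)
results = [limiter.allow("192.0.2.1", t) for t in (0, 1, 2, 3, 11)]
print(results)
```

The fourth request is refused because three hits are already inside the window; by second 11 the oldest hits have expired, so requests are accepted again.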

As a side-effect, it also helps slow people attempting to copy your entire site, if the site is very large.

The site still gets indexed fine: it's just coming up on a million different pages indexed on Google after 6 months (but only a few thousand in MSN; they have breadth but little depth).

Span

9:16 pm on Apr 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"...racking up over 9 gigs of bandwidth in a single hour before it calmed down."

crawl-delay could help too:
[webmasterworld.com ]
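
A quick sketch of what that would look like, checked with Python's standard urllib.robotparser. The 10-second value is just an example I picked, not a recommendation, and this only helps if the bot actually honors the directive.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: ask msnbot to wait 10 seconds between requests,
# without disallowing anything. The delay value is illustrative.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("msnbot")
max_requests_per_hour = 3600 // delay
print("crawl delay (s):", delay)
print("upper bound on requests per hour:", max_requests_per_hour)
```

With a delay like this, a compliant crawler is capped at a few hundred requests an hour instead of hammering the database, and the site stays open to indexing.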

zeus

7:26 pm on Apr 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have 4 pure HTML sites and 2 dynamic sites. The pure HTML sites have only 5-10% of their pages indexed, while the dynamic ones are fully indexed, so yes, MSN likes dynamic sites.