Googlebot killed my site! - used up all my bandwidth

What's your record bandwidth to GBot?!

surfgatinho

10:50 am on Feb 8, 2006 (gmt 0)

Woke up to find one of my sites had run out of bandwidth. Assuming a DoS attack, I looked at the logs and found it was Google, and they'd trounced me!

5646 hits - 1498227Kb - 66.249.65.230
And that was over three days!

Yes the site is dynamic, but the structure has been unchanged for the last 6 months. So why now?

The Contractor

12:02 pm on Feb 8, 2006 (gmt 0)

Google often hits this many pages on a site I have. I'm sure others with even larger sites get crawled even more.

On 02/03/06 Google crawled 3458 pages. This is the most in one day so far this month. Normally it hits between 200-1000 pages a day.

The Contractor

12:30 pm on Feb 8, 2006 (gmt 0)

5646 hits - 1498227Kb

hmm...unless my math is bad 1498227/5646 = 265KB average page size...are your pages really that huge?

victor

12:40 pm on Feb 8, 2006 (gmt 0)

Googlebot did that to me twice.

The first time, I wrote to them requesting that they rein in their bot or be banned. They wrote back apologising.

The second time, they never replied.

Google's not been the only bad bot to hit the sites, so the sites are now protected by flood control. Any bot that runs wild gets an automatic ban of 10 minutes to 72 hours (depending on the degree of wildness), and there are no exceptions for bots like Google.

Any pages they grab while banned are simply a short flood-control note explaining that the bot that indexed the page is behaving badly. At any one time, I can find several of my pages in several search engines by looking for the appropriate phrase from my flood-control text.

This is not, of course, cloaking -- any user who hits the F5 key enough times will start getting the flood-control page.

I do something similar with RSS feeds too: there are some badly behaved readers that ignore the <skipHours> tag and want to revisit multiple times an hour.
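
A minimal sketch of this sort of flood control in PHP -- the state file, window, threshold, and ban length are all illustrative assumptions, not victor's actual setup:

    <?php
    // Per-IP flood control: count hits in a sliding window and hand out
    // a temporary ban when the limit is exceeded.
    $ip     = $_SERVER['REMOTE_ADDR'];
    $file   = '/tmp/flood-' . md5($ip);  // hypothetical per-IP state file
    $window = 60;                        // seconds per counting window
    $limit  = 30;                        // max hits allowed per window
    $banFor = 600;                       // 10-minute ban; scale up for repeat offenders

    $state = @file_get_contents($file);
    list($start, $hits, $banUntil) = $state ? explode('|', $state) : array(time(), 0, 0);

    if ($banUntil && time() < $banUntil) {
        // Serving a 503 means well-behaved bots retry later instead of
        // indexing the flood-control note.
        header('HTTP/1.1 503 Service Unavailable');
        header('Retry-After: ' . ($banUntil - time()));
        exit('Flood control: the bot that fetched this page is misbehaving.');
    }

    if (time() - $start > $window) { $start = time(); $hits = 0; }
    $hits++;
    if ($hits > $limit) { $banUntil = time() + $banFor; }

    file_put_contents($file, "$start|$hits|$banUntil");
    // ...normal page generation continues below.
    ?>

Serving the note with a 503 rather than a 200 is one way to avoid the indexed-flood-text effect described above; victor's notes evidently went out with a normal status, which is why they ended up in the search engines.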

BillyS

5:53 pm on Feb 8, 2006 (gmt 0)

>>hmm...unless my math is bad 1498227/5646 = 265KB average page size...are your pages really that huge?

Contractor's got a good point. If your pages really are that large (detailed pictures perhaps?), you might want to consider buying more bandwidth. How many pages of this size do you have (or perhaps you've got one large page that is pushing you over...).

surfgatinho

8:00 pm on Feb 8, 2006 (gmt 0)

I don't think I have any pages of that size; they're all around 11KB.

rainborick

8:42 pm on Feb 8, 2006 (gmt 0)

There's a form at [google.com...] that lets you request a slowdown in the crawl rate. The FAQ mentions including a portion of your server logs that shows Googlebot overtaxing your server.

Animated

1:09 am on Feb 9, 2006 (gmt 0)

I think a robots.txt would help :)

ALbino

4:36 am on Feb 9, 2006 (gmt 0)

Try blocking images.
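
Putting the last two suggestions together, a robots.txt sketch -- the /images/ path is an assumption, so adjust it to the site's real layout:

    # Keep Google's image spider off the site entirely
    User-agent: Googlebot-Image
    Disallow: /

    # Keep all other spiders out of the (assumed) image directory
    User-agent: *
    Disallow: /images/

Blocking images only trims image bandwidth, of course; if the HTML pages themselves are being re-fetched, that traffic is unaffected.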

derkeiler

5:51 am on Feb 9, 2006 (gmt 0)

5646 hits - 1498227Kb - 66.249.65.230
And that was over three days!

To be honest, I don't see why this hit rate is overloading your site: 5,646 hits over three days (roughly 259,200 seconds) works out to about one page fetched every 45 seconds.

For my part, I'm quite happy when Google is spidering my site heavily, because that's a sign that the site is popular.

The last time Google overloaded my site (one year ago), at a rate of 30 pages/second, I decided to upgrade my hardware to a cluster (15 machines) to solve the problem.

Every visit from Googlebot raises the chance of getting more visitors.

AlexK

6:36 am on Feb 9, 2006 (gmt 0)

A dynamic site will not, by default, respond to If-Modified-Since and other conditional request headers, meaning that unchanged pages appear brand-new to the spider and are re-fetched in full every time, wasting bandwidth.

If it's a PHP site, have a look at this thread for a Content-Negotiation Class [webmasterworld.com]. It's easy to implement and will fix the above problem for good. Some support on the Class is also available on this and the following pages [webmasterworld.com].
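
The underlying idea, sketched in plain PHP -- this is just the generic conditional-GET pattern, not the actual Class from the linked thread, and deriving $lastModified from a single file is an assumption (in practice you would compute it from your database):

    <?php
    // Honour conditional requests on a dynamic page: advertise when the
    // content last changed, and answer 304 with no body if the client's
    // cached copy is still current.
    $lastModified = filemtime('/path/to/content-source');  // hypothetical source

    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

    if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
        $since = strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']);
        if ($since !== false && $since >= $lastModified) {
            header('HTTP/1.1 304 Not Modified');
            exit;  // no body: the spider re-uses its cached copy
        }
    }

    // ...otherwise build and send the full page as normal.
    ?>

A 304 response carries no body at all, so a spider that re-checks an unchanged page costs a few hundred bytes of headers instead of the full page.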

AlexK

7:09 am on Feb 9, 2006 (gmt 0)

derkeiler:
last time google overloaded my site ... a rate of 30 pages/second ... upgrade my hardware to a cluster (15 machines) to solve this problem.

If your staff is measured in hundreds, or you run an extremely popular forum (or several), or a network of scraper sites (or whatever), then it is plausible that your site produces enough new pages each day that G needs to take 30 pages a second.

Any other scenario means that your site is misconfigured (see msg#11), and a few hours' work would have fixed the problem for free.

i'm quite happy when google is spidering my site heavily, because that's a sign that the site is popular

That is certainly possible but, for the sake of sanity, I would suggest that you also consider other possibilities.

surfgatinho

10:53 am on Feb 9, 2006 (gmt 0)

Thanks for that, Alex -- I'll have a look.

trillianjedi

11:00 am on Feb 9, 2006 (gmt 0)

Check you're not serving Googlebot with session IDs....
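
On a PHP site the usual culprit is the transparent session ID (?PHPSESSID=...) that gets rewritten into every link, making each page look unique to a spider. A sketch of switching it off -- these are the standard session settings, shown here via ini_set before the session starts:

    <?php
    // Don't accept session IDs from the URL, and don't rewrite them
    // into links -- spiders then see one stable URL per page.
    ini_set('session.use_only_cookies', '1');
    ini_set('session.use_trans_sid', '0');
    session_start();
    ?>

The same two settings can also go in php.ini, or in .htaccess via php_flag lines, instead of in the script.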