
How can I slow Googlebot down?

Googlebot is overloading my site


Unversed

10:04 am on Dec 8, 2002 (gmt 0)

10+ Year Member



I have a luxury problem. I think I'm being deep crawled by Google with multiple instances of Googlebot hitting tens of thousands of pages. Great.

My webhost has suspended my site because they say that multiple instances of the perl scripts that serve the pages are consuming too much resource on the server. Not great.

I changed robots.txt to restrict the kinds of pages that get spidered, but that still leaves many thousands, and in any case my host suspended the site again before Google revisited robots.txt.
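For reference, the kind of robots.txt restriction described here might look something like the following sketch (the paths are illustrative, not from the original post). Note that the non-standard Crawl-delay directive is honoured by some crawlers but ignored by Googlebot, so slowing Google itself means contacting them directly, as happens later in this thread.

```text
# robots.txt -- keep spiders out of the script-generated areas
# (paths are hypothetical examples)
User-agent: *
Disallow: /cgi-bin/
Disallow: /search/

# Honoured by some other crawlers, but NOT by Googlebot
Crawl-delay: 2
```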

Obviously I want my site spidered but not at the expense of closing the site. Obviously I want my site to stay up but not at the expense of frightening off Google.

Any ideas?

kfander

2:07 pm on Dec 8, 2002 (gmt 0)

10+ Year Member



I think I'd find a new webhost before I restricted Google, but that's me.

Dreamquick

2:19 pm on Dec 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can I make a suggestion?

At the moment you sound very vague about what you think the cause is - unless you actually know what the problem is, there is no really effective way to offer help. It could be a search engine going hyper (in which case you need to contact them), people copying your site (often lots of requests at high speed), a bug in certain scripts, or anything in between!

Review the site logs - what was really hitting your site the heaviest over the period you are interested in? It might be Googlebot, it might not...

The key things to look for are which pages are being hit, how fast they are being hit and what is requesting them (User-Agent & IP). This way you might be able to narrow down the problem to one or two types of script which are the major resource drains.
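As a rough sketch of the log review described above, a small script can tally requests per IP and User-Agent from an Apache combined-format access log. The field layout and the sample lines are assumptions; adjust the regex for your own log format.

```python
import re
from collections import Counter

# Apache Combined Log Format:
# ip ident user [timestamp] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def tally(lines):
    """Count requests per (IP, User-Agent) pair; skip unparseable lines."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            counts[(m.group("ip"), m.group("agent"))] += 1
    return counts

# Hypothetical sample entries for illustration
sample = [
    '64.68.82.1 - - [08/Dec/2002:10:04:01 +0000] "GET /page1.pl HTTP/1.0" 200 1234 "-" "Googlebot/2.1"',
    '64.68.82.1 - - [08/Dec/2002:10:04:01 +0000] "GET /page2.pl HTTP/1.0" 200 1187 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [08/Dec/2002:10:04:02 +0000] "GET / HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]
for (ip, agent), n in tally(sample).most_common():
    print(f"{n:6d}  {ip}  {agent}")
```

Sorting the tally by count immediately shows which client is hammering the site hardest, which is the first question to answer before blaming any particular bot.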

Assuming there is no readily apparent cause (i.e. it might only have been a small amount of traffic that caused your problems), your next step would be to stress-test a local version of your site with the same level of traffic you saw in your logs (using stress-test utilities), monitoring CPU usage and disk access.

From here you can again start to narrow it down and hopefully you will eventually find the root cause of your problems.

- Tony

Unversed

2:40 pm on Dec 8, 2002 (gmt 0)

10+ Year Member



Apologies if I wasn't specific enough.

Googlebot crawlers are taking 5 or 6 pages a second from my site. The logs show page after page of Googlebot requests before you see a line from another spider or - heaven help us - a real visitor. I'm not sure my host is being unreasonable, considering the other sites that are sharing the server. I thought I saw somewhere that Googlebot would not hit more than once every 2 seconds, which would probably be OK.

I have emailed Google but it's Sunday...

mack

3:21 pm on Dec 8, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Make a quick check of your logs to make sure it really is Google - last month I had a rogue bot masquerading as User-Agent Googlebot. It seems like very strange behaviour for the real Googlebot.
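One way to check for a faked User-Agent, which Google itself recommends, is double reverse DNS: reverse-resolve the requesting IP, check the hostname suffix, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch (the suffix check is split into a pure helper so it can be exercised without network access):

```python
import socket

def host_matches(host):
    """Pure suffix check: genuine Googlebot hosts resolve under
    googlebot.com or google.com."""
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

def is_real_googlebot(ip):
    """Verify a claimed Googlebot IP by double reverse DNS."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse lookup
    except socket.herror:
        return False
    if not host_matches(host):
        return False
    try:
        # forward lookup must map the name back to the original IP
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

The suffix check alone is not enough - anyone can name a machine `googlebot.com.evil.example` - which is why the forward-resolution step matters.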

jomaxx

4:47 pm on Dec 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google is only one of many spiders in the www ecosystem, and it's one of the better-behaved ones. I can assure you that as your site builds more incoming links, this problem will arise more and more often.

Here are some ideas that might help:

- Optimize your Perl scripts or - much, much better - rewrite them as C programs for a 95% performance boost. PHP is probably also faster and more resource-efficient, but I don't have first-hand experience with it.

- Don't have plain HREF links to scripts unless you want the output to be indexed. Many spiders ignore the robots.txt file, but few follow links within Javascript and virtually none will try to submit a form.
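To illustrate the point above, the same script can be exposed as a crawlable link or hidden behind a form submission. The script name and parameter here are hypothetical:

```html
<!-- Crawlable: spiders follow plain HREFs and run the script -->
<a href="/cgi-bin/report.pl?id=42">View report</a>

<!-- Rarely crawled: virtually no spider of this era submits forms -->
<form action="/cgi-bin/report.pl" method="post">
  <input type="hidden" name="id" value="42">
  <input type="submit" value="View report">
</form>
```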

- Use a .htaccess file to prohibit spambots and personal spiders like Teleport Pro, WebStripper and so on. A little searching on Google will bring up some sample .htaccess files. Be sure to TEST the .htaccess file, because the syntax is difficult and I find that its behaviour can be non-obvious.

- In extreme cases, use the .htaccess file to block IPs that are doing persistent spidering. Do a WHOIS lookup on the IP first, to make sure you know who you're prohibiting.
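The two .htaccess techniques above might be sketched as follows for Apache with mod_rewrite enabled. The agent patterns and the IP address are placeholder examples - build the real list from your own logs, and WHOIS any IP before blocking it:

```apache
# Refuse known site-rippers by User-Agent (requires mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC]
RewriteRule .* - [F]

# Block a persistently abusive IP (192.0.2.55 is a placeholder)
Order Allow,Deny
Allow from all
Deny from 192.0.2.55
```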

andreasfriedrich

5:07 pm on Dec 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Optimize your Perl scripts or - much, much better - rewrite them as C programs for a 95% performance boost. PHP is probably also faster and more resource-efficient, but I don't have first-hand experience with it.

The benchmarking Yahoo did seems to suggest that mod_php is neither faster nor more resource-efficient than mod_perl.

[webmasterworld.com...]

Andreas

jomaxx

5:35 pm on Dec 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that. I had foolishly assumed that because PHP seemed newer it would be an improvement over Perl, which can be amazingly slow.

optisoft

7:38 pm on Dec 8, 2002 (gmt 0)

10+ Year Member



I had a similar problem: I had a PR of 7 and was getting 100k good hits from Google a day. One day, out of the blue, my web host decided to shut down my site without prior warning. Within 2 days my site was dropped from Google, the PR was lost, and here I am fighting to get back up. My biggest problem was that I didn't have a backup copy of my site, so I had to start from scratch.

I would have negotiated a larger-scale deal with my hosting company, but given that they take such drastic, amateurish actions, I will never use them again, and I'll warn my friends not to use them either. If you are stepping into the big leagues, you need to be working with people who have the capability to function in the big leagues too. If I were in your shoes I would find a new hosting company - one that you can trust - and start tweaking your site to be more efficient.

I don't think trying to prevent Googlebot from doing its thing is the right course of action; it's possibly the best thing you have going for your site right now.

Unversed

1:03 am on Dec 9, 2002 (gmt 0)

10+ Year Member



Thanks for all the responses here. I emailed Google and they sent a nice email back saying they would reduce the crawl rate for my site. I'm still waiting for the web host to unsuspend me so I can't tell you how effective this was.

taxpod

3:04 am on Dec 9, 2002 (gmt 0)

10+ Year Member



Sounds like you are on a shared server. Is that the case? Get yourself a dedicated server and these sorts of problems should go away.

jomaxx

3:29 am on Dec 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Keep in mind that Googlebot isn't the root problem. You'll have this problem again and again unless you optimize the scripts, reduce spidering, and/or get a dedicated server. I've even had my dedicated server buried by especially badly behaved robots.