Forum Moderators: open
My webhost has suspended my site because they say that multiple instances of the Perl scripts that serve the pages are consuming too many resources on the server. Not great.
I changed robots.txt to restrict the kinds of pages that get spidered, but that still leaves many thousands, and in any case my host suspended the site again before Google revisited robots.txt.
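For reference, a robots.txt that keeps compliant spiders out of script-generated areas looks something like this (the paths here are made-up examples - substitute the directories your own scripts live in):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /search/
```

Note that crawlers only pick up changes the next time they fetch robots.txt, which is exactly the revisit delay described above.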
Obviously I want my site spidered, but not at the expense of having it shut down - and I want the site to stay up, but not at the expense of frightening off Google.
Any ideas?
At the moment you sound very vague about what you think the cause is - unless you actually know what the problem is, there is no really effective way to offer help. It could be search engines going hyper (in which case you need to contact them), people copying your site (often lots of requests at high speed), a bug in certain scripts, or anything in between!
Review the site logs - what was really hitting your site the heaviest over the period you are interested in? It might be the googlebot, it might not...
The key things to look for are which pages are being hit, how fast they are being hit and what is requesting them (User-Agent & IP). This way you might be able to narrow down the problem to one or two types of script which are the major resource drains.
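As a rough sketch of that kind of log triage (this assumes a combined-format Apache access log; the sample entries and the /tmp path are purely for illustration):

```shell
# Create a tiny sample access log in combined format, just for demonstration.
cat > /tmp/sample_access.log <<'EOF'
66.249.64.1 - - [12/Jan/2003:10:00:01 +0000] "GET /page1 HTTP/1.0" 200 1024 "-" "Googlebot/2.1"
66.249.64.1 - - [12/Jan/2003:10:00:01 +0000] "GET /page2 HTTP/1.0" 200 1024 "-" "Googlebot/2.1"
10.0.0.5 - - [12/Jan/2003:10:00:02 +0000] "GET /page1 HTTP/1.0" 200 1024 "-" "Mozilla/4.0"
EOF

# Requests per client IP, busiest first (field 1 is the IP).
awk '{print $1}' /tmp/sample_access.log | sort | uniq -c | sort -rn

# Requests per User-Agent, busiest first (the UA is the last quoted field).
awk -F'"' '{print $6}' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

Run the same pipelines over the real log and the heaviest hitter usually jumps straight out; add `grep` on a URL path to see which scripts it is hammering.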
Assuming there is no readily apparent cause (i.e. it might have been only a modest amount of traffic that caused your problems), your next step would be to stress-test a local copy of your site with the same level of traffic you saw in your logs (using a stress-test utility such as ApacheBench), monitoring how much CPU usage and disk access is happening.
From here you can again start to narrow it down and hopefully you will eventually find the root cause of your problems.
- Tony
Googlebot crawlers are taking 5 or 6 pages a second from my site. The logs show pages of Googlebot entries before you see a line from another spider or - heaven help us - a real visitor. I'm not sure my host is being unreasonable, considering the other sites sharing the server. I thought I saw somewhere that Googlebot would not hit more than once every 2 seconds, which would probably be OK.
I have emailed Google but it's Sunday...
Here are some ideas that might help:
- Optimize your Perl scripts or - much much better - rewrite them as C programs for a 95% performance boost. PHP is probably also much faster and resource-efficient but I don't have first-hand experience with it.
- Don't have plain HREF links to scripts unless you want the output to be indexed. Many spiders ignore the robots.txt file, but few follow links within Javascript and virtually none will try to submit a form.
- Use a .htaccess file to prohibit spambots and personal spiders like Teleport Pro, WebStripper and so on. A little searching on Google will bring up some sample .htaccess files. Be sure to TEST the .htaccess file, because the syntax is difficult and I find that its behaviour can be non-obvious.
- In extreme cases, use the .htaccess file to block IPs that are doing persistent spidering. Do a WHOIS lookup on the IP first, to make sure you know who you're prohibiting.
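A minimal sketch of the kind of .htaccess rules those last two points describe (the agent names match the examples above, but the IP address is a placeholder - and do test this on your own server, as warned):

```apache
# Tag known offline-copier user-agents (requires mod_setenvif).
SetEnvIfNoCase User-Agent "Teleport Pro" bad_bot
SetEnvIfNoCase User-Agent "WebStripper"  bad_bot

# Refuse tagged agents and one persistently abusive IP (example address).
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 192.0.2.55
```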
Optimize your Perl scripts or - much much better - rewrite them as C programs for a 95% performance boost. PHP is probably also much faster and resource-efficient but I don't have first-hand experience with it.
The benchmarking Yahoo did seems to suggest that mod_php is neither faster nor more resource-efficient than mod_perl.
[webmasterworld.com...]
Andreas
I don't think trying to prevent Googlebot from doing its thing is the right course of action; it's possibly the best thing you have going for your site right now.