I need this web site indexed, but I don't want Google to take the server down every day. What can I do? Are there special rules in robots.txt that I could use?
Q: Googlebot is crawling my site too fast. What can I do?
A: Please contact us with the URL of your site and a detailed description of the problem. Please also include a portion of the weblog that shows Google accesses so we can track down the problem quickly.
Google Webmaster Support [google.com]
Googlebot used to bring my main sites and Internet connection down repeatedly ~8 years ago.
1) I sent them an email, and they tweaked things to be less toxic. You can still do the same now AFAIK.
2) In their Webmaster services you can ask the bot to crawl more slowly than it otherwise would. I do that on one of my sites that is really only a fallback, for example.
3) On the grounds that NO one remote entity should be able to bring your site down casually, put in behaviour-based controls that throttle the amount of traffic/load that any one remote /24 (Class C) or similar set of addresses/hosts/users can impose. This will save you from all sorts of other DoS grief too. (Won't help with DDoS, but not much will.)
4) Your system is more powerful than most of mine and yet I survive the Googlebot plus lots of other less well-behaved bots/scrapers/idiots. How? Partly (3) and partly by tuning the site code to keep the costs of most operations down, and caching the results of others. What is your normal page generation time? Seconds or milliseconds? If seconds then (a) you'll be irritating humans and (b) you won't keep up with most spiders' demands either.
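To make point (3) concrete, here is a minimal sketch of a per-/24 throttle in Python. It is not what the poster actually runs. The class name, the limits (100 requests per 60 seconds) and the /64 grouping for IPv6 are all assumptions chosen for illustration, and the addresses come from the reserved documentation range.

```python
import ipaddress
import time
from collections import defaultdict, deque


class SubnetThrottle:
    """Sliding-window request limiter keyed by the client's /24 network.

    Hypothetical helper: the name and default limits are illustrative only.
    """

    def __init__(self, max_requests=100, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                 # injectable so the logic can be tested
        self.hits = defaultdict(deque)     # subnet -> timestamps of recent requests

    @staticmethod
    def subnet_key(ip):
        """Map an address to its /24 (IPv4) or /64 (IPv6, an assumption) network."""
        addr = ipaddress.ip_address(ip)
        prefix = 24 if addr.version == 4 else 64
        return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

    def allow(self, ip):
        """Return True if this request fits within the subnet's budget."""
        now = self.clock()
        recent = self.hits[self.subnet_key(ip)]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False                   # caller might answer 429 or 503 here
        recent.append(now)
        return True


if __name__ == "__main__":
    throttle = SubnetThrottle(max_requests=2, window_seconds=10)
    for ip in ("192.0.2.1", "192.0.2.200", "192.0.2.77"):
        print(ip, throttle.allow(ip))
```

All three addresses share one /24, so with a budget of two the third request is refused no matter which host in the block sends it. In a real deployment the same idea usually lives in the web server or firewall rather than in application code.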
We get nailed constantly by Google's robots in a triple whammy: the indexer, the Google News bot (we're a news source), and the AdWords bot. And we're on a shared server at Pair.com with hundreds of other users. No problems at all.
If you're using PHP 3 or some really old, inefficient software, maybe that's it. Or try caching, such as jpcache.
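For anyone wondering what the caching suggestion buys you, here is a rough Python sketch of full-page output caching. It is not jpcache itself, just the general idea. The class name, the 300-second lifetime and the render callback are made up for illustration.

```python
import time


class PageCache:
    """Keep rendered pages in memory for a fixed time-to-live.

    Hypothetical helper: name, TTL and in-memory storage are assumptions.
    """

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock        # injectable so expiry can be tested
        self.store = {}           # url -> (time rendered, page body)

    def get_or_render(self, url, render):
        """Return the cached body for url, calling render(url) only when stale."""
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]       # cheap path: no page generation at all
        body = render(url)        # expensive path: build the page once
        self.store[url] = (now, body)
        return body


if __name__ == "__main__":
    def slow_render(url):
        time.sleep(0.5)           # stand-in for database queries and templating
        return f"<html>page for {url}</html>"

    cache = PageCache(ttl_seconds=60)
    cache.get_or_render("/news/1", slow_render)   # pays the full render cost
    cache.get_or_render("/news/1", slow_render)   # served from memory
```

A spider that requests the same page repeatedly then costs one render per TTL period instead of one per hit, which is why page generation time matters more than raw hardware.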