|Googlebot brings my server down every day :(|
googlebot ddos dos attack
I have a dual Xeon with 4GB of RAM, and I still have downtime problems. Further investigation showed that the culprit is Googlebot: it seems to grab the whole site at once, spawning a huge number of simultaneous httpd processes. From my terminal the load average can reach 50, and then I need to reboot the poor server.
I need this web site indexed, but I don't want Google to take the server down every day. What can I do? Are there special rules in robots.txt that I could use?
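For what it's worth, robots.txt does have an unofficial Crawl-delay directive, but Googlebot is one of the bots that ignores it; historically it only helped with crawlers such as Yahoo's Slurp and msnbot. A sketch, in case it helps with other bots hitting the same box:

```
User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10
```

For Googlebot itself, the supported knob is the crawl-rate setting in Google's webmaster tools, not robots.txt.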
Here's what Google suggests:
|Q: Googlebot is crawling my site too fast. What can I do? |
A: Please contact us with the URL of your site and a detailed description of the problem. Please also include a portion of the weblog that shows Google accesses so we can track down the problem quickly.
Google Webmaster Support [google.com]
A dual Xeon with 4 GB of RAM should be able to handle Google when it comes calling, unless you have scripts that use a lot of CPU or a process running that thrashes the cache.
I would have thought your server would be able to handle it also.
You could tell Google to crawl your site more slowly, via the Google webmaster control (Sitemaps).
Also, if you're using PHP, you could try a PHP opcode cache like XCache, APC, etc.
Googlebot used to bring my main sites and Internet connection down repeatedly ~8 years ago.
1) I sent them an email, and they tweaked things to be less toxic. You can still do the same now AFAIK.
2) In their Webmaster services you can ask the bot to crawl more slowly than it otherwise would. I do that on one of my sites that is really only a fallback, for example.
3) On the grounds that NO one remote entity should be able to bring your site down casually, put in behaviour-based controls that throttle the amount of traffic/load any one remote /24 (Class C) or similar set of addresses/hosts/users can impose. This will save you from all sorts of other DoS grief too. (It won't help with DDoS, but not much will.)
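That per-/24 throttle from step (3) could be sketched as a sliding-window counter keyed on the source network, something like this (names, window, and limit are made up for illustration; wire it into your front end however fits):

```python
import time
from collections import defaultdict, deque

WINDOW = 1.0   # sliding window, in seconds
LIMIT = 25     # max requests per /24 per window

_hits = defaultdict(deque)  # /24 prefix -> timestamps of recent requests

def slash24(ip):
    """Collapse a dotted-quad IPv4 address to its /24 prefix."""
    return ".".join(ip.split(".")[:3])

def allow(ip, now=None):
    """Return True if this request fits in the per-/24 budget."""
    now = time.monotonic() if now is None else now
    q = _hits[slash24(ip)]
    # drop timestamps that have aged out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False
    q.append(now)
    return True
```

Anything over the budget gets a 503 (or a tarpit) instead of a full page render, so one hot network can't monopolize your httpd workers.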
4) Your system is more powerful than most of mine and yet I survive the Googlebot plus lots of other less well-behaved bots/scrapers/idiots. How? Partly (3) and partly by tuning the site code to keep the costs of most operations down, and caching the results of others. What is your normal page generation time? Seconds or milliseconds? If seconds then (a) you'll be irritating humans and (b) you won't keep up with most spiders' demands either.
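The "cache the results" half of step (4) can be as small as a time-based memo around your expensive renders; a minimal sketch (the decorator and function names are hypothetical, not from any post above):

```python
import time

def ttl_cache(ttl_seconds):
    """Cache a function's result per-arguments for ttl_seconds."""
    def wrap(fn):
        store = {}  # args -> (timestamp, value)
        def cached(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]          # still fresh: skip the real work
            value = fn(*args)
            store[args] = (now, value)
            return value
        return cached
    return wrap

@ttl_cache(30)
def render_front_page():
    # stand-in for an expensive, database-backed page render
    return "<html>...</html>"
```

A spider re-fetching the same URLs then mostly hits cheap cached copies, which is what turns "seconds" into "milliseconds" for the hot pages.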
What do you have, like a billion pages? With that server I'd think you'd be able to survive for quite some time, even under a prolonged attack.
My advice is to sign up for Webmaster Tools; there you can select a slower crawl rate. But to be honest, if your hosting can't handle Googlebot, think what happens when you have a few visitors. Your hosting should be your priority to look at, not Google.
Are you sure it's the Googlebot, and not some problem with your site's software or server?
We get nailed constantly by Google's robots, in a triple whammy: the indexer, the Google News bot (we're a news source), and the AdWords bot. And we're on a shared server at Pair.com with hundreds of other users. No problems at all.
If you're using PHP 3 or some really old, inefficient software, maybe that's it. Or try caching, such as jpcache.
Thank you, guys. I asked Google to go easier on my server, and it seems that solved the problem. I don't get these deluges of simultaneous Googlebot connections anymore.