Can we schedule crawl times for googlebot?
Short version: I would like to disable Google's indexing during certain days/times. I am also concerned about changing my robots.txt file and what the long term effects are (do they not come back?)
My company hosts monthly sales on our website where our traffic increases dramatically. Yesterday, the need for heavy load testing and optimization was pushed to the forefront during the perfect storm of heavy use and a Google crawl.
In just a couple of hours, during our heaviest period of use, Google downloaded well over a gigabyte of data (not to mention the stress on the SQL server). That was more than enough to push the system past the tipping point into horrible performance (1000% of typical load time).
Does anyone have experience with similar problems? Were you able to find acceptable solutions while still maintaining good search results? We are currently top 3 on all our important search terms and phrases and I would hate to lose that. But if our site doesn't work, that is worse.
Hello JustBarno, and welcome to the forums.
I've been looking for the Google reference and can't locate it right now, but the essence of the answer is that no, it's not a good idea. This video gets close: [youtube.com...]
Usually the crawl team does a good job with allocating crawl resources in a way that doesn't hurt the server. You can ask googlebot to crawl more slowly, but that often has other, negative repercussions.
From your description of the problem, it sounds like Google needs to retrieve the full page for every request - database calls and all that. Have you considered server-side caching and then replying with a 304 status if the page hasn't changed?
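To illustrate the idea: a conditional GET lets the crawler revalidate a page without you re-rendering it. This is only a minimal sketch (the `respond` helper and its signature are my own invention, not anything from the thread) showing how an ETag computed from the rendered page can short-circuit to a 304:

```python
import hashlib


def respond(page_body, if_none_match):
    """Hypothetical handler helper: answer a conditional GET.

    page_body     -- the rendered page as bytes
    if_none_match -- value of the client's If-None-Match header, or None

    Returns a (status, headers, body) tuple. If the client's cached
    ETag still matches, we send 304 with no body; if the ETag itself
    is cached server-side, no database work is needed at all.
    """
    etag = '"%s"' % hashlib.sha1(page_body).hexdigest()
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, page_body
```

In practice you would cache the ETag (or a last-modified timestamp) per page so that the revalidation check never touches SQL.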
Thanks Tedster, that video was very informative. Normally I think you're right that it wouldn't hurt our server, but the problem is that we were already near the tipping point. I'll look into server-side caching, but our pages are almost constantly changing (new high bids, auctions closing, and so on).
Was this a one time problem, a once-in-a-while problem, or a regular problem? If it happens more than a little bit, Google will also be experiencing the server delays and should adjust their crawl rate without you taking any action. At least that's the way it's supposed to work, and it often does.
Optimize, optimize, optimize!
I'm sure you've done all of that, but run Firebug and PageSpeed to double-check. Before doing anything else, you want to reduce the size of... everything.
As far as I know, Googlebot always reads robots.txt before crawling your pages. Check your log files to see how often Google fetches it. If it happens several times a day, you might be helped by a script that overwrites the robots.txt file during certain hours, adding a crawl-delay instruction. After the heaviest (peak) time has passed, it would clear the rule so the robot can continue indexing at its previous speed.
That might seem like an option, but it is exactly the kind of thing Matt Cutts warns about in the video I linked to above: Can I use robots.txt to optimize Googlebot's crawl? [youtube.com]
Well, if you are really desperate, you could do some IP/agent sniffing and serve a bodyless 503 to Googlebot at the high times.
Bit risky, though...
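A rough sketch of that last suggestion, purely as an assumption about how one might wire it up (the function, the overload flag, and the one-hour Retry-After are all made up for illustration): a 503 with Retry-After tells the crawler the outage is temporary, which is less risky than serving it different content.

```python
def maybe_shed_crawler(user_agent, server_overloaded):
    """Hypothetical request gate: shed crawler traffic under load.

    If the server is overloaded and the request looks like Googlebot,
    return a bodyless 503 with Retry-After so the crawler backs off
    and retries later. Return None to fall through to normal handling.
    """
    if server_overloaded and "Googlebot" in (user_agent or ""):
        return 503, {"Retry-After": "3600"}, b""
    return None
```

User-agent sniffing alone is spoofable; verifying the client IP via reverse DNS would make this safer, but that is beyond the sketch.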