Unfortunately, when they all descend at once, my server's response times deteriorate badly.
CPU is my bottleneck here - not bandwidth.
I want to try and limit the amount of CPU available for handling major crawler requests so that I can keep enough CPU free for my real-life users. Of course, I don't want to BAN the spiders - I just want to restrict the intensity of their crawling.
I'm running Linux and Apache. I've thought about mod_throttle, but it's not quite right (as I say, it's CPU usage, not bandwidth, that I want to control).
I've been thinking about running an additional httpd on a different port (say, one on port 81 as well as the usual port 80) with a different "nice" value, and redirecting spiders that arrive on port 80 to the "nicer" process on port 81, so that I keep enough CPU free for my other users. I don't think this will work, though: once the pages are indexed, the "wrong" port (81) server will appear in search results and I'll be back where I started.
The only other approach I can think of is to set up a "proxy" httpd in place of my normal httpd on port 80, then set up two other httpds with appropriate nice values on, say, ports 81 and 82, and proxy each request to one or the other based on the user-agent/remote IP.
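Roughly, I'm picturing something like this in the front-end's httpd.conf (just a sketch - it assumes mod_rewrite and mod_proxy are available, that the back-ends on 81 and 82 are already running with their nice values set, and the user-agent pattern is only an example):

# Front-end httpd on port 80: send known spiders to the low-priority
# (reniced) back-end on 81, everyone else to the normal back-end on 82.
# Assumes mod_rewrite and mod_proxy are loaded.
RewriteEngine On

# Requests whose User-Agent looks like a major crawler get proxied
# to the "nice" instance.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp) [NC]
RewriteRule ^/(.*)$ http://localhost:81/$1 [P,L]

# Everything else goes to the normal-priority instance.
RewriteRule ^/(.*)$ http://localhost:82/$1 [P,L]

Since the proxying happens internally, the URLs the spiders see stay on port 80, so the "wrong port in the index" problem from the two-port idea above goes away.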
Are there any other simpler approaches that anyone could suggest before I go down this route?
Googlebot hits harder than any of the other bots, and it's currently hitting my sites at about 14 hits per minute, or roughly one hit every 4 seconds. That is really not that much for a well-designed application. The 15-minute load average for me right now is 0.25 on a 1.3GHz Celeron server.
I would focus on making your dynamic pages able to handle the load, rather than how to decrease spidering activity.
<Added>
Every single page Google has been grabbing for the last few days is generated dynamically upon request, so it's not like I'm just shooting out static pages.
I have been through this as well. For each of the last two Google crawls I had to improve my application to handle the load: once to handle the crawl itself, then again to handle the crawl plus the traffic the previous crawl brought in.
</Added>
-Pete
Each page takes 0.25-0.5 seconds of CPU time to generate (optimised down from what used to be 1 second or more), and I get bursts of up to 5 requests per second for minutes at a time - at 0.25-0.5 CPU-seconds per page that's 1.25-2.5 seconds of CPU work arriving every second, which a single CPU clearly can't keep up with. At these peak times, the load average regularly hits 100+ before quieting down to normal levels.
Getting a new server or a better CPU is one step, but most of the time my response is fine; it's just the peaks that I can't deal with effectively right now...
For example, a dynamic portion of the page which is the same for everyone could be made static, and the static portion could be updated once a minute/hour/day or something.
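As a very rough sketch of how I'd wire that up in Apache (assuming a cron job, or the app itself, rebuilds pre-rendered copies under a /cache directory on whatever schedule makes sense - the path and the .html extension are just placeholders):

# Serve a pre-generated static copy if one exists; otherwise fall
# through to the normal dynamic handler. Assumes something rebuilds
# the files under /cache every minute/hour/day.
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/cache/$1.html -f
RewriteRule ^/(.*)$ /cache/$1.html [L]

(This works best for pages addressed by path; pages that vary on query strings need a little more thought.)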
That has brought me huge performance gains. I'd be willing to offer suggestions if I can get more details about the application.
StickyMail me the address even, and I can give it a quick look-see.
-Pete
I think I have a solution for you - a throttle based on CPU usage. It will block anything that uses more than a specified share of CPU time within a specified period, for a specified length of time (in plain English: with the defaults, a client that uses more than 15% of the CPU over 15 seconds gets blocked for 10 minutes).
It blocks with a relatively kind 503 (Service Unavailable) response rather than an outright ban.
Just google "Stonehenge::Throttle" and that should do it for you!
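I don't have the docs in front of me, but the httpd.conf wiring is roughly along these lines (sketch only - it needs mod_perl, and the path and handler phase here are my assumptions; the module's own documentation gives the exact directives and tunables):

# Rough mod_perl wiring for Stonehenge::Throttle (sketch - check the
# module's docs for the exact handler phase and default limits).
PerlRequire /path/to/Stonehenge/Throttle.pm
PerlAccessHandler Stonehenge::Throttle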
dave