Unfortunately, when they all descend at once, my server's response times deteriorate badly.
CPU is my bottleneck here - not bandwidth.
I want to try and limit the amount of CPU available for handling major crawler requests so that I can keep enough CPU free for my real-life users. Of course, I don't want to BAN the spiders - I just want to restrict the intensity of their crawling.
I'm running Linux and Apache. I've thought about mod_throttle, but it's not quite right (as I say, it's CPU usage, not bandwidth, that I want to control).
I've been thinking about running an additional httpd on a different port (say, one on port 81 as well as the usual port 80) with a different "nice" value, and redirecting spiders that arrive on port 80 to the "nicer" process on port 81, so that I keep enough CPU free for my other users. I don't think this will work, though: once the pages are indexed, the "wrong" port (81) server will appear in search results and I'll be back where I started.
The only other approach I can think of is to set up a "proxy" httpd in place of my normal httpd on port 80, then set up two other httpds with appropriate nice values on, say, ports 81 and 82, and proxy each request to one or the other based on the user-agent/remote IP.
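Roughly, I'm picturing something like this in the front-end's httpd.conf (just a sketch - it assumes mod_rewrite and mod_proxy are available, that the back-ends on 81 and 82 are already running with their nice values set, and the user-agent pattern is only an example):

# Front-end httpd on port 80: send known spiders to the low-priority
# (reniced) back-end on 81, everyone else to the normal back-end on 82.
# Assumes mod_rewrite and mod_proxy are loaded.
RewriteEngine On

# Requests whose User-Agent looks like a major crawler get proxied
# to the "nice" instance.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp) [NC]
RewriteRule ^/(.*)$ http://localhost:81/$1 [P,L]

# Everything else goes to the normal-priority instance.
RewriteRule ^/(.*)$ http://localhost:82/$1 [P,L]

Since the proxying happens internally, the URLs the spiders see stay on port 80, so the "wrong port in the index" problem from the two-port idea above goes away.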
Are there any other simpler approaches that anyone could suggest before I go down this route?
Googlebot hits harder than any of the other bots, and it's currently hitting my sites at about 14 hits per minute, or roughly one hit every 4 seconds. That is really not that much for a well-designed application. The 15-minute load average for me right now is 0.25 on a 1.3GHz Celeron server.
I would focus on making your dynamic pages able to handle the load, rather than how to decrease spidering activity.
<Added>
Every single page Google has been grabbing for the last few days is generated dynamically upon request, so it's not like I'm just shooting out static pages.
I have been through this as well. For each of the last two Google crawls I had to improve my application to handle the load: once to handle the crawl itself, then again to handle the crawl plus the traffic the previous crawl brought in.
</Added>
-Pete
Each page takes 0.25-0.5 seconds of CPU time to generate (optimised down from what used to be 1 second or more), and I get bursts of up to 5 requests per second for minutes at a time - at 0.25-0.5 CPU-seconds per page that's 1.25-2.5 seconds of CPU work arriving every second, which a single CPU clearly can't keep up with. At these peak times, the load average regularly hits 100+ before quieting down to normal levels.
Getting a new server or a better CPU is one step, but most of the time my response is fine; it's just the peaks that I can't deal with effectively right now...
For example, a dynamic portion of the page which is the same for everyone could be made static, and the static portion could be updated once a minute/hour/day or something.
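As a very rough sketch of how I'd wire that up in Apache (assuming a cron job, or the app itself, rebuilds pre-rendered copies under a /cache directory on whatever schedule makes sense - the path and the .html extension are just placeholders):

# Serve a pre-generated static copy if one exists; otherwise fall
# through to the normal dynamic handler. Assumes something rebuilds
# the files under /cache every minute/hour/day.
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/cache/$1.html -f
RewriteRule ^/(.*)$ /cache/$1.html [L]

(This works best for pages addressed by path; pages that vary on query strings need a little more thought.)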
That has brought me huge performance gains. I'd be willing to offer suggestions if I can get more details about the application.
StickyMail me the address even, and I can give it a quick look-see.
-Pete
I think I have a solution for you - a throttle based on CPU usage. It will block anything that uses more than a specified share of CPU time within a specified period, for a specified length of time (in plain English: with the defaults, a client that uses more than 15% of the CPU over 15 seconds gets blocked for 10 minutes).
It blocks with a relatively kind 503 (Service Unavailable) response rather than an outright ban.
Just google "Stonehenge::Throttle" and that should do it for you!
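I don't have the docs in front of me, but the httpd.conf wiring is roughly along these lines (sketch only - it needs mod_perl, and the path and handler phase here are my assumptions; the module's own documentation gives the exact directives and tunables):

# Rough mod_perl wiring for Stonehenge::Throttle (sketch - check the
# module's docs for the exact handler phase and default limits).
PerlRequire /path/to/Stonehenge/Throttle.pm
PerlAccessHandler Stonehenge::Throttle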
dave