Some spiders are so badly behaved all you can do is whack 'em with a blunt object.
Several offline readers are the same -- they'll hit your site at zillions of requests a second rather than pace themselves.
If you don't notice in time -- say you were asleep during the attack -- and you have a very large site, a couple of such attacks could use up your bandwidth quota.
Which is why I decided to go proactive. Rather than ban after the event, I monitor request rates by IP address. Any misbehavior gets the IP banned for anything between 10 minutes and 24 hours.
While banned, I send back a very short page saying (something like) "spider misbehaving" and quote their UA id etc, and explain what they were doing wrong.
The page has no links on it, so they soon run out of harvested URLs to spider.
Then, later, I can use their search engine, find the "spider misbehaving" pages they've indexed, and email them to complain of the service they are giving their users.
Too late to edit the above:
There is a similar thread in progress:
That offers some PHP code to dynamically block IPs that are rampaging. Don't use PHP myself, but looks like it could be useful to those who do.
Being a search engine operator, what types of throttling do most sites like?
I generally have an 8 second delay between page fetches for each domain, and we try to keep an unsorted fetch list so that the rule rarely comes into play over the scale of a 1 million page fetch.
The only issue I have yet to work on is possibly doing an IP lookup to see whether we are hitting a bunch of hosts on a shared server or something similar, but that doesn't seem to be what you are complaining about.
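For anyone writing their own spider, here is a minimal Python sketch of a per-domain delay like the 8 second one described above. The class and method names (`PoliteScheduler`, `wait_time`, `record_fetch`) are made up for illustration and are not from any real crawler; it keys on hostname only, so it does not address the shared-server case.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Hypothetical helper: enforce a minimum gap between fetches per host."""

    def __init__(self, delay=8.0, clock=time.monotonic):
        self.delay = delay          # seconds between fetches to one host
        self.clock = clock          # injectable so the logic can be tested
        self.next_allowed = {}      # host -> earliest time of next fetch

    def _host(self, url):
        return urlparse(url).hostname or ""

    def wait_time(self, url):
        """Seconds the spider should still wait before fetching this URL."""
        now = self.clock()
        return max(0.0, self.next_allowed.get(self._host(url), now) - now)

    def record_fetch(self, url):
        """Call right after fetching; starts the delay for that host."""
        self.next_allowed[self._host(url)] = self.clock() + self.delay
```

A spider loop would call `wait_time(url)`, skip to another domain (or sleep) if the result is non-zero, and call `record_fetch(url)` after each fetch.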
|I generally have an 8 second delay between page fetches |
If that's all that was going on I probably wouldn't have lost my mind - 7 to 8 pages a minute is trivial.
I was watching as many as 4 pages a second get yanked off the site and the real problem is all my sites are database driven. When some serious amount of concurrent access hits the site then the disk churns and the CPU usage spikes to the point everything crawls. Doesn't crash mind you as it can handle the load, but trying to stop the abuse under a heavy load is like working in a time warp.
Victor - thanks for the link to the script, my site doesn't use PHP but I can easily adapt the concept.
Thanks for asking.
Generally, I do not want a bot sucking bandwidth / using CGI time any faster than a human would. Remember, the site could have many bots all clamoring for attention at any one moment.
The bots have got to be reasonable and give priority to the humans. The bots can keep going 24x7 to get more pages than a human ever would.
I put a crawl-delay of 15 seconds in when Microsoft's beta bot started rampaging. crawl-delay is a non-standard robots.txt command that a few bots respect.
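As a rough illustration of the spider's side of this, the sketch below reads a crawl-delay with Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the `crawl_delay_for` helper and the 8 second fallback are all assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (made up for illustration).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 15
Disallow: /cgi-bin/private/
"""


def crawl_delay_for(robots_txt, user_agent, default=8.0):
    """Return the crawl-delay in seconds for user_agent, or default if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)   # None when no crawl-delay applies
    return float(delay) if delay is not None else default


print(crawl_delay_for(ROBOTS_TXT, "ExampleBot"))
```

Since crawl-delay is non-standard, a bot that ignores it will not be slowed by this line at all; it only helps with the bots that choose to respect it.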
Perhaps you could sample some robots.txt files and see what crawl-delays are common....Or maybe ask Brett's robots' survey to check for that:
|The bots can keep going 24x7 to get more pages than a human ever would |
My primary site would generate between 40K-80K pages and if someone wanted that in a hurry it would be very problematic. The server that was hit yesterday averages about 300 kbps all day long, so when it suddenly skyrockets to 2 Mbps it's real easy to spot an issue. I actually have historical traffic graphs for all my servers and domains, with a 12 month history to see long term trends (and abuse spikes). The amazing thing is that, based on the traffic graphs, this really doesn't happen that frequently: nothing on this scale in the last 2 months anyway.
How are you guys tracking real time traffic? Are you running your own servers? HOW are you DOING IT?
I use a rather arcane tool called MRTG (Multi Router Traffic Grapher) that is updated in 5 minute increments.
I have it set up to page my cell phone when the CPU exceeds 90% or bandwidth spikes over 200% so I can catch abuse in real time. However, when you forget you had the ringer turned off on the phone it doesn't help much :)
And yes, it's on my own server.
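The paging rule above boils down to two threshold checks. Here is a small Python sketch of that logic only; it is not MRTG configuration, the function name `should_page` is made up, and it assumes "over 200%" means more than twice the normal bandwidth.

```python
def should_page(cpu_percent, current_kbps, baseline_kbps,
                cpu_limit=90.0, bandwidth_ratio=2.0):
    """Return the reasons to send a page; an empty list means stay quiet.

    Assumption: a bandwidth spike is current traffic above
    bandwidth_ratio times the normal (baseline) level.
    """
    reasons = []
    if cpu_percent > cpu_limit:
        reasons.append("cpu")
    if baseline_kbps > 0 and current_kbps > bandwidth_ratio * baseline_kbps:
        reasons.append("bandwidth")
    return reasons
```

With a 300 kbps baseline, a jump to 2000 kbps trips the bandwidth check, while normal traffic at 50% CPU trips nothing.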
Can you use the mod_throttle to slow down the ip's eating up your resources? (may be more cpu intensive than it's worth).
I can't stand bots that go nuts, and it is bad practice. Have you verified that the bot is all from a single host, or was it simply that they built a new fetch list that had every one of your pages in it and you had a few dozen of the msn spiders hitting your site?
I typically run 4 spiders at once, so I'm always thinking of ways to keep them behaving but sustain a fresh index. :)
|Can you use the mod_throttle to slow down the ip's eating up your resources? |
Nope - we used to use mod_throttle - but the author said he's not upgrading it to work in Apache 2.0 and I'm sure as heck not going back to Apache 1.3 so alternative strategies will need to be employed.
It wasn't a mainstream spider like MSN, it was one I didn't care about so I just blocked the IP.
|How are you guys tracking real time traffic? Are you running your own servers? HOW are you DOING IT? |
All my sites are dynamic, not static. That is, a browser or spider connects to a CGI script not a pre-made page.
The common code at the start of each of my scripts calls a bad-bot check routine.
That keeps a sliding window of the last 3 minutes' worth of requests. If an IP address is on that list too many times, it gets banned for 5 minutes -- i.e. it'll get the "too fast" message for any request in the next 5 minutes.
But if it's still spidering too fast when the 5 minute ban expires, it gets banned for 30 minutes -- with a slightly tougher message.
Then two hours.
Then 24 hours.
Anyone persistently stupid enough to get themselves into the 24-hour ban list, I review for manual eternal zapping.
That's a more complex scheme than that PHP solution (see message 3 in this thread). But it works for me, and it allows bad bots or offline readers to adjust their behavior and return later as a welcome guest.
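For those who want to adapt the idea, here is a minimal Python sketch of the sliding window with escalating bans described above. It is not the poster's actual script: the names (`BadBotCheck`, `check`), the 60-request threshold and the in-memory storage are assumptions made for illustration. Only the 3 minute window and the 5 minute / 30 minute / 2 hour / 24 hour steps come from the post.

```python
import time
from collections import defaultdict, deque

WINDOW = 3 * 60                                      # sliding window, seconds
BAN_STEPS = [5 * 60, 30 * 60, 2 * 3600, 24 * 3600]   # escalating ban lengths


class BadBotCheck:
    """Hypothetical per-IP rate check with escalating bans."""

    def __init__(self, max_hits=60, clock=time.time):
        self.max_hits = max_hits          # assumed threshold per window
        self.clock = clock                # injectable for testing
        self.hits = defaultdict(deque)    # ip -> recent request times
        self.banned_until = {}            # ip -> time the ban expires
        self.level = defaultdict(int)     # ip -> index of next ban step

    def check(self, ip):
        """Record a request; return True to serve it, False to send the ban page."""
        now = self.clock()
        window = self.hits[ip]
        window.append(now)
        # Drop requests that have fallen out of the 3 minute window.
        while window[0] <= now - WINDOW:
            window.popleft()
        if now < self.banned_until.get(ip, 0.0):
            return False                  # still serving out a ban
        if len(window) > self.max_hits:
            # Too fast: apply the next ban step (the last one repeats).
            step = min(self.level[ip], len(BAN_STEPS) - 1)
            self.banned_until[ip] = now + BAN_STEPS[step]
            self.level[ip] = step + 1
            return False
        return True
```

Requests made during a ban are still counted, so a bot that keeps hammering gets the next, longer ban as soon as the current one expires, while one that backs off finds an empty window and is served again. The sketch never resets the ban level or purges idle IPs, and it leaves the 24-hour list for manual review, so a real version would need both cleanup and persistent storage shared between CGI processes.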