|Ban spiders by the Connection: close header (idea)|
I have an idea for banning spiders based on the Connection: close header.
These days lots of spiders are crawling the site like crazy. I have an idea here; I am not using it yet because it seems too risky, so I want your opinion.
If (User-Agent = IE/Firefox/Opera)
And Connection = close (in the headers)
And the web server's default setting is keep-alive
Then I think it's a spider. What do you think?
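A minimal sketch of this rule in Python, assuming the request headers arrive as a dict of lower-cased names; the browser token list and the matching logic are illustrative assumptions, not a tested rule:

```python
# Hedged sketch of the proposed heuristic: flag a request as a possible
# spider when the User-Agent claims to be a major browser but the client
# sends "Connection: close" even though the server defaults to keep-alive.

BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")

def looks_like_spider(headers: dict) -> bool:
    """headers: dict mapping lower-cased header names to values."""
    ua = headers.get("user-agent", "")
    claims_browser = any(tok in ua for tok in BROWSER_TOKENS)
    wants_close = headers.get("connection", "").lower() == "close"
    return claims_browser and wants_close

# Example: a claimed IE request that asks to close the connection
suspect = {
    "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "connection": "close",
}
print(looks_like_spider(suspect))  # True under this heuristic
```

As the replies below point out, this alone would also catch legitimate browsers behind proxies, so it is better used as one signal among several.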
"Connection: close" is a valid header that can be sent by Firefox/Opera/IE. I usually only see it when the request goes through a proxy server or a similar service. So I wouldn't block a client on this alone, but you can use it as a tip-off to do more checks.
A better header to check is the "Accept" header, which the major web browsers send, though again some proxy servers will remove it. But it's easier to check for common proxy server headers. If the Accept header is missing, it's more than likely a bot or a client coming through a proxy server. FYI, mobile browsers are a crap shoot at including the Accept header.
I use the Accept Header method to remove a lot of spoofing bot traffic on my site. It keeps my bandwidth bill light and scrapers moving on to easier targets.
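One way to sketch this check, assuming lower-cased header names; the list of proxy-revealing headers here is a common but illustrative choice:

```python
# Hedged sketch: a missing Accept header suggests a bot, unless common
# proxy headers explain it away. Treat the result as a hint, not proof.

PROXY_HEADERS = ("via", "x-forwarded-for", "forwarded")

def classify(headers: dict) -> str:
    has_accept = "accept" in headers
    behind_proxy = any(h in headers for h in PROXY_HEADERS)
    if has_accept:
        return "ok"
    # No Accept header: could be a proxy stripping it, or a bot
    return "proxy?" if behind_proxy else "likely-bot"
```

A real deployment would combine this with other signals before blocking, since mobile browsers (as noted above) are inconsistent about sending Accept.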
[edited by: Ocean10000 at 3:22 pm (utc) on Mar. 16, 2008]
I agree with you that "Accept" and "Accept-Language" are good headers for detecting spoofing bots.
I have a problem with something spoofing IE7.
Here is the header:
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
Looks pretty right? The problem is, my robot trap catches this header from lots of different IPs in a short period of time, and the strange thing is, no matter where the IPs are from, the Accept-Language is always en-us, the Accept header is identical (meaning they must have the same installed software), and the User-Agent is identical too (meaning they even have the same patches)!
Too odd not to think it's a faked IE7. Any ideas?
So I think I will block an IP if:
Connection = Close
and Accept = image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
and Accept-Language = en-us
and User-Agent = Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
What do you think?
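That rule amounts to matching an exact header fingerprint. A sketch of what that could look like, assuming lower-cased header names and case-insensitive value comparison (both assumptions on my part):

```python
# Hedged sketch: block only when every header in the suspect fingerprint
# matches exactly. Narrow enough to avoid most false positives, but any
# change in the bot's headers will slip past it.

SUSPECT_FINGERPRINT = {
    "connection": "close",
    "accept": ("image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, "
               "application/x-shockwave-flash, application/vnd.ms-excel, "
               "application/vnd.ms-powerpoint, application/msword, */*"),
    "accept-language": "en-us",
    "user-agent": ("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
                   ".NET CLR 2.0.50727; .NET CLR 1.1.4322)"),
}

def matches_fingerprint(headers: dict) -> bool:
    return all(headers.get(k, "").lower() == v.lower()
               for k, v in SUSPECT_FINGERPRINT.items())
```

The obvious risk: a real IE7/XP user behind a proxy could present exactly this combination, which is why the thread leans toward scoring multiple signals instead of hard-blocking on one.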
The majority of these have been superseded by two additional updates:
.NET CLR 2.0.50727; .NET CLR 3.0.04506.30
Check for declared support for gzip - all modern browsers support it, so this would cut down your bandwidth costs anyway, and anything that does not support gzip is highly likely to be a bot - those well-written bots that do support gzip won't cost you as much.
Blocking on gzip is ill-advised unless you want to knock off visitors with old machines that were never upgraded. They are rapidly becoming the minority, but you'll still lose legit visitors.
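Given that caveat, the gzip check works better as a score contribution than a hard block. A minimal sketch (the Accept-Encoding parsing here is simplified; a strict parser would handle q-values):

```python
# Hedged sketch: note whether the client declares gzip support.
# Absence is only a weak bot signal, since old browsers and some
# proxies also omit it - score it, don't block on it alone.

def declares_gzip(headers: dict) -> bool:
    encodings = headers.get("accept-encoding", "")
    return "gzip" in encodings.lower()
```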
I do block based on some invalid Accept and Connection headers, but typically I profile the visitor in real time and block bot-like behavior, such as when the visitor:
- Falls into my spider trap
- Uses invalid SGML/HTML in URIs
- Trips speed traps that stop accesses too fast for a human
(make sure you disable pre-fetch in .htaccess first)
- Asks for the robots.txt file
- Doesn't ask for CSS, .js, .gif and .jpg files
- and so on and so forth
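A hedged sketch of that kind of real-time profiling; every threshold, weight, and path (including the `/spider-trap` URL) is an illustrative assumption, not incrediBILL's actual system:

```python
import time
from collections import deque

# Illustrative thresholds - real values would come from your own logs.
WINDOW_SECS = 10
MAX_HITS_IN_WINDOW = 5  # faster than a human reading pages

class VisitorProfile:
    def __init__(self):
        self.page_hits = deque()      # timestamps of HTML page requests
        self.asked_robots = False
        self.hit_spider_trap = False
        self.fetched_assets = False   # ever requested .css/.js/images?

    def record(self, path, now=None):
        now = time.time() if now is None else now
        if path == "/robots.txt":
            self.asked_robots = True
        elif path.endswith((".css", ".js", ".gif", ".jpg")):
            self.fetched_assets = True
        elif path == "/spider-trap":  # hypothetical trap URL
            self.hit_spider_trap = True
        else:
            self.page_hits.append(now)
            # keep only hits inside the sliding window
            while self.page_hits and now - self.page_hits[0] > WINDOW_SECS:
                self.page_hits.popleft()

    def looks_like_bot(self):
        too_fast = len(self.page_hits) > MAX_HITS_IN_WINDOW
        no_assets = len(self.page_hits) > 3 and not self.fetched_assets
        return self.hit_spider_trap or too_fast or (self.asked_robots and no_assets)
```

Usage: call `record()` on every request for an IP, and check `looks_like_bot()` before serving; pages without asset requests and bursts of fast page hits both raise the flag.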
FWIW, some of the bots think they're clever and ask for all the files on the first page they access, then take off screaming through the site and get immediately busted for speed.
Others are a bit smarter: they go slower and ask for hundreds of pages over several days.
However, presenting a human test after certain odd behavior happens, such as when the average number of page views is skewed, has so far been the only clear-cut way to block the stealth bots.
Hi incrediBILL, I'm interested in your technique. Do you ban the IPs in real time or after analyzing the logs? What kind of software are you using?
As for incrediBILL he bans in real time, not after analyzing the log files. I figured I would reply for Bill to save him the time.
OK, what software or scripts is he using? I want one too.
[edited by: Eric at 1:48 am (utc) on Mar. 17, 2008]
You can try AlexK's speed trap
I think the final version is posted here, but his host seems to be offline at the moment:
It was there a few days ago; it should come back.
Make sure you have java on for best effect (cough, cough, gag, gag)
Thanks, I researched this idea and put it in my ASP code.
And I also return an HTTP/1.0 503 Service Unavailable error.
My question is: if, for example, Googlebot fell into this trap, would it delete the pages already in its database? Or would it keep the old version and try to fetch a new one next time?
(I guess they will keep the old one until they get another HTTP 200. What do you think, or is there any official statement?)
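For what it's worth, a 503 is defined as a temporary condition, and search engines generally treat it as "come back later" rather than dropping the page right away (exact behavior and timing vary by engine and aren't guaranteed). A minimal sketch of a deny response that makes the temporary nature explicit via Retry-After:

```python
# Hedged sketch: build a 503 response with Retry-After so crawlers
# understand the block is temporary. The 3600-second value is an
# illustrative choice, not a required one.

def deny_response():
    status = "503 Service Unavailable"
    headers = [("Retry-After", "3600"), ("Content-Type", "text/plain")]
    body = b"Service temporarily unavailable"
    return status, headers, body
```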
And guys, do me a favour: check your logs and tell me the interval between each request for an HTML file, and how many total pages you have on your website.
Here is my data: about 1000 web pages, with about a 12-second interval between each request from Google.
Google, Yahoo, MSN, Ask, etc. don't trip my speed traps because I do round-trip DNS to validate the bots and let them in, no questions asked. Anything that's not one of my whitelisted bots is subject to speed traps and a whole lot more.
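The round-trip DNS check works like this: reverse-resolve the claimed crawler IP, confirm the hostname belongs to a known engine's domain, then forward-resolve that hostname and make sure it maps back to the same IP. A sketch, with the resolvers injectable so it can be tested offline; the domain suffix list is illustrative, not exhaustive:

```python
import socket

# Illustrative whitelist of crawler hostname suffixes - verify the
# current official suffixes for each engine before relying on this.
GOOD_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def verify_crawler(ip, reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    try:
        host = reverse(ip)[0]           # reverse DNS: IP -> hostname
    except socket.herror:
        return False
    if not host.endswith(GOOD_SUFFIXES):
        return False                    # hostname not in a trusted domain
    try:
        return ip in forward(host)[2]   # forward DNS must match the IP
    except socket.gaierror:
        return False
```

This defeats simple User-Agent spoofing: anyone can claim to be Googlebot, but only Google controls the reverse DNS for its crawler IPs.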
Basically my security goes like this:
1. search engines get a free pass into the site, no further scrutiny
2. anything that's not MSIE/Firefox/Opera gets bounced; almost all other unwanted junk bounces here
3. anything claiming to be MSIE/Firefox/Opera gets further scrutiny, such as speed traps, spider traps, bad headers, bad SGML/HTML, etc.
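Those three tiers can be sketched as a simple dispatcher; the function name and return labels are my own, and the verified-engine flag would come from a round-trip DNS check like the one described above:

```python
# Hedged sketch of the tiered screening: whitelisted engines pass,
# non-browser user agents bounce, claimed browsers get deeper checks.

BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")

def screen_request(headers: dict, verified_engine: bool) -> str:
    if verified_engine:
        return "allow"        # tier 1: whitelisted search engine
    ua = headers.get("user-agent", "")
    if not any(tok in ua for tok in BROWSER_TOKENS):
        return "block"        # tier 2: not claiming to be a browser
    return "scrutinize"       # tier 3: speed traps, header checks, etc.
```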