Ban spider by connection: close Header, (idea)


I have an idea for banning spiders by the Connection: close header

Eric

9:43 am on Mar 16, 2008 (gmt 0)

Hi there

These days lots of spiders are crawling the site like crazy. I have an idea, but I am not using it yet because it seems too risky, so I'd like your views.

If the User-Agent claims IE/Firefox/Opera
and the request has Connection: close in its headers
and the web server's default setting is keep-alive,

then I think it's a spider. What do you think?

Ocean10000

3:20 pm on Mar 16, 2008 (gmt 0)

Eric

Connection: Close

is a valid header that can be sent by Firefox/Opera/IE. I usually only see it when the client is going through a proxy server or similar service. So I wouldn't block a client on this alone, but you can use it as a tip-off to do more checks.

A better header to check is the "Accept" header, which the major web browsers send, though again some proxy servers will remove it. But it's easier to check for common proxy server headers. If the Accept header is missing, it's more than likely a bot or a client coming through a proxy server. FYI, mobile browsers are a crapshoot at including the Accept header.

I use the Accept Header method to remove a lot of spoofing bot traffic on my site. It keeps my bandwidth bill light and scrapers moving on to easier targets.
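A rough sketch of the header check described above, in Python. It only flags a request for more checks and never blocks on its own. The proxy header names in PROXY_HEADERS and the function name header_flags are assumptions for illustration, not anything taken from this thread.

```python
# Browser names the User-Agent may claim (from the thread).
BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")
# Assumed examples of headers that proxies commonly add; adjust to taste.
PROXY_HEADERS = ("via", "x-forwarded-for", "forwarded")


def header_flags(headers):
    """Return a list of reasons why this request deserves further checks."""
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "")
    claims_browser = any(tok in ua for tok in BROWSER_TOKENS)
    via_proxy = any(name in h for name in PROXY_HEADERS)

    flags = []
    # Proxies may strip Accept or force Connection: close, so skip them.
    if claims_browser and not via_proxy:
        if "accept" not in h:
            flags.append("missing Accept")
        if h.get("connection", "").strip().lower() == "close":
            flags.append("Connection: close")
    return flags
```

An empty list means nothing looked odd in these two headers. A non-empty list is a tip-off to run other checks, not a reason to ban.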

Ocean.

[edited by: Ocean10000 at 3:22 pm (utc) on Mar. 16, 2008]

Eric

6:41 pm on Mar 16, 2008 (gmt 0)

I agree with you that "Accept" and "Accept-Language" are good headers for detecting spoofing bots.

I have a problem with a spoofed IE7.

Here is the header:

Connection: Close
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Accept-Language: en-us
Host: www.mydomain.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
UA-CPU: x86

Looks pretty legitimate, right? The problem is that my robot trap catches this exact set of headers from lots of different IPs in a short period of time. The strange thing is that no matter where the IP is from, the Accept-Language is always en-us, the Accept is identical (meaning they must all have the same software installed), and the User-Agent is identical too (meaning they even have the same patches)!

Too odd not to think it's a faked IE7. Any ideas?

So I think I will block an IP if:
Connection = Close
and Accept = image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
and Accept-Language = en-us
and User-Agent = Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)

What do you think?
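The rule above could be sketched in Python like this. Rather than blocking on the header values alone, it counts how many distinct IPs fall into the robot trap with an identical set of the four headers. The class name, method name, and the ip_threshold default are assumptions for illustration, and the sketch ignores the "short period of time" part, so a real version would also expire old entries.

```python
from collections import defaultdict

# The four headers compared in the proposed rule.
FINGERPRINT_HEADERS = ("connection", "accept", "accept-language", "user-agent")


def fingerprint(headers):
    """Reduce a request to the tuple of header values the rule compares."""
    h = {k.lower(): v.strip() for k, v in headers.items()}
    return tuple(h.get(name, "") for name in FINGERPRINT_HEADERS)


class FingerprintTracker:
    def __init__(self, ip_threshold=20):  # assumed value, tune per site
        self.ip_threshold = ip_threshold
        self.ips_by_fingerprint = defaultdict(set)

    def record_trap_hit(self, ip, headers):
        """Call when a request falls into the robot trap.

        Returns True once enough distinct IPs share this fingerprint.
        """
        ips = self.ips_by_fingerprint[fingerprint(headers)]
        ips.add(ip)
        return len(ips) >= self.ip_threshold
```

Only trap hits are counted because a real IE7 user on the same patch level would send the same four headers, so the header values alone cannot tell the two apart.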

wilderness

7:05 pm on Mar 16, 2008 (gmt 0)

.NET CLR 1.1.4322

The majority of these have been superseded by two additional updates.

.NET CLR 2.0.50727; .NET CLR 3.0.04506.30

Lord Majestic

7:53 pm on Mar 16, 2008 (gmt 0)

Check for declared support for gzip. All modern browsers support it, so this would cut down your bandwidth costs anyway, and anything that does not support gzip is highly likely to be a bot. Those well-written bots that do support gzip won't cost you as much.
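A minimal Python sketch of checking for declared gzip support by reading the Accept-Encoding request header. The function name accepts_gzip is made up for illustration. The result is best treated as one signal among several.

```python
def accepts_gzip(headers):
    """Return True if the Accept-Encoding header declares gzip support."""
    h = {k.lower(): v for k, v in headers.items()}
    for part in h.get("accept-encoding", "").split(","):
        coding, _, param = part.partition(";")
        if coding.strip().lower() not in ("gzip", "x-gzip"):
            continue
        # An explicit quality value of 0 means "not acceptable".
        name, _, value = param.partition("=")
        if name.strip().lower() == "q":
            try:
                return float(value) > 0
            except ValueError:
                return False
        return True
    return False
```

For example, accepts_gzip({"Accept-Encoding": "gzip, deflate"}) is True, while a request with no Accept-Encoding header at all gives False.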

incrediBILL

9:37 pm on Mar 16, 2008 (gmt 0)

Blocking on gzip is ill advised unless you want to just knock off visitors with old machines that never upgraded. They are rapidly becoming the minority but you'll still lose legit visitors.

I do block based on some invalid accept and connection headers but typically I profile the visitor in real-time and block bot-like behavior such as:

- Falls into my spider trap
- Uses invalid SGML/HTML in URIs
- Trips speed traps that stop accesses too fast for a human
(make sure you disable pre-fetch in htaccess first)
- Asks for the robots.txt file
- Doesn't ask for CSS, .js, .gif and .jpg files
- and so on and so forth.

FWIW, some of the bots think they're clever and ask for all the files on the first page they access then take off screaming through the site and get immediately busted for speed.
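A speed trap of the kind mentioned above could look something like this in Python: a sliding window of recent page requests per IP. This is only a sketch and not the actual code used by anyone in this thread. The class name and the max_hits and window_seconds defaults are assumptions.

```python
import time
from collections import defaultdict, deque


class SpeedTrap:
    def __init__(self, max_hits=10, window_seconds=5.0):  # assumed limits
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent hits

    def too_fast(self, ip, now=None):
        """Record one page request and report whether the IP is over the limit."""
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_hits
```

Call too_fast() only for page requests, not for images, CSS or scripts, otherwise a normal browser loading one page with many files would trip it.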

Others are a bit smarter and go a bit slower and ask for hundreds of pages over days.

However, testing for a human at the controls once certain odd behavior occurs, such as a skewed average number of page views, has so far been the only clear-cut way to block the stealth bots.

I don't use the squiggly captchas, I use a combination of real text (for handicapped accessibility) plus javascript tests for actual typing to thwart blow-thru techniques and so far it seems to keep the bots out quite nicely and humans rarely have an issue.

Lord Majestic

9:53 pm on Mar 16, 2008 (gmt 0)

Support for GZIP was working in Netscape 4 (that's like 10 years old?) and IE 4, Opera 4+. That's like what, 99% of visitors? Put out captcha for the rest with a link to upgrade their browser or maybe some funny javascript redirect that would cut off anything that does not support it.

Eric

10:12 pm on Mar 16, 2008 (gmt 0)

Hi incrediBILL, I'm interested in your technique. Do you ban the IPs in real time or after analysing the logs? What type of software are you using?

Ocean10000

10:53 pm on Mar 16, 2008 (gmt 0)

As for incrediBILL he bans in real time, not after analyzing the log files. I figured I would reply for Bill to save him the time.

Eric

1:47 am on Mar 17, 2008 (gmt 0)

OK, what software or scripts is he using? I want one too.

[edited by: Eric at 1:48 am (utc) on Mar. 17, 2008]

incrediBILL

2:19 am on Mar 17, 2008 (gmt 0)

You can try AlexK's speed trap
[webmasterworld.com...]

I think the final version is posted here but his host seems to be offline at the moment:
[modem-help.freeserve.co.uk...]

It was there a few days ago, should come back.

wilderness

2:22 am on Mar 17, 2008 (gmt 0)

[web.archive.org...]

Make sure you have java on for best effect (cough, cough, gag, gag)

Eric

3:45 am on Mar 17, 2008 (gmt 0)

Thanks, I researched this idea and put it in my ASP code.
I also return an HTTP/1.0 503 Service Unavailable error.

My question is: if Googlebot, for example, falls into this trap, will it delete the page that is already in their database? Or will they keep the old version and try to get the new stuff next time?

(I guess they will keep the old one until another HTTP 200. What's your view, or is there any official statement?)

Eric

4:05 am on Mar 17, 2008 (gmt 0)

And guys, do me a favour: check your logs and tell me the interval between requests for HTML files, and how many total pages you have on your website.

Here is my data: about 1000 web pages, and about a 12-second interval between each request from Google.

incrediBILL

9:04 am on Mar 17, 2008 (gmt 0)

Google, Yahoo, MSN, Ask, etc. don't trip my speed traps because I do round trip DNS to validate the bots and let them in no questions asked. Anything that's not one of my whitelisted bots is subject to speed traps and a whole lot more.
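For anyone unfamiliar with round trip DNS, here is a hedged Python sketch of the idea: look up the hostname for the IP, check that it belongs to a search engine domain, then look the hostname up again and confirm it maps back to the same IP. The suffix list is an assumption and should be checked against each search engine's own documentation. The function name and the reverse/forward parameters are made up so the lookups can be swapped out for testing. gethostbyname_ex only handles IPv4.

```python
import socket

# Assumed crawler hostname suffixes; verify against each engine's docs.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com",
                    ".search.msn.com", ".crawl.yahoo.net")


def is_verified_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Round trip DNS check: IP -> hostname -> IPs must include the IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS record
    if not hostname.lower().endswith(suffixes):
        return False  # hostname is not in a trusted domain
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False  # hostname does not resolve
    # Anyone can fake reverse DNS for their own IP range, so the
    # forward lookup has to point back at the same address.
    return ip in addresses
```

DNS lookups are slow, so in practice the result would be cached per IP rather than looked up on every request.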

Basically my security goes like this:

1. search engines get a free pass into the site, no further scrutiny

2. anything that's not MSIE/Firefox/Opera gets bounced, almost all other unwanted junk bounces here

3. anything claiming to be MSIE/Firefox/Opera gets further scrutiny, such as speed traps, spider traps, bad headers, bad SGML/HTML, etc.
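The three steps above can be sketched as one small Python function. The function name, its return values, and the is_verified_search_engine flag (the result of whatever whitelist check is in place, such as round trip DNS) are assumptions for illustration only.

```python
BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")


def classify(user_agent, is_verified_search_engine):
    """Return 'allow', 'bounce' or 'scrutinize' for one request."""
    # 1. Whitelisted search engines get a free pass.
    if is_verified_search_engine:
        return "allow"
    # 2. Anything not claiming to be a major browser gets bounced.
    if not any(token in user_agent for token in BROWSER_TOKENS):
        return "bounce"
    # 3. Claimed browsers go on to speed traps, spider traps, header checks.
    return "scrutinize"
```

Note that a request calling itself Googlebot without passing the whitelist check ends up in step 2 and gets bounced, which is the point of validating the bots first.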