Banning the Windows 98 user agent

and stopping 60% of scrapers

Hobbs

12:26 pm on Nov 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If by banning the Windows 98 user agent you could stop over 60% of the scrapers on your site but lose less than 5% of your visitors, would you do it?

This user agent is driving me nuts:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

Really at the end of my rope down here, and not technical enough to implement a bot detection & banning script.

wilderness

8:17 am on Nov 3, 2007 (gmt 0)

Only you can determine what's in the best interests of your website(s).

There was a recent thread on this as well:
[webmasterworld.com...]
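
If you do decide a blunt UA ban is acceptable, it needs no script at all. A minimal .htaccess sketch (assuming Apache with mod_setenvif available; the pattern targets the exact string quoted above):

```apache
# Tag any request whose User-Agent ends with "Windows 98)"
# NOTE: this catches every genuine Win98/IE6 visitor as well
SetEnvIfNoCase User-Agent "Windows 98\)$" bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

The trade-off is exactly the one you describe: the remaining genuine Windows 98 visitors get blocked along with the scrapers.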

DamonHD

10:01 am on Nov 3, 2007 (gmt 0)

Hobbs,

I'd say that the UA is so easily forged that UA screening can only be one of a number of simultaneous approaches, alongside IP-based screening and behaviour-based blocking.

(I've written up what I *do* do elsewhere for you.)

Rgds

Damon

Hobbs

11:41 am on Nov 3, 2007 (gmt 0)

My kingdom for an easy-to-install/configure/operate/monitor "behaviour-based blocking" script. There are many floating around, but ZERO that someone like me can install.

A 'bot blocking for dummies' eBook would make a million ;-)

londrum

4:57 pm on Nov 3, 2007 (gmt 0)

there's one in the PHP forum on this site that is pretty easy to set up - it's in the library section, called 'Blocking Badly Behaved Bots #3'

all you really need to do is include a file at the top and bottom of your page and it will block stuff based on their behaviour (the speed at which they download stuff, the number of pages they access per minute, etc.), exactly like you want.

Hobbs

5:09 pm on Nov 3, 2007 (gmt 0)

londrum is talking about this thread:
[webmasterworld.com...]

yeah, I have it flagged already. Do you have any idea how to install it in plain vanilla HTML pages?

jdMorgan

6:27 pm on Nov 3, 2007 (gmt 0)

You could also block that MSIE user-agent string and others that are similarly spoofed/invalid using mod_rewrite in .htaccess. Something like:

RewriteCond %{HTTP_USER_AGENT} MSIE.+Windows [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/4\.[0-9]+\ \(compatible;\ MSIE\ [3-9]\.[0-9]{1,2}(;\ [^;]+)*;\ Windows\ (NT\ (4\.0|5\.(01?|1|2)|6\.0)|98;\ Win\ 9x\ 4\.90|98|95)(;\ [^;]+)*\)
# Following line allows the screwed-up syntax "MSN 9.0;MSN 9.1" user-agent
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/4\.[0-9]+\ \(compatible;\ MSIE\ [3-9]\.[0-9]{1,2}(;\ [^;]+)*;\ Windows\ (NT\ (4\.0|5\.(01?|1|2)|6\.0)|98;\ Win\ 9x\ 4\.90|98|95)(;\ [^;]+)*;\ +MSN\ 9\.0;MSN\ 9\.1(;\ [^;]+)*\)
RewriteRule .* - [F]

Above code was taken from a live server, but modified/simplified for this specific problem.

Jim

[edited by: jdMorgan at 6:45 pm (utc) on Nov. 3, 2007]

londrum

7:39 pm on Nov 3, 2007 (gmt 0)

do you have any idea how to install it in plain vanilla html pages?

you'd have to set your server up to parse pages with an .html extension as php.
then you can just include a php code block at the top and bottom.

if you've got access to your .htaccess file then i think you can just add one simple line to it... but i don't use it myself, so i don't know what it is! maybe someone else will chime in with it.

volatilegx

12:22 am on Nov 4, 2007 (gmt 0)

AddType application/x-httpd-php .php .htm .html

Add the above line to an .htaccess file and upload it to your directory. It will make it so *.htm, *.html and *.php are all parsed for PHP.

Hobbs

11:49 am on Nov 4, 2007 (gmt 0)

Great.
Anyone have a good link that details how to structure the .htaccess file and add things like RewriteCond directives without conflicting with what's below?
(I can't make any sense of the Apache documentation.)

This is what I currently have:

AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/home/dir/public_html/botrdns.php"

SetEnvIfNoCase User-Agent "somebotUA" bad_bot
SetEnvIfNoCase User-Agent "someotherbotUA" bad_bot

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from aaa.bbb.ccc.ddd

Hobbs

12:14 pm on Nov 4, 2007 (gmt 0)

This is the Apache tutorial, which doesn't explain how to mix and structure directives well, and Googling it was no help:

Apache Tutorial: .htaccess files
[httpd.apache.org...]

vincevincevince

1:37 pm on Nov 4, 2007 (gmt 0)

Hobbs, what you have now seems fine. In general, structure isn't awfully important, as few things are order-dependent, other than multiple directives within the same module.
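
In other words, a skeleton like this (using only directives already posted in this thread) behaves the same however the three groups are ordered, since each group belongs to a different module:

```apache
# PHP handling (mod_mime)
AddType application/x-httpd-php .html .htm .txt

# UA tagging (mod_setenvif)
SetEnvIfNoCase User-Agent "somebotUA" bad_bot

# Access control (mod_access)
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```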

wilderness

5:27 pm on Nov 4, 2007 (gmt 0)

Hobbs, what you have now seems fine. In general, structure isn't awfully important, as few things are order-dependent, other than multiple directives within the same module.

Actually, some method and consistency (such as remark lines, which I personally don't use) are quite useful, especially in the event that you're required to pore over line after line of an .htaccess file searching for a syntax error, because an addition has created a 500 error taking down your entire website(s).

AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/home/dir/public_html/botrdns.php"

SetEnvIfNoCase User-Agent "somebotUA" bad_bot
SetEnvIfNoCase User-Agent "someotherbotUA" bad_bot

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from aaa.bbb.ccc.ddd

These files work in a variety of fashions (similar in that manner to the HTML behind web pages).

A more organized arrangement is the following:

AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/home/dir/public_html/botrdns.php"

<Limit GET POST>
SetEnvIfNoCase User-Agent "somebotUA" bad_bot
SetEnvIfNoCase User-Agent "someotherbotUA" bad_bot
Order Allow,Deny
deny from aaa.bbb.ccc.ddd
Allow from all
Deny from env=bad_bot
</Limit>

Perhaps another member may provide a more effective positioning of the following lines? As I don't use these files myself, it appears to be both duplication and conflict; however, I could be mistaken:

<Files 403.shtml>
order allow,deny
allow from all
</Files>

Don

keyplyr

7:59 am on Nov 21, 2007 (gmt 0)

I wasn't sure whether I had many users running Win98, but I did have a few pests spoofing it, so I tried:

RewriteCond %{HTTP_USER_AGENT} 98\)$
RewriteRule .* - [F]

Checked the logs several hours later and found I'd blocked a couple dozen legit users (USA, Brazil, Mexico...). Apparently my sites attract the antiquated.

Hobbs

11:25 am on Nov 25, 2007 (gmt 0)

Here's how I finally banned Windows 98:

SetEnvIfNoCase User-Agent "Windows\ 98\)$" bad_bot
SetEnvIfNoCase User-Agent "win98" bad_bot

jdMorgan,
Thanks for the code, I inserted it in a test site and it is working fine so far.

Can you blend SetEnvIfNoCase Remote_Host and User-Agent in one line?

Say, for example, I want to block only the "validexample" user agent from IP 1.2.3.4, which in this case is a proxy sending many visitors, but I need to block only the one that presents that user agent.

wilderness

8:08 pm on Nov 25, 2007 (gmt 0)

It cannot be accomplished in a single line (at least not that I'm aware of); however, I've provided an example in the following:

[webmasterworld.com...]

Hobbs

7:34 am on Nov 26, 2007 (gmt 0)

Thanks wilderness
from that link:

RewriteCond %{HTTP_USER_AGENT} validUA
RewriteCond %{REMOTE_ADDR} ^123\.456\.789\.
RewriteRule .* - [F]

I am not sure if that would work in multiples, i.e. blocking multiple valid UAs from multiple proxies; that's why I was hoping for a line of SetEnvIfNoCase for each one, e.g.

SetEnvIfNoCase User-Agent "UA1" and Remote_Host "1.2.3.4" bad_bot
SetEnvIfNoCase User-Agent "UA2" and Remote_Host "5.6.7.7" bad_bot

wilderness

9:35 am on Nov 26, 2007 (gmt 0)

Multiple UAs and multiple IPs:

RewriteCond %{HTTP_USER_AGENT} (validUA1|validUA2)
RewriteCond %{REMOTE_ADDR} ^123\.456\.789\. [OR]
RewriteCond %{REMOTE_ADDR} ^234\.567\.891\.
RewriteRule .* - [F]

You may also use begins-with, ends-with, or contains as options for your UA keyword; however, I would not suggest attempting to mix begins-with, ends-with, and contains in the same criteria line.
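
For reference, the three matching styles look like this ("validUA" is a placeholder); they are alternatives, so use one per condition rather than stacking them:

```apache
# UA begins with validUA
RewriteCond %{HTTP_USER_AGENT} ^validUA
# UA ends with validUA
RewriteCond %{HTTP_USER_AGENT} validUA$
# UA contains validUA
RewriteCond %{HTTP_USER_AGENT} validUA
```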

Hobbs

9:50 am on Nov 26, 2007 (gmt 0)

Hi,
Correct me if I am wrong: you are banning both user agents from both IPs. How do you ban only validUA1 from IP1 and only validUA2 from IP2?

wilderness

9:56 am on Nov 26, 2007 (gmt 0)

You would be denying either UA from either IP range.

Your previous inquiry:

I am not sure if that would work in multiples, i.e. blocking multiple valid UA's from multiple proxies

In the event that you desire single entries (one per IP and UA pair), then just use the example you copied from the aforementioned link.

Edited: BTW, the benefit of the multiple-IP-range approach is that you have the capability of adding as many ranges as you desire.
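
If you do need pair-specific blocking (validUA1 only from the first range, validUA2 only from the second), a sketch would be one rule block per pair, since consecutive RewriteCond lines are ANDed by default:

```apache
# Deny validUA1, but only from 123.456.789.*
RewriteCond %{HTTP_USER_AGENT} validUA1
RewriteCond %{REMOTE_ADDR} ^123\.456\.789\.
RewriteRule .* - [F]

# Deny validUA2, but only from 234.567.891.*
RewriteCond %{HTTP_USER_AGENT} validUA2
RewriteCond %{REMOTE_ADDR} ^234\.567\.891\.
RewriteRule .* - [F]
```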