More to add to your ban list

and a question for RewriteEngine experts

Kerrin

4:59 am on Apr 12, 2003 (gmt 0)

10+ Year Member



I've seen these over the past week:

64.231.231.50 - Missauga Locate 1.0.0
24.162.46.10 - Mac Finder 1.0.34
24.27.87.211 - Industry Program 1.0.4
66.118.180.56 - Industry Program 1.0.5
64.169.241.214 - Program Shareware 1.0.2
68.96.97.151 - Program Shareware 1.0.1
67.112.140.73 - Program Shareware 1.0.2

Does anyone have example RewriteCond code which will deny access if all of the following are blank? Thanks.

HTTP_REFERER
HTTP_X_FORWARDED_FOR
HTTP_VIA

dmorison

5:31 am on Apr 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Kerrin,

I'm new here and have come across a number of threads about "banning" certain UAs from your site.

Why are you doing this? Have those you list been "bad" in any way?

If they are being bad I can understand somebody wishing to ban a robot that has not been seen before; but as I said in another thread, there was a day when you had not heard of GoogleBot either...

Just be careful you don't go and ban the "next big thing"...!

wilderness

12:58 pm on Apr 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<"all" of the following</snip>

Not sure what the other two are Kerrin?
At least they are not included in my NCSA logs or can I ever recall seeing them nentioned.

Here's for the referrer. I suppose if the other two exist you may be able to apply those as well.

RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^.*$ - [F]

This also catches the blank UAs that have the hyphen included.
Don

carfac

5:28 pm on Apr 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi:

If you read some of the posts here regarding banning, you will see that Don knows his stuff; his suggestion is right on, for what it does. However, Don is a bit more "draconian" than I am, and I temper his bans a little for my site. (Don knows this is not a slam; we just have different goals in banning, and thus we take different approaches!)

Rather than ban ALL blank UAs, I prefer to ban blank UAs that also have blank referrers... I think this is a bit "safer".

You can do that like this:

RewriteRule /robots\.txt$ - [NC,L]
RewriteCond %{HTTP_REFERER} ^-?$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [NC]
RewriteRule .* - [F,L]
RewriteCond %{HTTP_REFERER} NULL [NC]
RewriteCond %{HTTP_USER_AGENT} NULL [NC]
RewriteRule .* - [F,L]

The first line makes sure you do NOT ban ANY requests to robots.txt. The second and third lines set up the conditions for the ban on the fourth line, and the fifth and sixth lines set up the ban on the seventh.

To ban the UAs you mentioned above, just do this:

RewriteCond %{HTTP_USER_AGENT} ^Missauga [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Industry\ Program [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Program\ Shareware [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mac\ Finder [NC]
RewriteRule ^/.* http://unused.ip.add.ress [R=301,L]

This rewrites them to a black hole; or change the last line to:

RewriteRule ^/.* - [F]

To 403 them.

Note that this forum removes spaces... you NEED a space between the UA name and the "[NC" or between the "-" and the "[F]"

Good Luck!

dave

Kerrin

6:29 pm on Apr 12, 2003 (gmt 0)

10+ Year Member



Hi dmorison,

Welcome to WebmasterWorld! The only spiders I allow to crawl my servers are the ones from search engines:

googlebot (google.com)
fast (alltheweb.com)
slurp (inktomi.com)
Ask Jeeves (askjeeves.com/teoma.com)
Scooter (altavista.com)

All other crawlers are not welcome and are blocked using robots.txt [searchengineworld.com]. The problem with robots.txt is that spam crawlers, e-mail harvesters, trademark-protection bots, and site scrapers usually do not respect it.
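For reference, a minimal robots.txt along those lines might look like the sketch below. The user-agent tokens are my best guesses for the crawlers listed and should be checked against each engine's own documentation:

# Records for the welcome crawlers: an empty Disallow permits everything.
User-agent: Googlebot
User-agent: FAST-WebCrawler
User-agent: Slurp
User-agent: Teoma
User-agent: Scooter
Disallow:

# Everyone else is asked to stay out entirely.
User-agent: *
Disallow: /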

The User-Agents I listed above ignored robots.txt and tried to crawl my sites anyway (they didn't get very far, though). They are likely to be e-mail harvesters.

wilderness, those environment variables are useful in identifying the original IP of users behind a proxy:

HTTP_X_FORWARDED_FOR = if a proxy is used, this lists the real IP address of the user
HTTP_VIA = if a proxy is used, this lists the name of the proxy used
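For example, a request relayed through a proxy might arrive with extra headers along these lines (illustrative values only):

X-Forwarded-For: 203.0.113.27
Via: 1.1 proxy.example.net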

What I have found is that if all variables are blank other than IP & UA then 99% of the time it's an automated crawler of some kind.

Using plain English, I would like to do the following using RewriteEngine:

IF HTTP_REFERER is blank AND HTTP_X_FORWARDED_FOR is blank AND HTTP_VIA is blank AND NOT (googlebot OR fast OR slurp OR Ask Jeeves OR Scooter) THEN Deny Access

And the following if the UA is blank as well (i.e. only the IP address is available to be logged):

IF HTTP_USER_AGENT is blank AND HTTP_REFERER is blank AND HTTP_X_FORWARDED_FOR is blank AND HTTP_VIA is blank THEN Deny Access

Any ideas or pointers to RewriteEngine tutorials which cover this? Thanks.

wilderness

7:39 pm on Apr 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<snip>Any ideas or pointers to RewriteEngine tutorials which cover this?</snip>

Kerrin,
Perhaps when Jim comes along he may assist.

I'm thinking that the list of conditions would be traversed in the order they are listed in htaccess (just as long as the end statement [which I believe is "L"] is NOT used).
The downside is that it would take out all three exceptions individually as well :(

Jim provided an example in the faked google thread of having data in htaccess meet two requirements.
[webmasterworld.com...]

Those rules (referrer and user-agent) would only be limited by whatever module you are working with. As a result they could be changed to any of the 5-6 names allowed as fields.

Getting the conditions to meet three criteria?
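If the conditions are simply ANDed in order (which is my understanding, so long as no [OR] flag is attached), then three criteria would just be three stacked lines above one rule. A generic sketch, with placeholder patterns:

# All three conditions must match before the rule below fires;
# without [OR], consecutive RewriteConds are combined with AND.
RewriteCond %{HTTP_REFERER} pattern-one
RewriteCond %{HTTP_USER_AGENT} pattern-two
RewriteCond %{REMOTE_ADDR} pattern-three
RewriteRule ^.*$ - [F]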

Don

dmorison

11:29 am on Apr 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Kerrin,

Why are good bots other than the search engines you list not welcome? This is the part I don't understand.

Of course it's your site and you can allow / disallow who you please, but the point I'm making is that if a bot is following robots.txt and all the unwritten rules of being a good bot that go with it (such as not hitting a site every millisecond), then why ban it?

There was a day when Googlebot crawled out of a nondescript IP address at Stanford University, and you had never heard of Larry Page.

Are you happy with the possibility that you may be excluding your website from being a part of the "next big thing" in Internet search?

Google won't last forever, and for that reason alone I'm happy to let research bots onto my sites.

Cheers.

jdMorgan

5:06 am on Apr 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Kerrin,

HTTP_X_FORWARDED_FOR
HTTP_VIA

I don't see these two variables listed in mod_rewrite's environment variable list. Therefore, you'd have to use a special form of RewriteCond to test them. I asked for a second opinion via sticky, but no response so far.

>> IF HTTP_REFERER is blank AND HTTP_X_FORWARDED_FOR is blank AND HTTP_VIA is blank ANDNOT (googlebot OR fast OR slurp OR Ask Jeeves OR Scooter) THEN Deny Access

If I'm right about the special RewriteConds, the basic format of it would be:


RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{ENV:HTTP_X_FORWARDED_FOR} ^$
RewriteCond %{ENV:HTTP_VIA} ^$
RewriteCond %{HTTP_USER_AGENT} !^Googlebot
RewriteCond %{HTTP_USER_AGENT} !^FAST
RewriteCond %{HTTP_USER_AGENT} !^Slurp
RewriteCond %{HTTP_USER_AGENT} !^Ask\ Jeeves
RewriteCond %{HTTP_USER_AGENT} !^Scooter
RewriteRule .* - [F]

If those two external environment variables are not defined, I'm not sure what will happen; use at your own risk.
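(An aside: mod_rewrite can also read a raw request header directly with a %{HTTP:Header-Name} lookup, so if the ENV: form always comes up empty, %{HTTP:X-Forwarded-For} and %{HTTP:Via} may be worth trying instead.)

For your second rule, where the user-agent is blank as well, the same approach would presumably be the following; the ^-?$ pattern borrows Don's earlier trick of also catching a literal hyphen:

RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{ENV:HTTP_X_FORWARDED_FOR} ^$
RewriteCond %{ENV:HTTP_VIA} ^$
RewriteRule .* - [F]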

You can combine all those user-agents in one line if you want; I'm just showing them separately for ease of checking.
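For example, the five negated conditions above could collapse to a single alternation (same caveats apply):

RewriteCond %{HTTP_USER_AGENT} !^(Googlebot|FAST|Slurp|Ask\ Jeeves|Scooter)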

Reference: Introduction to mod_rewrite [webmasterworld.com]

Jim

wilderness

6:20 am on Apr 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks Jim.

Jaf

12:16 am on Apr 22, 2003 (gmt 0)

10+ Year Member



See the thread guestbook spammers [webmasterworld.com] for a longer list of agents of this type.

Kerrin

12:59 am on Apr 25, 2003 (gmt 0)

10+ Year Member



Thanks Jim, very much appreciated, I'll play around with the code :)

dmorison, it's all to do with running a tight ship and saving bandwidth/server resources. I don't like blocking smaller SEs or research bots, but I won't get much traffic from them anyway. If I let Google or another major SE crawl 100,000 pages, it's a worthwhile expense because I can expect to see thousands of visitors in return.

Also, I'm culturally obliged to be tight with my money: I'm Scottish ;)