Forum Moderators: open
64.231.231.50 - Missauga Locate 1.0.0
24.162.46.10 - Mac Finder 1.0.34
24.27.87.211 - Industry Program 1.0.4
66.118.180.56 - Industry Program 1.0.5
64.169.241.214 - Program Shareware 1.0.2
68.96.97.151 - Program Shareware 1.0.1
67.112.140.73 - Program Shareware 1.0.2
Does anyone have example RewriteCond code which will deny access if all of the following are blank? Thanks.
HTTP_REFERER
HTTP_X_FORWARDED_FOR
HTTP_VIA
I'm new here and have come across a number of threads about "banning" certain UA from your site.
Why are you doing this? Have those you list been "bad" in anyway?
If they are being bad I can understand somebody wishing to ban a robot that has not been seen before, but as I said in another thread, there was a day when you had not heard of GoogleBot either...
Just be careful you don't go and ban the "next big thing"...!
Not sure what the other two are Kerrin?
At least they are not included in my NCSA logs, nor can I ever recall seeing them mentioned.
Here's one for a blank user-agent; the same pattern should work for the referrer, and I suppose if the other two exist you may be able to apply it to those as well.
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^.*$ - [F]
This also catches the blank UAs that have the hyphen included.
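For context, a minimal .htaccess using this test might look like the following (a sketch, assuming mod_rewrite is available; RewriteEngine must be switched on before any rules):

```apache
# Minimal sketch: enable the rewrite engine, then return
# 403 Forbidden for any request whose User-Agent header is
# empty or consists of just a hyphen.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^.*$ - [F]
```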
Don
If you read some of the posts here regarding banning, you will see that Don knows his stuff- his suggestion is right on, for what it does. However, Don is a bit more "draconian" than I am, and I temper his bans a little bit for my site. (Don knows this is not a slam- we just have different goals in banning, and thus we take different approaches!)
Rather than ban ALL blank UAs, I prefer to ban blank UAs that also have blank referrers... I think this is a bit "safer".
You can do that like this:
RewriteRule /robots\.txt$ - [NC,L]
RewriteCond %{HTTP_REFERER} ^-?$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [NC]
RewriteRule .* - [F,L]
RewriteCond %{HTTP_REFERER} NULL [NC]
RewriteCond %{HTTP_USER_AGENT} NULL [NC]
RewriteRule .* - [F,L]
The first line makes sure you do NOT ban ANY requests to robots.txt. The second and third lines set up the conditions for the ban on line four, and lines five and six set up the conditions for the ban on line seven.
To ban the UAs you mentioned above, just do this:
RewriteCond %{HTTP_USER_AGENT} ^Missauga[NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Industry\ Program [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Program\ Shareware[NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mac\ Finder [NC]
RewriteRule ^/.* http://unused.ip.add.ress [R=301,L]
Which rewrites to a black hole, or change the last line to:
RewriteRule ^/.* - [F]
To 403 them.
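A non-mod_rewrite alternative for the same UA list uses mod_setenvif with mod_access (a sketch; whether these modules are compiled in depends on your Apache build):

```apache
# Flag the unwanted user-agents, then deny any flagged request.
SetEnvIfNoCase User-Agent "^Missauga"          bad_bot
SetEnvIfNoCase User-Agent "^Industry Program"  bad_bot
SetEnvIfNoCase User-Agent "^Program Shareware" bad_bot
SetEnvIfNoCase User-Agent "^Mac Finder"        bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```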
Note that this forum removes spaces... you NEED a space between the UA name and the "[NC" or between the "-" and the "[F]"
Good Luck!
dave
Welcome to webmasterworld! The only spiders I allow to crawl my servers are ones from search engines:
googlebot (google.com)
fast (alltheweb.com)
slurp (inktomi.com)
Ask Jeeves (askjeeves.com/teoma.com)
Scooter (altavista.com)
All other crawlers are not welcome and are blocked using robots.txt [searchengineworld.com]. The problem with robots.txt is that spam crawlers, e-mail harvesters, trademark-protection bots, and site scrapers usually do not respect robots.txt.
The User Agents I listed above ignored robots.txt and tried to crawl my sites anyway (they didn't get very far, though). They are likely to be e-mail harvesters.
wilderness, those environment variables are useful in identifying the original ip from users behind a proxy:
HTTP_X_FORWARDED_FOR = If a proxy is used this lists real IP address of user
HTTP_VIA = If a proxy is used this lists the name of the proxy used
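If you want to watch these headers before acting on them, mod_log_config can record them alongside the usual fields (the nickname "proxylog" here is just an example):

```apache
# Extended log format: the standard combined-log fields, plus
# the X-Forwarded-For and Via request headers, so you can see
# which requests leave them blank.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{X-Forwarded-For}i\" \"%{Via}i\"" proxylog
CustomLog logs/proxylog_log proxylog
```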
What I have found is that if all variables are blank other than IP & UA then 99% of the time it's an automated crawler of some kind.
Using plain English, I would like to do the following using RewriteEngine:
IF HTTP_REFERER is blank AND HTTP_X_FORWARDED_FOR is blank AND HTTP_VIA is blank ANDNOT (googlebot OR fast OR slurp OR Ask Jeeves OR Scooter) THEN Deny Access
And the following if the UA is blank as well (i.e. only the IP address is available to be logged):
IF HTTP_USER_AGENT is blank AND HTTP_REFERER is blank AND HTTP_X_FORWARDED_FOR is blank AND HTTP_VIA is blank THEN Deny Access
Any ideas or pointers to RewriteEngine tutorials which cover this? Thanks.
Kerrin,
Perhaps when Jim comes along he may assist.
I'm thinking that a list of conditions would be traversed in the order they are listed in .htaccess (just as long as the end statement [which I believe is "L"] is NOT used).
The downside is that it would take out all three exceptions individually as well :(
Jim provided an example in the faked google thread of having data in htaccess meet two requirements.
[webmasterworld.com...]
Those rules (referrer and user agent) would only be limited by whatever module you are working with. As a result they could be changed to any of the 5-6 names allowed as fields.
Getting the conditions to meet three criteria?
Don
Why are good bots other than the search engines you list not welcome? This is the part I don't understand.
Of course it's your site and you can allow/disallow who you please, but the point I'm making is that if a bot is following robots.txt and all the unwritten rules of being a good bot that go with it (such as not hitting a site every millisecond), then why ban it?
There was a day when Googlebot crawled out of a nondescript IP address at Stanford University, and you had never heard of Larry Page.
Are you happy with the possibility that you may be excluding your website from being a part of the "next big thing" in Internet search?
Google won't last forever, and for that reason alone I'm happy to let research bots onto my sites.
Cheers.
HTTP_X_FORWARDED_FOR
HTTP_VIA
I don't see these two variables listed in mod_rewrite's environment variable list. Therefore, you'd have to use a special form of RewriteCond to test them. I asked for a second opinion via sticky, but no response so far.
>> IF HTTP_REFERER is blank AND HTTP_X_FORWARDED_FOR is blank AND HTTP_VIA is blank ANDNOT (googlebot OR fast OR slurp OR Ask Jeeves OR Scooter) THEN Deny Access
If I'm right about the special RewriteConds, the basic format of it would be:
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:X-Forwarded-For} ^$
RewriteCond %{HTTP:Via} ^$
RewriteCond %{HTTP_USER_AGENT} !^Googlebot
RewriteCond %{HTTP_USER_AGENT} !^FAST
RewriteCond %{HTTP_USER_AGENT} !^Slurp
RewriteCond %{HTTP_USER_AGENT} !^Ask\ Jeeves
RewriteCond %{HTTP_USER_AGENT} !^Scooter
RewriteRule .* - [F]
You can combine all those user-agents in one line if you want; I'm just showing them separately for ease of checking.
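Your second rule (deny when the UA is blank too, with no search-engine exception) would follow the same pattern; an untested sketch, again using the %{HTTP:header} lookup for the proxy headers:

```apache
# Deny when user-agent, referrer, and both proxy headers are all empty.
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:X-Forwarded-For} ^$
RewriteCond %{HTTP:Via} ^$
RewriteRule .* - [F]
```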
Reference: Introduction to mod_rewrite [webmasterworld.com]
Jim
dmorison, it's all to do with running a tight ship and saving bandwidth/server resources. I don't like blocking smaller SEs or research bots but I won't get much traffic from them. If I let Google or another major SE crawl 100,000 pages, it's a worthwhile expense because I can expect to see thousands of visitors in return.
Also, I'm culturally obliged to be tight with my money: I'm Scottish ;)