Forum Moderators: phranque


Refining .htaccess file using mod-rewrite


sgibb

6:29 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



I wanted my .htaccess file to start with a whitelist and have adjusted it to use largely mod-rewrite. The whitelist by itself works fine and my access log indicates 200s when these sites come calling.

RewriteCond %{HTTP_USER_AGENT} !site1 [NC]
RewriteCond %{HTTP_USER_AGENT} !site2 [NC]
RewriteRule !^(\favicon\.ico|403\.htm|robots\.txt) - [F]

However if I add
RewriteCond %{HTTP_USER_AGENT} site1 [NC]
RewriteCond %{REMOTE_ADDR} !^#*\.#*\.#*\.#*
RewriteRule !^(\favicon\.ico|403\.htm|robots\.txt) - [F]

all of a sudden that site is getting 403'd when it asks for the same file.

What am I doing wrong?

jdMorgan

7:05 pm on Nov 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> What am I doing wrong?

That would rather depend on what you are trying to do...

It's not clear from the code because, for example, in your first ruleset you appear to be checking the requesting HTTP_USER_AGENT for a negative match against "site1". This implies that you wish to reject any browser or robot not named "site1" or "site2" (if it asks for a file not on your list).

Normally, one checks the HTTP_REFERER or REMOTE_HOST if checking for specific "site" names.

Then the additional code looks for a browser or robot named "site1" requesting any file not on your list, and rejects the request if it does not come from a computer at a specific IP address.

Further, the regular-expression pattern "\favicon\.ico" has an unexplained and probably unnecessary regex escape character "\" preceding the "f".

As a result it might be quite helpful to describe, in as precise terms as possible, exactly how you want the user-agent, remote-address, and file list to be used to control access.

Jim

sgibb

7:35 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



Sorry if I wasn't clear. In the first section I was trying to say: if the user-agent is not, say, google or msn, and it is not asking for my 403 file or the robots file, then don't permit access.

Then I was trying to say: if it is google, only permit it if it comes from certain ip addresses.
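
In mod_rewrite terms, that two-part intent might be sketched like this (the user-agent names are placeholders, and the "66.249." Googlebot IP prefix is taken from Jim's example later in this thread):

```apache
# Sketch only: "google"/"msn" stand in for real whitelist patterns
# Part 1: deny any UA that is neither google nor msn,
#         except for the always-allowed files
RewriteCond %{HTTP_USER_AGENT} !google [NC]
RewriteCond %{HTTP_USER_AGENT} !msn [NC]
RewriteRule !^(favicon\.ico|403\.htm|robots\.txt) - [F]
#
# Part 2: deny a UA that *claims* to be google unless the
#         request comes from an expected IP range
RewriteCond %{HTTP_USER_AGENT} google [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule !^(favicon\.ico|403\.htm|robots\.txt) - [F]
```

Note that both parts apply to the same negated file-list pattern; only the conditions differ.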

jdMorgan

7:39 pm on Nov 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So it is your intention to block all visitors except Google and MSN (including yourself) from this site?

Jim

sgibb

7:50 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



Sorry, I was trying to be brief and was just using those as examples. Other user-agents I wanted to let through included Mozilla [2-5], Gigabot, and about a dozen others. My goal is to let in only those user-agents that are not spoofing.

Hope this is clearer. Thanks.

jdMorgan

8:10 pm on Nov 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's most likely that you want to create a whitelist of user-agents. Then if a visitor is not on that list, and if it requests anything but your custom 403 page or robots.txt, deny the request.

The following very brief whitelist demonstrates the idea -- See the Googlebot rule for a method for checking the IP address range.


# Skip all following rule(s) if globally-accessible files requested
RewriteRule ^(favicon\.ico|403\.htm|robots\.txt)$ - [L]
#
# Otherwise, check against user-agent whitelist
# Major SE robots
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} !^Mozilla/[5-6]\.[0-9]+\ \(compatible;\ Googlebot/[2-3]\.[0-9];\ \+http://www\.google\.com/bot\.html\)<>66\.249\.
RewriteCond %{HTTP_USER_AGENT} !^(msnbot(-media|-News|-Products)?|MSNPTC)/[0-9]\.[0-9]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[5-9]\.[0-9]+\ \(compatible;\ (Yahoo!\ )?Slurp;
RewriteCond %{HTTP_USER_AGENT} !^Gigabot
# Major browsers
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[4-5]\.[0-9]+\ \(compatible;\ MSIE\ [3-9]\.[0-9.]+
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[4-5]\.[0-9]+\ \(.+;\ rv:([0-9]+\.)+[0-9a-z]+\)\ Gecko/20[0-9]{6}
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[2-4]\.[0-9]+\ \[[a-z]{2}\](\ \(.+\))?
RewriteRule .* - [F]

The patterns used for the user-agents are my "good enough" patterns for most sites. The Googlebot check works by concatenating the user-agent and the IP address of the requestor. The characters "<>" suggest this concatenation, but they are nothing more than a unique character string used to mark where the user-agent string ends and the partial IP address begins. The "<>" sequence has no special meaning to regular expressions; I just use it to remind myself of what I'm doing, and to clearly separate the two (or more) variables being checked, so as to avoid ambiguity in the comparison.
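
Reduced to its essentials, the concatenation trick looks like this (hypothetical robot name and a documentation-reserved IP prefix, for illustration only):

```apache
# "<>" is literal text in both the test string and the pattern;
# it keeps the UA part of the pattern from accidentally matching
# into the IP-address part of the test string.
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} !^ExampleBot/1\.0<>192\.0\.2\.
RewriteRule .* - [F]
```

Because both variables are expanded into a single test string, a single RewriteCond can require that the claimed user-agent and the source address match *together*, which is exactly what defeats user-agent spoofing.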

Beware of line wrapping due to the display-width restrictions of the forum format; each RewriteCond pattern must be entirely on one line.

Casual readers are warned that the above code is an example only; it will block many, many legitimate requests, because the whitelists as shown are far too exclusive.

Jim

sgibb

10:38 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



Thanks so much for your help Jim. Much appreciated.