Forum Moderators: phranque


Spider trap htaccess code

A small part of the bigger picture

         

fish_eye

4:41 am on Jul 28, 2005 (gmt 0)

10+ Year Member



I have successfully implemented a PHP spider trap [webmasterworld.com] but would like it to behave slightly differently to the standard offering.

It looks to me like the part of the picture I need to change is contained in the .htaccess file. Specifically:

SetEnvIf Request_URI "^(/403.*\.htm¦/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

Some of you will know that the above snippet is amended by the PHP (or Perl) program so that a list of IPs is built above these statements. When one of these IPs revisits, the "getout" environment variable is set, and a 403 is therefore sent back to the requester.
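For readers who want to trace the logic, here is a minimal Python sketch (not from the original thread) of the access decision the SetEnvIf / order deny,allow block above performs. The URI pattern is the one from the snippet (with a solid pipe), and the banned-IP set stands in for the list the trap script writes:

```python
import re

# Pattern from the SetEnvIf line above (solid pipe; the forum displays it broken)
ALLOWSOME = re.compile(r"^(/403.*\.htm|/robots\.txt)$")

def access_decision(remote_addr, request_uri, banned_ips):
    """Sketch of Apache's 'order deny,allow' evaluation for this block."""
    getout = remote_addr in banned_ips                     # deny from env=getout
    allowsome = ALLOWSOME.match(request_uri) is not None   # allow from env=allowsome
    # With 'order deny,allow', Allow directives are evaluated last and win;
    # requests matching neither list fall through to the default (allow).
    if getout and not allowsome:
        return 403
    return 200
```

So a banned IP can still fetch the 403 page and robots.txt, while everything else it requests is refused.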

Rather than having a 403, I wanted to redirect to a specific page with a blurb explaining what has happened (and with some contact details - obviously not a link to the site).

My reasoning for this is that I do not want to ban ALL users of a particular IP just because one has been naughty.

I thought this might be particularly relevant for websites directed at students - some of whom would happily share IPs with other students innocently playing around with robots.

I guess the "deny from env=" is what causes the 403?

Do I need to completely rethink this part of the puzzle in order to do this redirect?

Wizcrafts

5:58 pm on Jul 28, 2005 (gmt 0)

10+ Year Member



Fish_Eye wrote:

Rather than having a 403, I wanted to redirect to a specific page with a blurb explaining what has happened

Fish_Eye;
You can go about accomplishing this goal in at least three ways.

1: Create a custom 403 page that briefly explains why visitors may be denied access (keep to about 2 kb)

2: Create a custom 403 page that says "Access Denied!" "Go here to read about our access control policies"

Provide a link on "Go here" to a "403b" page that explains your policies and what circumstances will get a visitor's IP or User Agent banned. Provide a link on the second page to a form they can use to request removal from the blocklist. Add both the 403b and removal request path/page to the "allowsome" list.

3: Add text to the banning script that explains why the script was tripped, and include a link to the request-removal form. Add that form and path to the "allowsome" ENV group.

I use the second 403(b) solution, with a third page for removal requests, and allow these pages to be accessed by IPs in ENV=getout. In the year and a half since I implemented this system, not one banned or 403'd visitor has ever filled out the removal request form, although many have followed the links to it and landed there.

You can also get fancy and create a special RewriteRule that redirects visitors who trip your ban script to a special banned explanation page, which has a removal request link or form. This way they never see the 403 page, but I think this is a waste of time. Adding a link from the short 403 page to a second explanation page works better for me.
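The "fancy" RewriteRule approach described above might look roughly like this - a sketch only, with a hypothetical page name and IP, using an external 302 redirect so the visitor lands on the explanation page rather than seeing a 403:

```apache
RewriteEngine on
# Hypothetical banned IP; the trap script would maintain these lines
RewriteCond %{REMOTE_ADDR} ^99\.99\.99\.99$
# Don't redirect requests for the explanation page itself (avoids a loop)
RewriteCond %{REQUEST_URI} !^/banned-explanation\.html$
RewriteRule .* /banned-explanation.html [R=302,L]
```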

Wiz

fish_eye

12:26 am on Jul 29, 2005 (gmt 0)

10+ Year Member



Thanks for the input Wiz.

Add text to the banning script that explains why the script was tripped, and include a link to the request-removal form. Add that form and path to the "allowsome" ENV group.

In my case the person who is banned does not see the page generated by the banning script... and I take your point that it's unlikely anyone will ever find themselves in this position - I guess I just like to cover my bases (good practice to handle otherwise irrecoverable exceptions, as I see it - I obviously have too much time on my hands at the moment :)).

Anyway, I'm assuming the <files *> technique is not used in this case.

Do you have an example of some .htaccess code that tests the env variable (or sets and tests a custom env variable)? Is it something as simple as:

SetEnvIf Remote_Addr ^99.99.99.99$ getout
RewriteEngine on
RewriteCond env=getout
RewriteRule ^.*$ my403bPage.php

fish_eye

12:59 am on Jul 29, 2005 (gmt 0)

10+ Year Member



By the way - I created two posts about this (trying to keep two concepts separate):
* One (this one) about a specific line/technique in .htaccess (the use of SetEnvIf and <Files> and alternatives), and
* the other (http://www.webmasterworld.com/forum88/9374.htm [webmasterworld.com]) about the PHP spider trap.

The inevitable has happened and they are converging (and I tried so hard to do the right thing!)

Sorry (slink slink)

jdMorgan

3:33 am on Jul 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To expand on Wizcrafts' comments, I leave the information about what can get you banned quite vague -- it's worded in "Terms of Use" language, and does not provide any technical details. There's no use in describing the detailed function of your home security system to a known burglar...

There is no need to mix Apache modules. Most of the time you can mix them, but there are module load-order dependencies that can trip you up (now, or later if you change hosts):

Mod_rewrite Method:


RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^99\.99\.99\.99$
RewriteCond %{REQUEST_URI} !^/(my403bPage\.php¦robots\.txt¦tos\.html)$
RewriteRule .* /my403bPage.php [L]

The negative pattern in the 2nd RewriteCond prevents an infinite loop on your 403 page, and allows access to robots.txt and the site's Terms of Service page (the same function is implemented differently in the code below).
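A quick sanity check (a sketch, not part of the original post): since %{REQUEST_URI} always begins with a slash, the exclusion pattern needs to account for it. Exercising the pattern with Python's re module shows which URIs escape the rewrite:

```python
import re

# Exclusion pattern from the RewriteCond, written with the leading slash
# that %{REQUEST_URI} always carries (solid pipes, per the note below)
exclude = re.compile(r"^/(my403bPage\.php|robots\.txt|tos\.html)$")

for uri in ["/my403bPage.php", "/robots.txt", "/tos.html", "/index.html"]:
    # The "!" in the RewriteCond negates the match: no match means rewrite
    rewritten = exclude.match(uri) is None
    print(uri, "-> rewritten" if rewritten else "-> allowed through")
```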

Mod_access Method:
SetEnvIf Request_URI "(my403bPage\.php¦robots\.txt¦tos\.html)$" allowit
<Files *>
Order Deny,Allow
Deny from env=getout
Deny from 99.99.99.99
Allow from env=allowit
</Files>

Change the broken pipe "¦" characters above to solid pipes before use. Posting on this forum modifies them, and they will cause errors.

Jim

Wizcrafts

3:55 am on Jul 29, 2005 (gmt 0)

10+ Year Member



Fish_Eye wrote:

In my case the person who is banned does not see the page generated by the Banning script

You can add text printouts to the trap script that tell them whatever you want. Some here feel that once a trap has been sprung, there is no point wasting bandwidth informing the trapped party about your security measures. The reasoning is that this may make them work at hacking your website to get even, or try to find a workaround.

Here is an example of html output (to screen) that is added to a Perl bot trap:

print "Content-type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Access Denied</title>\n";
print "<meta name=\"robots\" content=\"noindex,nofollow\">\n";
print "</head>\n";
print "<body>\n";
print "<center><h1>Access Denied</h1></center>\n";
print "<p>To find out what may have caused you to be denied access to our website click here ([i]link to another page with explanations about your access control policies[/i])</p>\n";
print "</body>\n";
print "</html>\n";


You can use this example to create your own text printout. This is taken from a Perl script I use, but is probably compatible with PHP as well (I yield to PHP experts on this).

fish_eye

6:55 am on Jul 29, 2005 (gmt 0)

10+ Year Member



Thanks folks.

Jim, so if I have more than one banned IP, it would be:


RewriteCond %{REMOTE_ADDR} ^99\.99\.99\.99$ [OR]
RewriteCond %{REMOTE_ADDR} ^88\.88\.88\.88$
RewriteCond %{REQUEST_URI} !^/(my403bPage\.php¦robots\.txt¦tos\.html)$

jdMorgan

2:35 pm on Jul 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, the last RewriteCond is ANDed with the rest, just as you have shown it.
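For reference, the [OR]-ed address conditions can also be collapsed into one regex alternation - a sketch using the same placeholder IPs and page names:

```apache
RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^(99\.99\.99\.99|88\.88\.88\.88)$
RewriteCond %{REQUEST_URI} !^/(my403bPage\.php|robots\.txt|tos\.html)$
RewriteRule .* /my403bPage.php [L]
```

Either form behaves the same; one condition per IP is easier for a script to append to automatically.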

Jim

fish_eye

10:39 pm on Jul 30, 2005 (gmt 0)

10+ Year Member



I took your advice, Jim, and decided not to mix the two (kind of)... I went with the mod_access type (because the PHP to write the new lines in is simpler, and I don't have to change it :)), so I wound up with:

SetEnvIf Remote_Addr ^99\.99\.99\.99$ getout
# (the line above is only written after the first violation)
SetEnvIf Request_URI "^(/my403bPage\.php¦/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
ErrorDocument 403 /my403bPage.php
Options +FollowSymLinks
RewriteEngine on
etc etc etc

Is that "bad" htaccessing?

I don't have a tos yet but it's not a bad idea.

PS. I'm not sure (really) why the robots.txt is there, but I've raised this in a thread about the spider trap technique generally [webmasterworld.com], not the micro-htaccess part of it.

jdMorgan

12:25 am on Jul 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's not "bad htaccessing," but there's a spurious accented-A character following ".php" in your second SetEnvIf directive. It may just be a character-set incompatibility between your machine and phpBB, or it may be real. If it is actually there in your code, it will cause problems.

I answered your question about robots.txt in the other thread.

Jim

fish_eye

3:06 am on Jul 31, 2005 (gmt 0)

10+ Year Member



The spurious A is not in my code. I think it has something to do with the old "broken pipe not displaying in this forum" issue.

Lucky you Jim - to get to mod the forum where this must manifest in about 1 in 3 threads :)!

Many thanks, Sam.

fish_eye

1:52 pm on Aug 18, 2005 (gmt 0)

10+ Year Member



I've only just noticed this in testing one of my sites.

All works fine on the site I put this on some weeks ago but I get an unexpected error on another.

The basic banning works fine but if I then try to get into the site having been banned I get:


You don't have permission to access /getout.php on this server.
Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.

I have identical .htaccess code (above the ErrorDocument 403).

The sites are on different servers. I don't have access to the main apache modules (not that I'm aware of anyway).

Any clues as to what may be set differently on each server?

jdMorgan

2:36 pm on Aug 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is "my403bPage\.php" named identically on both servers? You'll have to specifically allow access to it.

Jim

fish_eye

3:04 pm on Aug 18, 2005 (gmt 0)

10+ Year Member



Yep - double-checked that one, and the code was cut(ted) and paste(d) in.