Forum Moderators: phranque

Message Too Old, No Replies

Bad bot.pl .htaccess clarification and updates

The order of things and script questions

         

jackhandy

9:33 am on Sep 4, 2006 (gmt 0)

10+ Year Member



I am posting this here because my main issues are htaccess related in reference to this "Perl Server Side CGI Scripting" Forum topic:

modified "bad-bot" script blocks site downloads <--script I'm going to use
[webmasterworld.com...]

Previous 2 references:
bad-bot script: follow-up?
[webmasterworld.com...]

Ban malicious visitors with this Perl Script
[webmasterworld.com...]
--------------------------------------------------------
I know many of you are very adept at scripting, so please be patient and allow me to ask questions. I am doing my research and compiling an .htaccess file that will hopefully work immediately when I add it to my server. Everything I am adding to the .htaccess file is being done gradually, so that I will know when something specific is not working. Researching this forum is not an easy task, and I can understand your frustration with forum noobs like me, because I have literally read thousands of redundant posts where people ask the same questions over and over. I appreciate your patience as I rehash "an old subject" yet again, but if you play along with me, this thread and the three referenced above should bring all of this together and hopefully get indexed in the search engines.

I didn't know this whole bad_bot setup was possible. I am not extremely adept at mod_rewrite and other Apache server functions, but I'm very good at copy/paste and doing research. I feel like I'm back in college with all the reading I've done in the last five days, so bear with me; I'm getting there. Many "expert" replies in this forum are extremely vague, and you can possibly piece things together from hundreds of different topics, but just when you think you've found the answer to your question, along come a hundred better or different ways of doing it, or an update to something eighteen months later. The time you folks waste explaining this --> ¦ makes me wonder why, after several years, someone hasn't come up with a solution to make the damn thing a solid line, but alas, you will once again have to point out that webmasterworld.com has a problem with broken pipes. Dismissing someone to Google to search this site doesn't necessarily help, because the thousands of unanswered redundant posts in every topic clog up the search and cause more redundant posts. Ok, I'm done rambling; I just need the experts here to know where my head is at, and that my goal is to help make it easier for others to do what I am trying to accomplish. Nothing is perfect, but this will make a huge difference on several web sites I own/manage.

----------
1. this bit of code and the 1x1 transparent gif:

<a href="decoy_false_page_name.htm" onmouseover="window.status='Burglar Alarm'; return true;" onclick="return false;">
<img src="../images_folder/oddly_named_graphic.gif" alt="" border="0" width="1" height="1"></a></td>

Exactly where does the code go? On all of my html files? A few of them? One of them? Do I include the main root index.html in that list? I understood that I had to create one html document to be used by the bad_bot script for AFTER a bad bot is caught; should I add it to that file too? What about files in directories protected by mod_rewrite through a PHP membership program? What about the PHP membership program itself? The FrontPage issue is something I see in my error logs all the time.

Second part of the first point, the Redirects:

Redirect /decoy_false_page_name.htm [mydomain.com...]
Redirect /lower_directory/decoy_false_page_name.htm [mydomain.com...]

"Redirect /lower_directory/" Is this a vague clue that I will need this Redirect for each directory containing files that I paste the gif code to?

----------
2. the robots.txt file:

User-agent: *

Disallow: /cgi-bin/

I already have the "Disallow" as listed above in my robots.txt file even though the cgi-bin on my server is not directly accessible from the web, and besides there is nothing in it anyway. Do I need to specifically add "Disallow: /cgi-bin/bad_bot.pl" (or whatever I name the file with the .pl or .cgi extension), to robots.txt when I am ready to upload the script to the cgi-bin?

FYI: I added the two file names to the robots.txt file, as suggested in the topics referenced above, in preparation of setting this all into motion next week.
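In other words, once the script goes live my robots.txt should look something like this (the decoy and script names here are placeholders, not the real ones I'll use):

```
User-agent: *
# cgi-bin is already off-limits, which also covers the script path itself
Disallow: /cgi-bin/
# the decoy pages that trigger bad_bot.pl
Disallow: /decoy_false_page_name.htm
Disallow: /lower_directory/decoy_false_page_name.htm
```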

----------
3. The document root .htaccess file, or the order of things to come:

A. Where exactly is bad_bot.pl going to write the banned IP's in the .htaccess file? It was stated already that the script writes them at the beginning, but... My server has Cpanel installed and the IP Banning function prints the "Deny From IP ..." all over the damn thing, usually at the bottom but often anywhere there is a blank line. I am wondering if I need to convert the 200 or so banned IP's I have now to "SetEnvIf" instead of the "Deny From..." currently being used with the <files> directive. Anything to automate this process?

B. So, this part is the first thing *I* add to my new and improved .htaccess file, correct? Is there a better, more efficient way of writing this?

#Banned IP's from script
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>

This comes next...

#Nobody can view htaccess files on this server
<Files .htaccess>
order deny,allow
deny from all
</Files>

And then this?

# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt|/file_instead_of_what_they_want\.htm)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

Redirect /decoy_false_page_name.htm [mydomain.com...]
Redirect /lower_directory/decoy_false_page_name.htm [mydomain.com...]

----------
I am compiling a list of user agents, referrers, remote addresses, and the other stuff I've gathered from my research (so far) that I will post later, and perhaps with some input from others we can compile an up-to-date list, or maybe there's a current thread that I missed where this has been done. But for now all I need to know is where does that information go in the order of things. I am using the rewrite method, I've already decided on that and I have my reasons. I do have one question though, when looking for stuff to add to the list throughout webmasterworld.com I saw there were two ways to end a line, one just has [OR] and the other included 'no case' [NC,OR]. Which is the preferred method?
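For example, the two line endings I keep running into look like this (the bot names are made up for illustration):

```
RewriteCond %{HTTP_USER_AGENT} ^SomeBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^someotherbot [NC,OR]
```

From what I can tell, [NC] just means 'no case' (a case-insensitive match) while [OR] chains the condition to the next line, but I'd like confirmation on which is preferred for user-agent lists.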

Here are references to directives I currently have in my document root .htaccess file that work as they should and serve a purpose to my site so I'll need to know where they will go in the order of things to come as well.

Options +FollowSymlinks
RewriteEngine On
RewriteBase /

#No .www in domain name

#Force trailing / on all directories

#Force https on 'directory1' and 'directory2' directories

#Block access of graphics files from outside domain
(QUESTION: Can I include files with the extension .js here?)
RewriteRule .*\.(jpeg|jpg|gif|png|bmp|js)$ - [NC,F]

I have this at the bottom of the current file, it works placed there so I am guessing that it doesn't matter.

#Increase PHP upload file size
<IfModule mod_php4.c>
php_value upload_max_filesize 10M
</IfModule>

----------
Thanks for all your assistance in advance, especially Key_Master for thinkin' up the bad_bot.pl script.

jackhandy

9:50 am on Sep 7, 2006 (gmt 0)

10+ Year Member



Isn't anyone going to answer my questions?

jdMorgan

3:19 pm on Sep 7, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1) The code can go anywhere, on any page. I suggest including it in your common included page header (if you have one), or putting it on your most-frequently-harvested pages first.

1b) No, you can put the code in your top-level .htaccess file, where it can apply to all subdirectories as well.

P.S. Don't use a redirect, use a rewrite. Using a redirect exposes the action, and you may end up with your bot-banning script listed in search engines, which would be very very bad.
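As a sketch only (using your placeholder filename), an internal rewrite keeps the script URL invisible to the client:

```
# Internally map the decoy page to the trap script. Unlike Redirect, no new
# URL is sent to the client, so the bad_bot.pl address never gets exposed.
RewriteRule ^decoy_false_page_name\.htm$ /cgi-bin/bad_bot.pl [L]
```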

2) You must disallow any path to the script, direct or indirect.

You should upload and test this new robots.txt several days before deploying the script, to be absolutely sure that all major search engines get an updated robots.txt before the script is activated. This applies to any changes you make as well -- you should update robots.txt well in advance of adding any new access restrictions.

3) The script writes at the beginning of the file. It is simpler and faster to do it this way, because it eliminates any need to parse the .htaccess file to find the correct record insertion point.

3b) You should not need to convert anything regarding the mod_access directives. Just add the single "Deny from env=ban" in your existing <files> container. You should also combine all of your mod_access directives together under one "Order" statement to avoid trouble.
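As a rough sketch of what I mean, using your own names (the IP prefixes are just the first two from your list):

```
<Files *>
Order Allow,Deny
Allow from all
# single env-based Deny covering the lines that bad_bot.pl writes
Deny from env=ban
# your existing cPanel-written IP bans, gathered under the same Order
Deny from 218.160.
Deny from 218.161.
</Files>
```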

4) The order of directives in .htaccess is not an obvious thing. When .htaccess is processed, it is examined in turn by each Apache module -- but the order of execution of these modules is *not* controllable by .htaccess. Each module finds and executes only the directives that it understands, in the order found.

Therefore, the order that you place directives for any one given module is important, but you cannot force the order of execution on a directive-by-directive basis if you mix directives for different modules.

In most cases, the server is configured to process the Apache modules in a sensible order. Occasionally, someone doing his own server configuration on Apache 1.x may get the LoadModule order incorrect and mess things up, but generally, it's not a problem.

Forums such as this are best suited for single, simple, well-focused questions. Long, complicated posts, and posts requesting others to do research [webmasterworld.com], will scare potential responders off. Also, the script in question is not widely discussed, because to do so makes it easier to defeat. That concludes my 20 minutes for this morning...

Jim

jackhandy

10:44 am on Sep 11, 2006 (gmt 0)

10+ Year Member



Jim, sorry 'bout the long post. That was for my sake so I could focus on only one thread for now so I don't screw this up...

I've seen the IndexIgnore directive below in some .htaccess files on this subject and not others. Do I need this directive for the bad_bot.pl if I don't use the FrontPage server extensions?

IndexIgnore .htaccess */.?* *~ *# */HEADER* */README* */_vti*

I have done some condensing and this is what I have so far in my "not-yet-deployed" .htaccess file minus the bad bot rewrites because the list is so long:
-------------------------------------------

# Begin .htaccess: bad_bot.pl will write SetEnvIf "getout" here

# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.shtml|/robots\.txt|/file_instead_of_what_they_want\.htm)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
deny from 218.160.
deny from 218.161.
deny from 218.162.
deny from 218.163.
deny from 218.164.
deny from 218.165.
**SNIP** extremely long IP deny list from current .htaccess
</Files>

<IfModule mod_rewrite.c>
Options +FollowSymlinks
RewriteEngine On
RewriteBase /

#Changed bad_bot.pl Redirect's to Internal Rewrites
RewriteRule ^fake_file\.html$ /cgi-bin/bad_bot.pl [L]
RewriteRule ^directory1/fake_file\.html$ /cgi-bin/bad_bot.pl [L]
RewriteRule ^directory2/fake_file\.html$ /cgi-bin/bad_bot.pl [L]
RewriteRule ^directory3/fake_file\.html$ /cgi-bin/bad_bot.pl [L]

#Limit HTTP methods - block access to .htaccess
RewriteCond %{REQUEST_METHOD} ^(PUT|DELETE|CONNECT)$ [OR]
RewriteCond %{REQUEST_URI} ^/\.ht
RewriteRule .* - [F]

###?PLACE THE BAD BOT LIST (AND OTHER BAD STUFF) REWRITES HERE?###

#No .www in domain name
RewriteCond %{HTTP_HOST} ^www\.domain\.net$ [NC]
RewriteRule ^(.*)$ [domain.net...] [R=301,L]

#Force trailing / on all directories
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ [domain.net...] [L,R=301]

#Force https on 'directory1' and 'directory2' directories
RewriteCond %{SERVER_PORT} 80
RewriteCond %{REQUEST_URI} directory1 [OR]
RewriteCond %{REQUEST_URI} directory2
RewriteRule ^(.*)$ [domain.net...] [R,L]

#Block access of graphics files from outside domain
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?domain\.net [NC]
RewriteCond %{HTTP_REFERER} !^https://(www\.)?domain\.net [NC]
RewriteRule .*\.(jpeg|jpg|gif|png|bmp|js)$ - [NC,F]
</IfModule>

#Increase PHP upload file size
<IfModule mod_php4.c>
php_value upload_max_filesize 10M
</IfModule>

#END HTACCESS
-------------------------------------------

Questions:
1. Where will the rewrites for all the stuff I want to block go? (ie, the {REQUEST_URI}, {HTTP_REFERER}, {HTTP_USER_AGENT}, and {REMOTE_ADDR} list I have that either sends to a 403.shtml or the /cgi-bin/bad_bot.pl script). My educated 'guess' is after "#Limit HTTP methods - block access to .htaccess" rewrite rule.

2. The files directive above is different than the one in my current .htaccess file. Does the files directive below indicate allow access to all ONLY for 403.shtml? Whereas the one above being all files are denied with the exception of 'allowsome' (which is basically the same thing)? I've spent so much time on getting the rewrite rules correct that I am now completely confused by the files directive. In my original post above I used two different versions of the files directive from two different versions of the bad_bot.pl script, but I think I got that one straightened out and narrowed down! I don't want to block any innocent people that are eager to put money in my pocket.

#In current .htaccess file
<Files 403.shtml>
order allow,deny
allow from all
deny from 211.161.
deny from 211.162.
**snip**
</files>

I won't post my 'bad bot and other bad stuff list' until what I have done so far is squared away. I don't have root access to my server so I'm counting on it being "configured to process the Apache modules in a sensible order" as Jim stated above. I am not asking anyone to do my research, it's done, I just need validation that what I have compiled will work "out of the box" on Apache 1.x. As I stated, researching this forum can be more confusing than the results. As long as I get the syntax correct my server appears to process what I put in the .htaccess file correctly.

jackhandy

9:46 pm on Sep 13, 2006 (gmt 0)

10+ Year Member



Please, somebody respond. I need to get this thing going. My updated robots.txt has been online for about 8 days now...

jdMorgan

11:20 pm on Sep 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Questions:
1. Where will the rewrites for all the stuff I want to block go? (ie, the {REQUEST_URI}, {HTTP_REFERER}, {HTTP_USER_AGENT}, and {REMOTE_ADDR} list I have that either sends to a 403.shtml or the /cgi-bin/bad_bot.pl script). My educated 'guess' is after "#Limit HTTP methods - block access to .htaccess" rewrite rule.

If the end result is blocking -- via a 403 response, then it really doesn't matter what order you put them in.

A good way to order the blocked accesses is to put the most-frequently-blocked patterns first. For lack of any better determinant, this simply 'gets rid of' the worst abusers as fast as possible, without wasting time running the subsequent rules.

When you add the 'ban list' stuff in, then that complicates things slightly -- You don't want to dismiss a bad user-agent with a 'block' if it is accessing a protected file and deserves a ban, for example. In most cases, this is taken care of by the fact that mod_access directives are processed before mod_rewrite directives, regardless of their position in your .htaccess file, so previously-banned visitors won't even run your mod_rewrite code. But it's the initial request handling (before the malicious visitor is banned) where attention to rule order is required.

2. The files directive above is different than the one in my current .htaccess file. Does the files directive below indicate allow access to all ONLY for 403.shtml? Whereas the one above being all files are denied with the exception of 'allowsome' (which is basically the same thing)? I've spent so much time on getting the rewrite rules correct that I am now completely confused by the files directive. In my original post above I used two different versions of the files directive from two different versions of the bad_bot.pl script, but I think I got that one straightened out and narrowed down! I don't want to block any innocent people that are eager to put money in my pocket.

#In current .htaccess file
<Files 403.shtml>
order allow,deny
allow from all
deny from 211.161.
deny from 211.162.
**snip**
</files>

A <Files> container simply makes the execution of the code it contains conditional; in the example directly above, the code will not be executed if the filename is not 403.shtml. If the filename *is* 403.shtml, then access will be denied if the requesting IP address starts with 211.161 or 211.162. So the code is backwards from what you want.

However, you should never deny access to your 403 Error page, because the result of this denial will be another 403 error response, which will attempt to access the 403 error page again, leading to another 403 response, ad infinitum, until either the client or the server reaches its maximum redirection limit. Basically, you should never deny access to your robots.txt and your 403 error handler under any circumstances. If you do, your site will not be compliant with HTTP protocol requirements. That is the purpose of the line of code that sets the "allowsome" environment variable.

Taking this code one section at a time, and in no particular order, here are some comments:

<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
deny from 218.160.
deny from 218.161.
deny from 218.162.
deny from 218.163.
deny from 218.164.
deny from 218.165.
**SNIP** extremely long IP deny list from current .htaccess
</Files>

Here, it appears that you might benefit from using regular expressions. This would require the use of SetEnvIf, but might compress your list significantly. For example,

SetEnvIf Remote_Addr ^218\.16[0-5]\. getout

combined with the existing

Deny from env=getout

in the later <Files> section replaces six directives with one. Much more complex regex patterns can be used, and may reduce the number of directives even more.

Be aware that only the last "Order" directive found in a given scope will apply, so use one and only one per <Files> container.

I strongly suggest that you use the capitalization conventions shown in the Apache module documentation. Case should not matter, but on rare occasions, I have seen servers where it did -- probably due to a fundamental OS misconfiguration. But I recommend making your code as robust as possible, since such an OS config problem might be out of your control on shared hosting, for example.

 <IfModule mod_rewrite.c> 

This directive is of little use here. It is intended for code that must be applied to a wide variety of servers: if mod_rewrite is not loaded, the code won't be executed, and no mod_rewrite errors will be reported as a result. That means you'll have a silent failure if your host disables mod_rewrite. Of course, disabling mod_rewrite would also disable most of your site, so I'd say it would be better to get mod_rewrite errors than a silent failure!

#Limit HTTP methods - block access to .htaccess
RewriteCond %{REQUEST_METHOD} ^(PUT|DELETE|CONNECT)$ [OR]
RewriteCond %{REQUEST_URI} ^/\.ht
RewriteRule .* - [F]

You might want to change the pattern for .htaccess and .htpasswd files here, since this code will only protect .htaccess and .htpasswd files in your top-level directory:

RewriteCond %{REQUEST_URI} ^/([^/]+/)*\.ht

will match any number of directories, each followed by a slash, with ".ht" immediately following that slash.

#Force trailing / on all directories
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ http://example.com/$1/ [L,R=301]

Remember that filesystem checks like "-f" or "-d" can be hundreds of times slower than the regex tests, because the server must call the OS to check the filesystem on disk, and that can have a major performance impact on your server. Therefore, the file check should be the last condition you test:

#Force trailing / on all directories
RewriteCond %{REQUEST_URI} !/$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule (.*) http://example.com/$1/ [L,R=301]

Note also that I eliminated unnecessary/redundant regex tokens.

#Force https on 'directory1' and 'directory2' directories
RewriteCond %{SERVER_PORT} 80
RewriteCond %{REQUEST_URI} directory1 [OR]
RewriteCond %{REQUEST_URI} directory2
RewriteRule ^(.*)$ https://example.com/$1 [R,L]

You want an HTTPS redirect on any port that includes "80", like port 8000 or port 8080, or port 65480? And you want a 302-Temporary redirect? I suspect that neither case is true, so I'd suggest:

#Force https on 'directory1' and 'directory2' directories
RewriteCond %{SERVER_PORT} ="80"
RewriteCond %{REQUEST_URI} directory1 [OR]
RewriteCond %{REQUEST_URI} directory2
RewriteRule (.*) https://example.com/$1 [R=301,L]

Here, we can simplify the regex again by eliminating unnecessary/redundant tokens and patterns.


#Block access of graphics files from outside domain
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?example\.com [NC]
RewriteCond %{HTTP_REFERER} !^https://(www\.)?example\.com [NC]
RewriteRule \.(jpe?g|gif|png|bmp|js)$ - [NC,F]

#Increase PHP upload file size 
<IfModule mod_php4.c>
php_value upload_max_filesize 10M
</IfModule>

Here again, it's likely your whole site will be useless if php4 isn't loaded, so there may not be any use in slowing down every request just for the sake of suppressing an error message you'd probably want to see if your host disabled php4.

---

In general, the following order works well for me, with "block" meaning a 403 response, and "ban" meaning a call to the bad-bot script.

  • Bad-bot bans using mod_setenvif and mod_access "Deny From"
  • Bypass all following mod_rewrite rules if 403 page, robots.txt, or bad-bot script URL is requested by using "- [L]"-type rule.
  • Block unwanted HTTP methods
  • Block .htaccess, .htpasswd requests
  • Ban User-agents which are known to "morph" into other user-agents if blocked, e.g. MS URL Control
  • Ban if referrer or User-agent is a "fake" blank value (actually containing a hyphen)
  • Block if both referrer and user-agent are blank, except for HEAD requests.
  • Block by User-agent or IP address (legacy blocks from log-watching and known harvesters)
  • Block faked or malformed Mozilla User-agents. (Missing semicolons, invalid Windows versions, etc.)
  • Block proxy throughput requests {e.g. Request_URI is something like "http://www.yahoo.com" and not your own domain(s)}
  • Block image/script hotlinking
  • Ban requests for URL-paths mentioned ONLY in robots.txt
  • Ban invalid attempts to POST to e-mail forms
  • Ban attempts to fetch Disallowed files (I strongly recommend excluding specific known search engine spider UAs and Wireless access proxy UAs from this rule)
  • 410-Gone handling for removed pages
  • Per-page redirects for replaced pages
  • Per-domain redirects for non-canonical domain request, HTTPS, etc.

In conclusion, you'll need to step back from this code and interpret it at the block level to determine whether you're happy with the ordering of the mod_rewrite rules. If you put all the bad-bot rewrites at the top, then you may end up calling bad_bot.pl and recording the individual IP addresses of some very common "dumb" scraper User-agents that never change their User-agent name. Doing that, instead of simply blocking them by User-agent, would be a waste of space in your blocked IP list, and would cause it to grow too fast. So, block the obvious scraper UAs first, before running the bad-bot ban code.

I hope this was useful. We don't encourage code review-type threads here, because it takes far too long to review code at both the line-by-line level and at the "holistic" level, so such threads don't comport well with the limited scope of typical forum interactions. But maybe this thread will benefit others in the future and make the time invested worthwhile.

Posting on this forum converts solid pipe characters to broken "¦" characters; if any remain in the code above, replace them with solid pipes before use.

Jim
