Forum Moderators: phranque
modified "bad-bot" script blocks site downloads <--script I'm going to use
[webmasterworld.com...]
Previous 2 references:
bad-bot script: follow-up?
[webmasterworld.com...]
Ban malicious visitors with this Perl Script
[webmasterworld.com...]
--------------------------------------------------------
I know many of you are very adept at scripting, so please be patient and allow me to ask questions. I am doing my research and compiling an .htaccess file that will hopefully work immediately when I add it to my server. Everything I am adding to the .htaccess file is being done gradually, so that I will know when something specific is not working. Researching this forum is not an easy task, and I can understand your frustration at forum noobs like me, because I have literally read thousands of redundant posts where people ask the same questions over and over. I appreciate your patience with me as I rehash "an old subject" yet again, but if you play along with me here, this thread and the three referenced above should bring all of this together and hopefully get indexed by the search engines.
I didn't know this whole bad_bot setup was possible. I am not extremely adept at mod_rewrite and other Apache server functions, but I'm very good at copy/paste and doing research. I feel like I'm back in college with all the reading I've done in the last five days. I'm getting there, so bear with me. Many "expert" replies in this forum are extremely vague, and you can possibly piece things together from hundreds of different topics, but just when you think you've found the answer to your question, along come a hundred better or different ways of doing it, or there's an update to something eighteen months later. The time you folks waste explaining this --> ¦ makes me wonder why, after several years, someone hasn't come up with a solution to make the damn thing be a solid line, but alas, you will once again have to point out that webmasterworld.com has a problem with broken pipes. Dismissing someone to Google to search this site doesn't necessarily help, because the thousands of unanswered redundant posts in every topic are clogging up the search and causing more redundant posts. Ok, I'm done rambling. I just need the experts here to know where my head is at, and that my goal is to help make it easier for others to do what I am trying to accomplish. Nothing is perfect, but this will make a huge difference on several web sites I own/manage.
----------
1. this bit of code and the 1x1 transparent gif:
<a href="decoy_false_page_name.htm" onmouseover="window.status='Burglar Alarm'; return true;" onclick="return false;">
<img src="../images_folder/oddly_named_graphic.gif" alt="" border="0" width="1" height="1"></a></td>
Exactly where does the code go? All of my html files? A few of them? Just one? Do I include the main root index.html in that list? I understood that I had to create one html document to be used by the bad_bot script AFTER a bad bot is caught; should I add it to that file too? What about files in directories protected by mod_rewrite through a PHP membership program? What about the PHP membership program itself? The FrontPage issue is something I see in my error logs all the time.
Second part of the first point, the Redirects:
Redirect /decoy_false_page_name.htm [mydomain.com...]
Redirect /lower_directory/decoy_false_page_name.htm [mydomain.com...]
"Redirect /lower_directory/" Is this a vague clue that I will need this Redirect for each directory containing files that I paste the gif code to?
----------
2. the robots.txt file:
User-agent: *
Disallow: /cgi-bin/
I already have the "Disallow" as listed above in my robots.txt file even though the cgi-bin on my server is not directly accessible from the web, and besides there is nothing in it anyway. Do I need to specifically add "Disallow: /cgi-bin/bad_bot.pl" (or whatever I name the file with the .pl or .cgi extension), to robots.txt when I am ready to upload the script to the cgi-bin?
FYI: I added the two file names to the robots.txt file, as suggested in the topics referenced above, in preparation for setting this all into motion next week.
----------
3. The document root .htaccess file, or the order of things to come:
A. Where exactly is bad_bot.pl going to write the banned IPs in the .htaccess file? It was stated already that the script writes them at the beginning, but... My server has Cpanel installed, and its IP Banning function prints the "Deny From IP ..." all over the damn thing, usually at the bottom but often anywhere there is a blank line. I am wondering if I need to convert the 200 or so banned IPs I have now to "SetEnvIf" instead of the "Deny From..." currently being used with the <files> directive. Anything to automate this process?
B. So, this part is the first thing *I* add to my new and improved .htaccess file, correct? Is there a better, more efficient way of writing this?
#Banned IP's from script
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>
This comes next...
#Nobody can view htaccess files on this server
<Files .htaccess>
order deny,allow
deny from all
</Files>
And then this?
# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt|/file_instead_of_what_they_want\.htm)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
Redirect /decoy_false_page_name.htm [mydomain.com...]
Redirect /lower_directory/decoy_false_page_name.htm [mydomain.com...]
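As a quick sanity check (a sketch only; grep -E accepts roughly the same extended-regex syntax that SetEnvIf uses, and the file names mirror the example above), the 'allowsome' pattern can be exercised from a shell:

```shell
# Pattern copied from the SetEnvIf line above, with solid pipes restored
pattern='^(/403.*\.htm|/robots\.txt|/file_instead_of_what_they_want\.htm)$'

# URIs that match get the 'allowsome' variable set, so banned clients
# can still reach the 403 page and robots.txt
echo "/robots.txt" | grep -Eq "$pattern" && echo "allowsome set"      # prints "allowsome set"
echo "/index.html" | grep -Eq "$pattern" || echo "allowsome not set"  # prints "allowsome not set"
```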
----------
I am compiling a list of user agents, referrers, remote addresses, and the other stuff I've gathered from my research (so far) that I will post later, and perhaps with some input from others we can compile an up-to-date list, or maybe there's a current thread that I missed where this has been done. But for now, all I need to know is where that information goes in the order of things. I am using the rewrite method; I've already decided on that and I have my reasons. I do have one question, though: when looking for stuff to add to the list throughout webmasterworld.com, I saw there were two ways to end a line, one with just [OR] and the other with 'no case' added, [NC,OR]. Which is the preferred method?
Here are references to directives I currently have in my document root .htaccess file that work as they should and serve a purpose to my site so I'll need to know where they will go in the order of things to come as well.
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
#No .www in domain name
#Force trailing / on all directories
#Force https on 'directory1' and 'directory2' directories
#Block access of graphics files from outside domain
(QUESTION: Can I include files with the extension .js here?)
RewriteRule .*\.(jpeg|jpg|gif|png|bmp|js)$ - [NC,F]
I have this at the bottom of the current file, it works placed there so I am guessing that it doesn't matter.
#Increase PHP upload file size
<IfModule mod_php4.c>
php_value upload_max_filesize 10M
</IfModule>
----------
Thanks for all your assistance in advance, especially Key_Master for thinkin' up the bad_bot.pl script.
1b) No, you can put the code in your top-level .htaccess file, where it can apply to all subdirectories as well.
P.S. Don't use a redirect, use a rewrite. Using a redirect exposes the action, and you may end up with your bot-banning script listed in search engines, which would be very very bad.
2) You must disallow any path to the script, direct or indirect.
You should upload and test this new robots.txt several days before deploying the script, to be absolutely sure that all major search engines get an updated robots.txt before the script is activated. This applies to any changes you make as well -- you should update robots.txt well in advance of adding any new access restrictions.
3) The script writes at the beginning of the file. It is simpler and faster to do it this way, because it eliminates any need to parse the .htaccess file to find the correct record insertion point.
3b) You should not need to convert anything regarding the mod_access directives. Just add the single "Deny from env=ban" in your existing <files> container. You should also combine all of your mod_access directives together under one "Order" statement to avoid trouble.
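As a sketch only (directive names taken from this thread; the IP prefix is just an example from the list posted below), the combined container might look like this:

```apache
# All mod_access directives in one place, under a single Order statement
<Files *>
    Order Allow,Deny
    Allow from all
    Deny from env=ban
    Deny from 218.160.
</Files>
```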
4) The order of directives in .htaccess is not an obvious thing. When .htaccess is processed, it is examined in turn by each Apache module -- but the order of execution of these modules is *not* controllable by .htaccess. Each module finds and executes only the directives that it understands, in the order found.
Therefore, the order that you place directives for any one given module is important, but you cannot force the order of execution on a directive-by-directive basis if you mix directives for different modules.
In most cases, the server is configured to process the Apache modules in a sensible order. Occasionally, someone doing his own server configuration on Apache 1.x may get the LoadModule order incorrect and mess things up, but generally, it's not a problem.
Forums such as this are best suited for single, simple, well-focused questions. Long, complicated posts and posts requesting others to do research [webmasterworld.com] will scare them off. Also, the script in question is not widely discussed, because to do so makes it easier to defeat. That concludes my 20 minutes for this morning...
Jim
I've seen the IndexIgnore directive below in some .htaccess files on this subject and not others. Do I need this directive for the bad_bot.pl if I don't use the FrontPage server extensions?
IndexIgnore .htaccess */.?* *~ *# */HEADER* */README* */_vti*
I have done some condensing and this is what I have so far in my "not-yet-deployed" .htaccess file minus the bad bot rewrites because the list is so long:
-------------------------------------------
# Begin .htaccess: bad_bot.pl will write SetEnvIf "getout" here
# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.shtml|/robots\.txt|/file_instead_of_what_they_want\.htm)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
deny from 218.160.
deny from 218.161.
deny from 218.162.
deny from 218.163.
deny from 218.164.
deny from 218.165.
**SNIP** extremely long IP deny list from current .htaccess
</Files>
<IfModule mod_rewrite.c>
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
#Changed bad_bot.pl Redirect's to Internal Rewrites
RewriteRule ^fake_file\.html$ /cgi-bin/bad_bot.pl [L]
RewriteRule ^directory1/fake_file\.html$ /cgi-bin/bad_bot.pl [L]
RewriteRule ^directory2/fake_file\.html$ /cgi-bin/bad_bot.pl [L]
RewriteRule ^directory3/fake_file\.html$ /cgi-bin/bad_bot.pl [L]
#Limit HTTP methods - block access to .htaccess
RewriteCond %{REQUEST_METHOD} ^(PUT|DELETE|CONNECT)$ [OR]
RewriteCond %{REQUEST_URI} ^\.ht
RewriteRule .* - [F]
###?PLACE THE BAD BOT LIST (AND OTHER BAD STUFF) REWRITES HERE?###
#No .www in domain name
RewriteCond %{HTTP_HOST} ^www\.domain\.net$ [NC]
RewriteRule ^(.*)$ [domain.net...] [R=301,L]
#Force trailing / on all directories
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ [domain.net...] [L,R=301]
#Force https on 'directory1' and 'directory2' directories
RewriteCond %{SERVER_PORT} 80
RewriteCond %{REQUEST_URI} directory1 [OR]
RewriteCond %{REQUEST_URI} directory2
RewriteRule ^(.*)$ [domain.net...] [R,L]
#Block access of graphics files from outside domain
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?domain\.net [NC]
RewriteCond %{HTTP_REFERER} !^https://(www\.)?domain\.net [NC]
RewriteRule .*\.(jpeg|jpg|gif|png|bmp|js)$ - [NC,F]
</IfModule>
#Increase PHP upload file size
<IfModule mod_php4.c>
php_value upload_max_filesize 10M
</IfModule>
#END HTACCESS
-------------------------------------------
Questions:
1. Where will the rewrites for all the stuff I want to block go? (ie, the {REQUEST_URI}, {HTTP_REFERER}, {HTTP_USER_AGENT}, and {REMOTE_ADDR} list I have that either sends to a 403.shtml or the /cgi-bin/bad_bot.pl script). My educated 'guess' is after "#Limit HTTP methods - block access to .htaccess" rewrite rule.
2. The files directive above is different from the one in my current .htaccess file. Does the files directive below allow access to all ONLY for 403.shtml, whereas the one above denies all files with the exception of 'allowsome' (which is basically the same thing)? I've spent so much time on getting the rewrite rules correct that I am now completely confused by the files directive. In my original post above I used two different versions of the files directive from two different versions of the bad_bot.pl script, but I think I got that one straightened out and narrowed down! I don't want to block any innocent people who are eager to put money in my pocket.
#In current .htaccess file
<Files 403.shtml>
order allow,deny
allow from all
deny from 211.161.
deny from 211.162.
**snip**
</files>
I won't post my 'bad bot and other bad stuff list' until what I have done so far is squared away. I don't have root access to my server, so I'm counting on it being "configured to process the Apache modules in a sensible order," as Jim stated above. I am not asking anyone to do my research; it's done. I just need validation that what I have compiled will work "out of the box" on Apache 1.x. As I stated, researching this forum can be more confusing than the results. As long as I get the syntax correct, my server appears to process what I put in the .htaccess file correctly.
Questions:
1. Where will the rewrites for all the stuff I want to block go? (ie, the {REQUEST_URI}, {HTTP_REFERER}, {HTTP_USER_AGENT}, and {REMOTE_ADDR} list I have that either sends to a 403.shtml or the /cgi-bin/bad_bot.pl script). My educated 'guess' is after "#Limit HTTP methods - block access to .htaccess" rewrite rule.
If the end result is blocking -- via a 403 response, then it really doesn't matter what order you put them in.
A good way to order the blocked accesses is to put them in most-frequently-blocked order first. For lack of any better determinant, this simply 'gets rid of' the worst abusers as fast as possible, without wasting time running the subsequent rules.
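As a rough way to find that order, the User-Agent column of a combined-format access log can be tallied from the shell. This is only a sketch; "access_log" is an assumed file name, and the field position assumes the standard combined log format:

```shell
# Count hits per User-Agent (field 6 when a combined-format log line is
# split on double quotes), busiest first; the top entries are candidates
# for the earliest blocking rules.
awk -F'"' '{print $6}' access_log | sort | uniq -c | sort -rn | head -20
```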
When you add the 'ban list' stuff in, then that complicates things slightly -- You don't want to dismiss a bad user-agent with a 'block' if it is accessing a protected file and deserves a ban, for example. In most cases, this is taken care of by the fact that mod_access directives are processed before mod_rewrite directives, regardless of their position in your .htaccess file, so previously-banned visitors won't even run your mod_rewrite code. But it's the initial request handling (before the malicious visitor is banned) where attention to rule order is required.
2. The files directive above is different from the one in my current .htaccess file. Does the files directive below allow access to all ONLY for 403.shtml, whereas the one above denies all files with the exception of 'allowsome' (which is basically the same thing)? I've spent so much time on getting the rewrite rules correct that I am now completely confused by the files directive. In my original post above I used two different versions of the files directive from two different versions of the bad_bot.pl script, but I think I got that one straightened out and narrowed down! I don't want to block any innocent people who are eager to put money in my pocket.
#In current .htaccess file
<Files 403.shtml>
order allow,deny
allow from all
deny from 211.161.
deny from 211.162.
**snip**
</files>
A <Files> container simply makes the execution of the code it contains conditional. In the example directly above, the code will not be executed if the filename is not 403.shtml. If the filename *is* 403.shtml, then access will be denied if the requesting IP address starts with 211.161 or 211.162. So the code is backwards from what you want.
However, you should never deny access to your 403 Error page, because the result of this denial will be another 403 error response, which will attempt to access the 403 error page again, leading to another 403 response, ad infinitum, until either the client or the server reaches its maximum redirection limit. Basically, you should never deny access to your robots.txt and your 403 error handler under any circumstances. If you do, your site will not be compliant with HTTP protocol requirements. That is the purpose of the line of code that sets the "allowsome" environment variable.
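To make that concrete, here is a minimal sketch (the file names are assumptions; substitute your own error page):

```apache
# Assumed 403 handler; this page must never itself be denied
ErrorDocument 403 /403.shtml

# Keep the error page and robots.txt reachable even for banned clients,
# so a denial can never loop back into another denial
SetEnvIf Request_URI "^(/403\.shtml|/robots\.txt)$" allowsome
<Files *>
    Order Deny,Allow
    Deny from env=getout
    Allow from env=allowsome
</Files>
```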
Taking this code one section at a time, and in no particular order, here are some comments:
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
deny from 218.160.
deny from 218.161.
deny from 218.162.
deny from 218.163.
deny from 218.164.
deny from 218.165.
**SNIP** extremely long IP deny list from current .htaccess
</Files>
SetEnvIf Remote_Addr ^218\.16[0-5]\. getout
Deny from env=getout
Be aware that you may use one and only one "Order" directive in your .htaccess file. Only the last one found will apply.
I strongly suggest that you use the capitalization conventions shown in the Apache module documentation. Case should not matter, but on rare occasions, I have seen servers where it did -- probably due to a fundamental OS misconfiguration. But I recommend making your code as robust as possible, since such an OS config problem might be out of your control on shared hosting, for example.
<IfModule mod_rewrite.c>
#Limit HTTP methods - block access to .htaccess
RewriteCond %{REQUEST_METHOD} ^(PUT|DELETE|CONNECT)$ [OR]
RewriteCond %{REQUEST_URI} ^\.ht
RewriteRule .* - [F]
The second condition should also match .htaccess files in subdirectories (REQUEST_URI always begins with a slash):
RewriteCond %{REQUEST_URI} ^/([^/]+/)*\.ht
#Force trailing / on all directories
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ http://example.com/$1/ [L,R=301]
#Force trailing / on all directories
RewriteCond %{REQUEST_URI} !/$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule (.*) http://example.com/$1/ [L,R=301]
#Force https on 'directory1' and 'directory2' directories
RewriteCond %{SERVER_PORT} 80
RewriteCond %{REQUEST_URI} directory1 [OR]
RewriteCond %{REQUEST_URI} directory2
RewriteRule ^(.*)$ https://example.com/$1 [R,L]
#Force https on 'directory1' and 'directory2' directories
RewriteCond %{SERVER_PORT} =80
RewriteCond %{REQUEST_URI} directory1 [OR]
RewriteCond %{REQUEST_URI} directory2
RewriteRule (.*) https://example.com/$1 [R=301,L]
Here, we can simplify the regex again by eliminating unnecessary/redundant tokens and patterns.
#Block access of graphics files from outside domain
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?example\.com [NC]
RewriteCond %{HTTP_REFERER} !^https://(www\.)?example\.com [NC]
RewriteRule \.(jpe?g|gif|png|bmp|js)$ - [NC,F]
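Those referer conditions can be spot-checked from a shell as well (a sketch; example.com stands in for the real domain, and grep -E only approximates the mod_rewrite regex dialect):

```shell
# A referer counts as "own-site" if either pattern matches; any other
# non-blank referer will have its image request answered with a 403
is_own_site() {
  echo "$1" | grep -Eq '^http://(.+\.)?example\.com' ||
    echo "$1" | grep -Eq '^https://(www\.)?example\.com'
}

is_own_site "http://www.example.com/page.html" && echo "allowed"  # prints "allowed"
is_own_site "http://scraper.example.org/" || echo "blocked"       # prints "blocked"
```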
#Increase PHP upload file size
<IfModule mod_php4.c>
php_value upload_max_filesize 10M
</IfModule>
---
In general, the following approach to ordering works well for me, with "block" meaning a 403 response and "ban" meaning a call to the bad-bot script.
In conclusion, you'll need to step back from this code and interpret it on the block level to determine whether you're happy with the ordering of the mod_rewrite rules. If you put all the bad-bot rewrites at the top, then you may end up calling bad_bot.pl and recording the individual IP addresses of some very common "dumb" scraper User-agents that never change their User-agent name. Doing that instead of simply blocking them by User-agent would be a waste of space in your blocked IP list, and would cause it to grow too fast. So, block the obvious scraper UAs first, before running the bad-bot ban code.
I hope this was useful. We don't encourage code review-type threads here because it takes far too long to review code at both the line-by-line level and at the "holistic" level, so such threads don't comport well with the limited scope of typical forum interactions. But maybe this thread will benefit others in the future and make the time invested worthwhile.
If any broken pipe "¦" characters remain in the code above, replace them with solid pipe "|" characters before use; posting on this forum modifies the pipe character.
Jim