
Proper syntax for banning bad bots in htaccess

using mod_rewrite and htaccess commands correctly for banning bad bots

     
12:22 am on Jun 11, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:May 31, 2002
posts:157
votes: 0


Yello,

In my never-ending quest to banish all evil bots that dare to traverse my site for their own diabolical ends, I think I went overboard in the way I have my htaccess file set up. If anyone knowledgeable with htaccess commands could give me a hand, I'd appreciate it.

Here's the issue: using some of the suggestions made in this forum, I set up the following in my htaccess:

SetEnvIf User-Agent ^$ keep_out
<Files ~ "(\.html¦\.jpg¦\.gif¦\.php¦\.inc¦\.txt $">
order allow,deny
allow from all
deny from env=keep_out
</Files>

This prevents bots that do not have a User-Agent identity from grabbing the above-named files. I also have the following directives:

SetEnvIf Remote_Addr blah blah blah
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>

and

<Limit GET POST >
order allow,deny
allow from all
deny from 123.123.123.3
</Limit>

The first is manual; the last two are written into htaccess by a couple of scripts I run which automate the banning process. What I noticed is that they tend to cancel each other out because of the way it's set up. Is there a way to merge all three command groups so they're not giving redundant commands? Note I have mod_rewrite installed, so I could use Rewrite commands if necessary.

I just need to find a way to consolidate these commands if possible...

1:51 am on June 11, 2002 (gmt 0)

Senior Member

jdmorgan - WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Bluestreak,

You can have multiple test conditions "share" a single environment variable, and then use that single environment variable in the "Deny from env=" directive...

# block a blank UA
SetEnvIf User-Agent ^$ keep_out
# block Indy Library spambot variants
SetEnvIf User-Agent Indy.Library$ keep_out
# block IP addresses 1.2.3.xx
SetEnvIf Remote_Addr ^1\.2\.3\..* keep_out
# block all subdomains of the remote host badguy.com
SetEnvIf Remote_Host \.badguy\.com$ keep_out
# block MS IIS virus file requests
SetEnvIf Request_URI cmd\.exe$ keep_out
# block a user named "toto"
SetEnvIf Remote_User ^toto$ keep_out
# block referrals from a link farm
SetEnvIf Referer ^www\.BadNeighbor\.com keep_out
.
.
.
<Limit GET POST >
order allow,deny
allow from all
deny from env=keep_out
</Limit>

I haven't tried to preserve all of the specifics of your current set-up, just trying to illustrate the "sharing" concept.

P.S. Be careful with those hats and dollars...
^x$ - match exactly "x"
x$ - match anything ending in "x" -- equivalent to ^.*x$
^x - match anything starting with "x" -- equivalent to ^x.*$
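
For example, using "Wget" purely to illustrate the anchors (not a recommendation to block it):

# matches only a User-Agent that is exactly "Wget" - "Wget/1.5.3" slips through
SetEnvIf User-Agent ^Wget$ keep_out
# matches any User-Agent starting with "Wget", so "Wget/1.5.3" is caught
SetEnvIf User-Agent ^Wget keep_out
# matches any User-Agent ending in "Wget", e.g. "GNU Wget"
SetEnvIf User-Agent Wget$ keep_out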

Hope this helps,
Jim

3:46 am on June 11, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:May 31, 2002
posts:157
votes: 0


Thanks for the response. I think I'm gaining a better understanding of what to do, but I'd like to confirm one thing:

Would I be able to add an additional "Deny" line like this:

SetEnvIf Remote_Addr 123.123.123.1 ban
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
deny from 456.456.456.2
</Files>

Would I be able to do it this way? Seems a lot simpler.....

5:29 am on June 11, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:May 31, 2002
posts:157
votes: 0


OK, after some testing I've determined I CAN add the additional deny line. I knew you could have multiple deny lines for IP addresses, but I didn't know if you could do the same for a defined variable. Now I know :D

I have one more question, this one should be easy:

Does the IndexIgnore * directive HAVE to be at the top of the htaccess file? The way the scripts are set up, they write each new IP address in above it, so the htaccess ends up like this:

SetEnvIf Remote_Addr 123.123.123.1 ban

IndexIgnore *

<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>

***********************

Is that OK? Does placement matter for the IndexIgnore directive?

Thanks for the help!

5:53 am on June 11, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:May 31, 2002
posts:157
votes: 0


OK, one more question about using REWRITE, and then I'm done, I swear :D

Can I use mod_rewrite to block a nameless user agent from accessing certain types of files, like I showed before, using REWRITE instead of the typical <Files>...</Files> container? It would look like this:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.*$ yourbanned.html [L]

What I want to do is make exceptions for certain files, since some nameless agents are actually Java agents looking for js files, etc. The above just gives a blanket ban.
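
Would adding a couple of conditions like this do it? Just a guess on my part (untested); the .js extension and yourbanned.html are placeholders for whatever files I actually end up excluding:

RewriteEngine on
# only requests with a blank User-Agent
RewriteCond %{HTTP_USER_AGENT} ^$
# skip .js requests (the java agents) and the ban page itself, so it can't loop
RewriteCond %{REQUEST_URI} !\.js$
RewriteCond %{REQUEST_URI} !^/yourbanned\.html$
RewriteRule ^.*$ /yourbanned.html [L]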

Thanks for the help!

2:29 am on June 12, 2002 (gmt 0)

Senior Member

jdmorgan - WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


BlueStreak,

Sorry for the delay - I couldn't find this thread again, for some reason.

The Apache mod_access documentation for the Allow and Deny directives specifically mentions them in the plural, so yes, I think you can add "deny from" lines at will.

I don't know about "IndexIgnore" - haven't seen or used that one!

As to banning blank User-Agents with mod_rewrite, yes, you can do that, but be careful. Some search engines "sneak into" your site using different user agents, and possibly even blank user agents, to see if you are cloaking. If you block too indiscriminately, your site will look like it's cloaked, and you could get dropped from the search engines!

I ban "foreign" Referers, not specific or blank User-Agents, from copying or including my gifs and jpegs.

This is what it looks like:

ErrorDocument 403 /403.html
# Block image inclusion outside our domain except Google, AltaVista, Gigablast translators and caches
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.quuxcorp\.org
RewriteCond %{HTTP_REFERER} !^http://216\.239\.(3[2-9]|[4-5][0-9]|6[0-3])\..*www\.quuxcorp\.org/
RewriteCond %{HTTP_REFERER} !^http://babel.altavista.com/.*www\.quuxcorp\.org/
RewriteCond %{HTTP_REFERER} !^http://216\.243\.113\.1/cgi/
RewriteRule \.(jpg|jpeg?|gif)$ - [F,L]

The first RewriteCondition specifically allows a blank referer, since a lot of browsers won't include a referer, either because they are old, or because the user typed in the URL or used a bookmark.
The second line allows my own domain, and the next three allow Google, AltaVista, and Gigablast to show my graphics in their translated or cached pages.
The RewriteRule says, "redirect requests for jpg, jpe, jpeg, or gif files to no URL (-), return a 403 Forbidden server code, and stop processing rewrite rules." The 403 then gets picked up by the ErrorDocument directive on the first line, which serves a custom 403 page if the user agent is a browser rather than a robot.

Hope this helps,
Jim

3:13 am on June 12, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:May 31, 2002
posts:157
votes: 0


Hi, thanks for the response!

I'm going to leave out the nameless user_agent rule, because some nameless agents are legitimate, and I haven't gotten nearly enough visits from unidentified user_agents to justify it.

IndexIgnore, FYI, hides your files from the directory listing a browser would otherwise see - pretty useful to stop people from snooping to see what files are in your directory.

I do have a banlist blocking user_agents with specific names relating to offline browsers like Teleport Pro. I've never heard of a search bot masquerading as something else to test for cloaking, though. I don't cloak at all, BTW.

Since my banlist is pretty much a short list of offline browsers and names of notorious spambots, I don't think I'll have a problem. I don't think a search bot would intentionally take on the name of a notorious spambot (like EmailSiphon) just to test for cloaking. Wouldn't be very wise :D

Thanks for the help!

2:42 pm on June 12, 2002 (gmt 0)

Senior Member

jdmorgan - WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Huh... (puzzled)

I use "Options -Indexes" in .htaccess to accomplish the directory-index suppression - I'll have to look up the method you're using.

I've never seen a legitimate SE robot use a spambot User-Agent, and I also doubt they'd want to try that! Also, to clarify, the agent using an alternate User-Agent may or may not be the SE robot itself - it is more often a different program, or even a human reviewer, checking that your site serves (basically) the same page to everyone, robot or not, to catch cloakers.

I can't help but wonder what our spambot blocker lists will look like in ten years (hundreds of entries... thousands?), and whether the servers will be able to get through the whole list before the user gets tired of waiting and aborts the page load!!! :o

Jim