blocking robots using apache
is there something sophisticated to do that?
flex55
msg:3227373
 9:26 am on Jan 22, 2007 (gmt 0)

Hi All,

I've recently had a huge crawl by a number of spambots on my sites and need to start blocking them. I wanted to consult with you about it.
A few years ago, on a Java-based project I was involved in, we solved this by monitoring the number of requests per minute for certain pages. For every unfriendly user agent / IP, if the number of requests per minute met a certain threshold, we assumed the host was a hostile bot, and the host was blocked with a CAPTCHA page; it was unblocked only when it passed the CAPTCHA test.
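
For illustration only (this module is not mentioned in the thread): the request-rate-threshold part of that mechanism is roughly what the third-party mod_evasive Apache module provides, although it answers with a plain 403 rather than a CAPTCHA page. A minimal sketch, assuming the module is installed (the <IfModule> name varies with the Apache version) and using made-up thresholds:

<IfModule mod_evasive20.c>
# Flag a client that requests the same page more than 10 times,
# or more than 100 pages site-wide, within a 1-second interval
DOSPageCount 10
DOSPageInterval 1
DOSSiteCount 100
DOSSiteInterval 1
# Keep returning 403 to the offending IP for 10 minutes
DOSBlockingPeriod 600
</IfModule>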

Now I'm working on a PHP platform, and I wouldn't want to go through the hassle of re-developing the entire mechanism in PHP. Besides, since it's been a few years, I thought that something like this must already exist :-)

I wanted to ask if anyone knows of an Apache module / script that does something similar for a site -
i.e. identifies hostile bots and presents them with CAPTCHA tests or otherwise blocks them.

many thanks!

 

easygoin
msg:3227413
 10:23 am on Jan 22, 2007 (gmt 0)

Hiya - have a look here, it's been done already (there is a Perl version also somewhere in here...) [webmasterworld.com...] as well as [kloth.net...]

Hope this helps. I have the Perl version and it works "lovely jubbly", but I would like to use the PHP one - just my preference, so I will be doing the same soon.

PS - use the search feature to find most things in the forums; you can pretty much assume it has been covered before by these knowledgeable peeps!

Dimitri

jdMorgan
msg:3227607
 3:22 pm on Jan 22, 2007 (gmt 0)

Dimitri,

The Perl and PHP scripts you refer to are two completely different scripts that do different things. Both block bad 'bots, but in different ways, so you can (and possibly should) use both. One does not replace the other; they use entirely different methods to detect site abuse.

Jim

easygoin
msg:3229226
 8:45 pm on Jan 23, 2007 (gmt 0)

Sorry Jim, you're right; I meant that they both essentially block the bad botties, but apart from that they do indeed use different methods. Thanks again for your help on this forum, it is invaluable.

So that I don't start a new thread for what is just a quick check-up request, I hope you don't mind, peeps, but I would like to ask if you can see any syntax/code/regex errors below, as the user-agent name blocks don't seem to work now. I used www.wannabrowser.com to pretend to be one of the UAs I am trying to block, and I got a lovely page of my HTML output :( and am very disappointed. Can anyone help? Thanks.


SetEnvIf Remote_Addr ^213\.105\.224\.15$ ban
# Rules above are written by the robots trap
# and must be at the very top, here

SetEnvIf User-Agent (.){150} ban
SetEnvIf User-Agent ^$ ban
SetEnvIf User-Agent ^([A-Z]+)$ ban
SetEnvIfNoCase Request_URI (.){150} ban
SetEnvIfNoCase Request_URI \.ht(access|passwd)$ ban
SetEnvIfNoCase Request_URI ^/[a-z]/winnt ban
SetEnvIfNoCase Request_URI ^/_mem_bin ban
SetEnvIfNoCase Request_URI ^/_vti_bin ban
SetEnvIfNoCase Request_URI ^/default\.ida ban
SetEnvIfNoCase Request_URI ^/exchange ban
SetEnvIfNoCase Request_URI ^/msadc ban
SetEnvIfNoCase Request_URI ^/msoffice ban
SetEnvIfNoCase Request_URI ^/null\. ban
SetEnvIfNoCase Request_URI ^/script ban
SetEnvIfNoCase Request_URI formmail ban

# The domain-extension denies require rDNS lookups, so they are quite
# resource-intensive; only use them if you have to
# and limit them to the bare essentials - thanks jdMorgan again.
# Best to restrict the <Files> limit to just pages and not
# css or java / scripts etc., but that's tricky to do with SEFs,
# so just a general files rule
# <Files *> hmmm

<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
# deny from .cn
# deny from .cz
# deny from .ng
# deny from .ph
# deny from .ru
# My manually entered IP blocks below
# see CIDR specification at http://logi.cc/nw/NetCalc.php3
# also country ranges http://ip.ludost.net/
# Deny from ASIA inc China?
deny from 61.
deny from 202.
deny from 203.
deny from 210.
deny from 220.
# Deny from Nigeria?
deny from 62.56.128.0/17
deny from 83.128.0.0/9
# findlinks bot
deny from 139.18.2.0/24
# Panscient SE bot
deny from 38.99.203.0/24
</Files>

<Files .htaccess>
order deny,allow
deny from all
</Files>

<Files ~ "^robots\.txt$¦^favicon\.ico$">
order allow,deny
allow from all
</Files>

<Files google_sitemap.xml>
ForceType application/x-httpd-php
</Files>

# http://wiki.cmsmadesimple.org/index.php/FAQ/Installation/Pretty_URLs

RewriteEngine on
Options +SymlinksIfOwnerMatch
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^(Baiduspider|Bigsearch|Convera|Download|email|Exa|Express\ WebPictures|Extractor|findlinks|(MS.?)?FrontPage|Get(Right|Smart|Web)|Gigabot|Go!Zilla|Grabber|Guestbook|HTTrack|Image\ Stripper|Image\ (Strip|Suck)|InternetSeer|Leech|MetaProducts\ Download\ Express|miniBot|Mozilla.*NEWT|Mozilla.*Indy|Mozilla.*DnloadMage|Mozilla.*WebCapture|Mozilla.*DreamPassport|Mozilla.*DnloadMage|Mozilla.*AspTear|MSFrontPage|(Microsoft\ Scheduled\ Cache\ Content\ Download\ Service)|Microsoft.URL|MIDown\ tool|MSIECrawler|msnbot-media|Net.?(Ants|ResearchServer|Spider|Vampire|Zip)|Oe\ Pro|Offline\ (Explorer|Navigator)|OmniExplorer_Bot|Seekbot|Sensis\ Web\ Crawler|ShopWiki|Siphon|sitecheck.internetseer.com|Speedy\ Spider|SQ\ Webscanner|Sucker|Surfbot|voyager|Website) [NC,OR]
# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2]|6\.0)(\)|;\ [^)]) [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://www\.*stuff.*\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org [NC,OR]
# Block random-letter. non-Mozilla user-agents
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
# 15 or more chars with no "/.{};" characters
RewriteCond %{HTTP_USER_AGENT} ^[a-z0-9\ ]{15,}$ [NC]
# no vowels after 5 characters
RewriteCond %{HTTP_USER_AGENT} [b-df-hj-np-tvwxz]{5,} [NC]
RewriteRule .* - [F]

# BLOCK blank Referer -AND- UA (except for HEAD and favicon requests)
# thanks to jdMorgan on webmasterworld.com
# RewriteCond %{REQUEST_METHOD} !^HEAD$
# RewriteCond %{HTTP_REFERER}<>%{HTTP_USER_AGENT} ^<>$
# RewriteRule !\.ico$ - [F]

# Remove multiple contiguous slashes in URL (up to three instances)
#RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
#RewriteRule . - [E=qRed:yes,E=myURI:%1/%2,C]
#RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
#RewriteRule . - [E=myURI:%1/%2,C]
#RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
#RewriteRule . - [E=myURI:%1/%2]

# Safer Anti HOTLINKING code
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !mysite\.net [NC]
RewriteCond %{HTTP_REFERER} !mysite\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !mysite\.eu [NC]
RewriteCond %{HTTP_REFERER} !search\?q=cache [NC]
RewriteRule ^([^.]+\.(jpg|gif|png|bmp))$ /htaccess/showpic.php?pic=$1 [NC,L]

#redirect 301 /nanou http://www.domain.co.uk/newsletter
#redirect 301 /nanouska/default.css http://www.domain.net/templates/another.css

# <IfModule mod_php4.c>
# php_value auto_prepend_file "/home/mysite/public_html/htaccess/runawaycrawlers.php"
# </IfModule>

jdMorgan
msg:3229270
 9:42 pm on Jan 23, 2007 (gmt 0)

You may use only one Order [httpd.apache.org] directive in your .htaccess file, and "Order" has nothing to do with the sequence of the Allow and Deny directives which follow it; rather, it controls their priority. This is the likely cause of your "Deny from env=ban" failure.
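
To make the priority point concrete, here is a minimal sketch (not taken from the thread): with "Order Allow,Deny", a matching Deny always wins, no matter where it sits in the section:

Order Allow,Deny
Allow from all
Deny from env=ban
# A request with the "ban" variable set is still denied, even though
# "Allow from all" matches and appears first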

Also, make sure you flush your browser cache before testing newly-uploaded code. A "forced reload" is not sufficient; you must flush the cache (delete Temporary Internet Files if you use MSIE).

Do not use the mod_rewrite code which refers to "E=myURI" without also using the final rule in the thread where you found it; this code cannot be used "stand-alone."

I haven't the time to take more than a cursory look at this code, so I'd like to ask that you test the code yourself and ask specific questions about coding technique, or about specific problems with specific small snippets; we cannot support code reviews here.

Jim

easygoin
msg:3229890
 9:26 am on Jan 24, 2007 (gmt 0)

Thanks Jim, and I am sorry about the code, but I couldn't see any other way of finding where the error might be... I will post direct "snippet" questions instead of whole reviews ;)

As regards your comments on the Order directive - in what way could I "combine" all these <Files> order/deny/allow rules so as to specify those rules for each of those files - some files denied and some files allowed? I have looked around for the info and checked the Apache manual (of course), but I can't see any way of merging them together.

Is this correct (IPs replaced to protect the not-so-innocent)?
---------------
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
# deny from .cn
# deny from .cz
# deny from .ng
# deny from .ph
# deny from .ru
# My manually entered IP blocks below
# see CIDR specification at [logi.cc...]
# also country ranges [ip.ludost.net...]
# findlinks bot
deny from #*$!.xx.x.0/24
# Panscient SE bot
deny from xx.xx.#*$!.0/24
# Deny from ASIA inc China
deny from 61.
deny from 202.
deny from 203.
deny from 210.
deny from 220.
# Deny from Nigeria
deny from 62.56.128.0/17
deny from 83.128.0.0/9
# Mine
deny from xx.xx.xx.x
</Files>

<Files .htaccess>
deny from all
</Files>

<Files ~ "^robots\.txt$¦^favicon\.ico$">
allow from all
</Files>
-----------------------------
Can you kindly start me off on the right structure if this is wrong, or if it can be condensed in some way?

easygoin
msg:3230008
 11:39 am on Jan 24, 2007 (gmt 0)

lol, update if interested - it turns out it was the following that caused the problem. Not sure why, but by trying each "directive" and stripping out the other stuff one by one, I figured out it was the code below that seems to break "it".

# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2]|6\.0)(\)|;\ [^)]) [NC,OR]
# Block random-letter. non-Mozilla user-agents
RewriteCond %{HTTP_USER_AGENT} !^Mozilla [NC]
# 15 or more chars with no "/.{};" characters
RewriteCond %{HTTP_USER_AGENT} ^[a-z0-9\ ]{15,}$ [NC]
# no vowels after 5 characters
RewriteCond %{HTTP_USER_AGENT} [b-df-hj-np-tvwxz]{5,} [NC]
RewriteRule .* - [F]

jdMorgan
msg:3230232
 3:23 pm on Jan 24, 2007 (gmt 0)

On the "Order" problem:

Since you already stripped out the redundant "Order" directives, that part should now work.

Regarding the RewriteRule blocks that you found to be causing a problem, the trouble is that you over-condensed them, and in doing so disabled the spoofed and random user-agent denials, because you improperly mixed [OR]ed RewriteConds with ANDed RewriteConds. That code will likely work much better like this:

# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT [NC]
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2]|6\.0)(\)|;\ [^)]) [NC]
RewriteRule .* - [F]
#
# Block random-letter. non-Mozilla user-agents
RewriteCond %{HTTP_USER_AGENT} !^Mozilla [NC]
# 15 or more chars with no "/.{};" characters
RewriteCond %{HTTP_USER_AGENT} ^[a-z0-9\ ]{15,}$ [NC]
# no vowels after 5 characters
RewriteCond %{HTTP_USER_AGENT} [b-df-hj-np-tvwxz]{5,} [NC]
RewriteRule .* - [F]

Note that posting on this forum converts solid pipe characters into broken pipe "¦" characters, so replace any broken pipes with solid pipes before using code copied from here.

You cannot simply take snippets of mod_rewrite code and add [OR] flags to combine them; you must first understand the AND/OR logic of the RewriteConds and preserve it. For example, the first rule says, "If the UA contains "Windows NT" (in any uppercase/lowercase variation) AND if it does not contain exactly "Windows NT" followed by a valid version number, then deny access." RewriteConds lacking [OR] are ANDed -- both must be true for the rule to be invoked.
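
As a small added illustration (the BadBot names below are placeholders, not taken from the thread): conditions without [OR] must all match before the rule fires, while [OR] makes a single match sufficient:

# ANDed: deny only if the UA contains "Windows NT" AND lacks a valid version
RewriteCond %{HTTP_USER_AGENT} Windows\ NT [NC]
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2]|6\.0) [NC]
RewriteRule .* - [F]
#
# ORed: deny if the UA matches either placeholder pattern on its own
RewriteCond %{HTTP_USER_AGENT} BadBotOne [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BadBotTwo [NC]
RewriteRule .* - [F]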

Jim

easygoin
msg:3230288
 4:17 pm on Jan 24, 2007 (gmt 0)

Thanks Jim. I had tried it without the [NC,OR] on the Windows NT rule, as you originally posted it, but I must have tried it with the "no case" and "or" flags "just in case" lol, and left them in there when I pulled some more hair out and then posted the code :) Also, I remembered to replace the broken pipes with solid ones, thanks ;)

Another thing I just noticed as well - there doesn't seem to be a space before the !Windows\ NT.... I am sure there was one in the .htaccess file... ho hum, will check and try again.. thanks

It works, woo-hoo (happy chappie), thanks! It didn't work the first time... scratch the head, oh yes, broken bl**dy pipes!

One other thing lol - where we have the <Files> allow,deny, I have been trying to get "allow all" to override the deny for the robots.txt and *.ico files, but I'm not sure how, if I have them in this sequence:

<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
....
</Files>

further down this....

<Files ~ "^robots\.txt$¦^favicon\.ico$¦^403\.shtml$">
allow from all
</Files>

but when I try to call the robots.txt file using a blank UA or a blocked UA, it gives me the 403 fingers-up. Can you advise?

jdMorgan
msg:3230595
 8:19 pm on Jan 24, 2007 (gmt 0)

Firstly, this line is meaningless, since it accepts *all* files:

<Files ~ "^.*$">

so you can get rid of it, and its closing tag as well.

As for the rest, I'd suggest:

Order Deny,Allow
#
<FilesMatch "^(robots\.txt¦favicon\.ico)$">
Allow from all
</FilesMatch>
#
<FilesMatch "^\.(htaccess¦htpasswd)$">
Deny from all
</FilesMatch>
#
Deny from env=ban
Deny from 61.
Deny from 202.
Deny from 203.
Deny from 210.
Deny from 220.
# Deny from Nigeria?
Deny from 62.56.128.0/17
Deny from 83.128.0.0/9
# findlinks bot
Deny from 139.18.2.0/24
# Panscient SE bot
Deny from 38.99.203.0/24

Quoting the Apache documentation for "Order Deny,Allow":
"The Deny directives are evaluated before the Allow directives. Access is allowed by default. Any client which does not match a Deny directive or does match an Allow directive will be allowed access to the server."
(Emphasis added.)

The space between "}" and "!" is deleted by the forum software, as are multiple consecutive "!" characters. This is done to prevent wasteful posting habits, but is one of the things we have to deal with when posting code here. Either use multiple spaces, or use the [ smilestopper ] BBcode ahead of the exclamation point to prevent this.

Jim

easygoin
msg:3231213
 9:37 am on Jan 25, 2007 (gmt 0)

Thanks for that pointer and the code :) I now see how the order of the directives works; it didn't make sense to me before. I thought/assumed that "Order Deny,Allow" HAD to be in that sequence... like deny from xx.xx.xx then allow from xx.xx.x etc. Over-simplistic reasoning.

One other thing: the rewrite rules below that still block access for some of those bots - including now the blank-UA-and-referer rule, the NT-with-strange-characters rule, etc. - so I bypassed that by using this rule below in place of the catch-all [F] rule, to "make sure" those bots can still read the robots.txt file (and not stay trapped if they decide to behave). My question is, do you think this is wise or necessary:

# RewriteRule .* - [F]
RewriteRule !robots\.txt$ - [F]

jdMorgan
msg:3231686
 5:09 pm on Jan 25, 2007 (gmt 0)

You can do that, or simply add a rule above all or most of your RewriteRules to bypass them completely for robots.txt requests:

RewriteRule ^robots\.txt$ - [L]

This simply stops all further mod_rewrite rule processing if the request is for robots.txt.

This is useful if you have many rules using [F], since you won't have to code the robots.txt exclusion into all of them.
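
A short placement sketch (the blocking condition below is just a placeholder): the pass-through rule must appear before the [F] rules it is meant to exempt:

# Let robots.txt through before any blocking rules run
RewriteRule ^robots\.txt$ - [L]
#
# Blocking rules follow; robots.txt requests never reach them
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]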

Jim
