A Close to perfect .htaccess ban list - Part 3

More tips and tricks for banning those pesky "problem bots!"

         

txbakers

7:38 pm on Oct 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Continued from A close to perfect .htaccess ban list - Part 2 [webmasterworld.com]

Whee - what a great discussion.

[edited by: Marcia at 11:23 pm (utc) on Oct. 13, 2003]

[edited by: jdMorgan at 12:24 am (utc) on Nov. 19, 2003]
[edit reason] Corrected URL [/edit]

pmkpmk

4:39 pm on Feb 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Geesh, this thread is STILL going on?

I was active in the previous threads, and I put a lot of time and resources into keeping my htaccess list growing and up to date. During this process I made a lot of errors; one of the most stupid ones I only discovered quite some time later: I had unintentionally locked out the Inktomi bot. Luckily for me, I have my content mirrored under a different domain name and Inktomi indexed that one. It still causes trouble, but that's a different story.

After quite a while, I found out that maintaining my htaccess list is a race I couldn't win. I'm always one step behind the bots, and my few expeditions into the "enemy's camp" (i.e. searching for spambot software on the net and trying it out) showed that the spambot programmers seem to read these threads as well. Folks, there are spambots out there you probably can't differentiate from legitimate surfers AT ALL!

I decided to give up. Maintaining that list took up so much time, and the errors were always at the expense of the legitimate surfer, so I simply stopped.

In the meantime I switched my CMS, and we changed ALL our email addresses in our company. I designed our new webpages so that only a few email addresses are visible at all. These are easy to change once they are "contaminated", and on top of that they are cloaked using CSS invisibility tricks and JavaScript.
All the other addresses were replaced by feedback forms, along with a short message explaining WHY we don't list addresses anymore and that our telephone operator is instructed to hand out mail addresses if someone asks for them.

The new pages have been online for a week. Already we are getting far more feedback via the feedback forms than we ever got legitimate emails before.

On top of that, we enhanced our mailserver with several spam-guards to filter out the top 95% of spam.

Tell you what? My life as a webmaster has become EASIER! I'm relaxed now, not browsing the logfiles for new spambots. Not trying to tweak the htaccess. Not worrying about errors and legitimate bots I kick out. The time I GAINED I can actually spend on getting more CONTENT onto my pages!

This works for me: I gave up a race I could never win. And by giving up, I got more than I expected. Maybe one or two of you will start to think about that race as well.

My thanks to all of you in these threads who helped me along the way, especially jdMorgan, who was extraordinarily helpful. Don't get me wrong. This IS a great thread, and I learned a lot in it. But I'm only speaking for myself, and for ME, giving it up was the better choice!

jehoshua

5:25 am on Mar 16, 2004 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi,

This sure is a long thread alright. In regards to placing banned agents in .htaccess, has anyone banned these?

httrack
RPT-HTTPClient/0.3-3
www.almaden.ibm.com/cs/crawler (IP 66.147.154.3)

Thanks,

Peter

Wizcrafts

5:47 am on Mar 16, 2004 (gmt 0)

10+ Year Member



jehoshua asked:
In regards to placing banned agents in .htaccess, has anyone banned these?

httrack
RPT-HTTPClient/0.3-3
www.almaden.ibm.com/cs/crawler (IP 66.147.154.3)

I do block httrack. I haven't yet seen RPT-HTTPClient, so I can't comment on it. I don't block the IBM crawler because I enjoy being listed in the IBM search engine index.

What I have found lately is that I am blocking a lot more intruders by their IP address, or even entire ISP IP blocks. I block most Chinese and Korean ISPs, along with ISPs from Nigeria, Brazil, the Philippines, and several other countries. I don't have any business with anyone from these parts of the world, and couldn't care less if one of their members gets 403'd when he tries to steal information from my website. I find that when I review my web logs and see suspicious GETs and POSTs, or odd user agents, a lot of them come from APNIC, LACNIC, and RIPE IP blocks. Look at your web logs for people using Green Research, Larbin, Mozilla 4.06 Win95 I, Indy Library, Mozilla/3.0 (compatible), Missigua Locator, Microsoft URL Control, HLoader, and other bots with names that include "Extractor, Viper, Collector, Siphon," etc. These are people you will probably wish to exclude from having any access to your files and folders. Be careful not to block legitimate search engine crawlers.

Wiz

jehoshua

11:59 pm on Mar 16, 2004 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi Wiz,

I do block httrack. I haven't yet seen RPT-HTTPClient, so I can't comment on it. I don't block the IBM crawler because I enjoy being listed in the IBM search engine index.

It seems that some sites list RPT-HTTPClient as 'naughty', whatever that means. So the IBM one seems okay. I didn't know they had a search engine; I thought the agent was just for their own internal 'research'.

What I have found lately is that I am blocking a lot more intruders by their IP address, or even entire ISP IP blocks.

How do you "block blocks"? I do a single block in .htaccess like this:

deny from 66.147.154.3

Look at your web logs for people using Green Research, Larbin, Mozilla 4.06 Win95 I, Indy Library, Mozilla/3.0 (compatible), Missigua Locator, Microsoft URL Control, HLoader, and other bots with names that include "Extractor, Viper, Collector, Siphon," etc. These are people you will probably wish to exclude from having any access to your files and folders. Be careful not to block legitimate search engine crawlers.

I use Mozilla myself, so I'd have to be careful about the strings. As you say, 'be careful not to block legitimate search engine crawlers', and I'd have to add 'be sure I don't block legit web users'. My Mozilla 'about' says:

Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.5) Gecko/20030916

and that's what the logs say.

Thanks for your help,

Peter

Wizcrafts

2:07 am on Mar 17, 2004 (gmt 0)

10+ Year Member



Peter asked Wiz:
How do you "block blocks"

Simple. Place a <Files *> section before your rewrite conditions. Deny access by IP address, IP range, or by friendly domain names. Here is an example from my own htaccess file (there is nobody innocent mentioned here; they all earned a spot in my block-list):


<Files *>
order deny,allow
deny from 61.4.64.0/20
deny from 63.148.99.224/27
deny from 65.118.41.192/27
deny from 210.192.96.0/17
deny from 217.78.
deny from 211.161.24.128/26
deny from 218.15.
deny from 218.64.
deny from 218.65.0.0/17
deny from 219.147.128.0/17
deny from 219.147.174.0/24
deny from netvigator.com
deny from mail.whitepine-ventures.com
deny from boxpaper.com
</Files>

All of the above denials are for blocks of IP addresses, or for entire ISPs. An IP range is indicated by adding a forward slash and a CIDR prefix length, like /17, which says how many leading bits of the address are fixed. Another method used here is to type only a partial IP, leaving off the last one or two octets, with a period terminator, e.g. 218.15., which belongs to Chinanet, home of the world's most prolific spammers. Netvigator is also a Chinese ISP in Hong Kong, and I have had a lot of run-ins with email harvesters and formmail exploiters coming from there. That's why I ban the entire ISP.
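To put numbers on the /17 example: the prefix length fixes that many leading bits, so these two lines (the second is just the same range written with an explicit netmask, a form Apache also accepts) both cover 218.65.0.0 through 218.65.127.255:

deny from 218.65.0.0/17
deny from 218.65.0.0/255.255.128.0

The partial-IP form works the other way around: deny from 218.15. simply matches every address that begins with 218.15.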

I see that I threw you off when I referred to Mozilla user agents. What I am listing are examples of actual phoney user agents, belonging to software used by spammers, harvesters, downloaders, and formmail exploiters. For example, take these innocuous-looking UAs:

Mozilla 3.0 (compatible), or Mozilla 4.06 (Win95; I)

At first glance one might wonder what is wrong with them to get them banned. To understand this, one needs to know how to differentiate a legitimate user agent from a phoney one. By exchanging information with other members of this forum, and by carefully reviewing my web logs, I have been able to clearly establish that these are bogus user agents, and that everybody who visits my website with those exact user agents has been up to no good. Both are automatically 403'd in my .htaccess files, for all websites that I manage. The former is usually looking for email addresses to harvest, while the latter is always trying to exploit Formmail scripts. If you review the entire topic about htaccess rules and search engine spider identification (deprecated) you will gain a better insight into where I am coming from.

My best advice is to read your web logs every day, looking for suspicious activity. Watch out for any visitor that is not a known search spider yet downloads only HTML files or scripts while avoiding images and ads. I also set up some hidden links that lead to bot traps, which I forbid access to in robots.txt. Only bad bots follow these hidden links, and they find themselves self-banned, on the spot.
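For anyone who hasn't built such a trap yet, a minimal sketch of the idea (the /trap/ path and the blank image are invented names for illustration, not Wiz's actual ones): a link no human will ever see or click, a robots.txt rule honest crawlers obey, and a watch on whatever requests the trap anyway.

# robots.txt -- well-behaved spiders will never request anything under /trap/
User-agent: *
Disallow: /trap/

<!-- hidden link somewhere in a page; no human visitor will click a 1x1 blank image -->
<a href="/trap/"><img src="/images/blank.gif" width="1" height="1" alt="" border="0"></a>

Anything that shows up in the logs requesting /trap/ has ignored robots.txt, so its IP can be added to the deny list by hand, or automatically by a script sitting at that address (BirdMan's PHP spider trap, mentioned a few posts further down, works on exactly this principle).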

End of htaccess 101, for now. Be aware that these are my opinions and personal choices, based on my experiences and preferences, and may not work as well in your situation, especially if you do business with people in China.

Wiz

neweb

10:38 pm on Mar 28, 2004 (gmt 0)

10+ Year Member



Pmkpmk,

I'm really interested in the solution that you found. You said that you "gave up", but I imagine that only came about because you found a CMS you had enough confidence in to feel it was protecting you well enough that you COULD give up?

You said:

"I decided to give up. Maintaining that list took up so much time, and the errors were always at the expense of the legitimate surfer, so I simply stopped.

In the meantime I switched my CMS, and we changed ALL our email addresses in our company. I designed our new webpages so that only a few email addresses are visible at all. These are easy to change once they are "contaminated", and on top of that they are cloaked using CSS invisibility tricks and JavaScript.

On top of that, we enhanced our mailserver with several spam-guards to filter out the top 95% of spam."

Being new to CMS I'm excited about the possibilities and traumatized at the same time ;o) My site has simply gotten too large to maintain by myself without a CMS, so I am in the process of moving content over to it now. I'm "shadowing" my content pages right now, and as soon as the content has been moved I just have to remove my .htm files and the CMS will take over. But I've been reading so much about the security issues that I just don't know what to do. I can't afford to compromise security for my convenience. I'm having second thoughts about it ...

Anyway, I'd love to hear which CMS you're going with now. It may save a few of us "newbies" from having to take the long, hard road.

Thanks!

Darla

isitreal

2:46 am on Mar 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm surprised that BirdMan's PHP spider trap script isn't getting the notice it deserves; it's clean, simple, and it looks like it really works.

[webmasterworld.com...]

One thing I've seen already is that all the spiders I've caught so far are using totally generic Navigator user agent strings: no spambot, sitegrabber, emailsucker or whatever, just normal strings.

But their behavior gives them away, which is the best way to catch them in the end.

Give it a try if you haven't done so already; to my mind, maintaining spider-blocking htaccess lists and reading over your log files is probably not one of the more productive ways you can spend your time.

jehoshua

5:27 am on Apr 1, 2004 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi Wiz,

Thanks very much for the tips on how to ban blocks of IP addresses; very informative.

I see that I threw you off when I referred to Mozilla user agents. What I am listing are examples of actual phoney user agents, belonging to software used by spammers, harvesters, downloaders, and formmail exploiters. For example, take ...<snip>

Okay, you must have done a lot of work, to find out which ones are phony and which are legit.

My best advice is to read your web logs every day, looking for suspicious activity. Watch out for any visitor that is not a known search spider yet downloads only HTML files or scripts while avoiding images and ads.

Do you recommend any good tools to analyse the web server logs? All I have is AWStats and Webalizer, but when I see an IP address whose number of hits equals its number of files, it is usually a spider.

I also set up some hidden links that lead to bot traps, which I forbid access to in robots.txt. Only bad bots follow these hidden links, and they find themselves self-banned, on the spot.

That sounds tricky. Recently I was trying to find out all the IP addresses that belong to a certain company, and it took quite a while; obviously there are 'whois' tools where you can enter a company name and get all the IPs.

Thanks for your help,

Peter

Wizcrafts

5:53 am on Apr 1, 2004 (gmt 0)

10+ Year Member



Peter;

I also include hidden links to a certain script, which is also forbidden in robots.txt. Again, no legitimate crawler has ever gone for that script, only harvesters looking for email addresses ... which they get in droves ;-) When I see that somebody has gone for that file I may add their IP to my Deny From list, just to be safe.

I use my eyeballs to analyze my access log. I search for 403, 404, 405, 410, POST, CONNECT, and a couple of file names I won't mention on this forum. After that I just look briefly at the nature of the various hits. Most fall into common patterns of normal activity. That makes it easy to spot hostile activity or user agents; they stick out like a sore thumb. And any visitor who GETs my hidden, prohibited link will draw my attention. I do an IP lookup of their source, then decide if and how to respond. My usual response is to block the entire ISP if it comes from a country I don't do any business with, or care to. I now block most Chinese ISPs because of the large number of email harvesters coming from Guangdong or some other province. I also provide a means of requesting removal from my banned list for any legitimate visitor who gets 403'd because they fell inside a banned IP block.
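One way to make that removal request reachable for someone who is already blocked (a sketch; the file name is made up) is to point the 403 error at a page carrying the instructions, and keep that one page readable even for denied IPs by placing a section like this after the <Files *> block shown earlier:

ErrorDocument 403 /why-blocked.html

<Files "why-blocked.html">
order deny,allow
allow from all
</Files>

With order deny,allow plus allow from all, a visitor who is denied everywhere else can still receive that single explanation page when Apache serves the custom 403 document.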

You should be very careful when blocking countries or ISPs. Be sure it is in your best interest. If you're not sure, just ban the individual IP address. Then watch your logs for 403s from repeat visits. If a banned IP is dynamic, you may be blocking an innocent party who just acquired that address and had nothing to do with the original offender.

Wiz

pmkpmk

7:35 am on Apr 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



neweb: Since your question DOES lead into off-topic areas, only a quick answer from my side. There are lots of other threads covering "which CMS for my application" here on WW.

The technique I'm using is not unique to the CMS (in my case Typo3), even though Typo3 makes it very easy to apply. In fact any other CMS should be able to use the same technique, and even static, "handmade" pages can benefit from it.

The email address is cloaked in two ways:

1) The legitimate visitor should be able to see the email address, while an email harvester has to be confused as much as possible. This is done by camouflaging the address with CSS elements:

mailaddress<span style="display:none;">.ignore</span>@<span style="display:none;">ignore.</span>mydomain.com

A visitor with a modern CSS-capable browser sees: mailaddress@mydomain.com

A visitor with Lynx or non-CSS-browser sees: mailaddress.ignore@ignore.mydomain.com

An email harvester (hopefully) only sees garbage. It might recognize the "@" as a trigger element, but hopefully can't make any sense of the surrounding text.

2) The legitimate user should be able to click on the mail link and have his favourite mail application open up. An email harvester must not find any "href=mailto:" link.

This is done by a small piece of JavaScript, which receives the mail address in an encrypted format, decodes it on the fly, and launches the default email application with the address as a parameter:

<A HREF="javascript:linkTo_UnCryptMailto('rf6tqnqyt?¦jgrtxyjwwxhiuu3ij');">

This actually is a function of Typo3, but as I said, any other CMS, and even static pages, could be enhanced in the same way.

The whole block looks like this:

<A HREF="javascript:linkTo_UnCryptMailto('rfnqyt?¦jgrfxyjwExhu3ij');">mailaddress<span style="display:none;">.ignore</span>@<span style="display:none;">ignore.</span>mydomain.com</A>

In theory, a smart email harvester could be trained/programmed to crack camouflaged email addresses like this. But there is an almost infinite number of possible encryptions for the mail address, and an almost infinite number of possible ways to cloak it with CSS tags. As long as there are many, many other sites out there which are much easier to harvest, I think I'm well ahead of the harvesters with this technique.

Powdork

9:00 pm on Apr 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has anyone had issues with, or does anyone block, this IP 193.***.234.? It's from Romania Data Systems.

[edited by: jdMorgan at 1:27 am (utc) on April 17, 2004]
[edit reason] Obscured specifics [/edit]

brainstorm2k3

11:24 am on Apr 16, 2004 (gmt 0)



thanks
This .htaccess is really great =)

max66

3:10 pm on Apr 21, 2004 (gmt 0)

10+ Year Member



Wow great thread!

I am a bit lost with all of this, so I have a simple question about .htaccess syntax (yes, again, sorry); it will be quick:

Is the following command line correct for banning everything containing the string in question?

-> RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]

And with spaces:

-> RewriteCond %{HTTP_USER_AGENT} ^.*Program\ Shareware.*$ [OR]

HUGE thanks if you can reply! :)

jdMorgan

3:37 pm on Apr 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]

It'll work, but you can shorten it. A start anchor "^" followed by ".*" and an end anchor "$" preceded by ".*" are redundant:

RewriteCond %{HTTP_USER_AGENT} WebZIP [OR]

is entirely equivalent.

Ref: [etext.lib.virginia.edu...]

Jim

Wizcrafts

3:53 pm on Apr 21, 2004 (gmt 0)

10+ Year Member



Max66 asked:
Is the following command line correct for banning everything containing the string in question?

-> RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]

Max: you put in a lot more than is needed to ban WebZip agents. Here is all you really need to block any agent containing that string, case-insensitive, anywhere in its UA string:

RewriteCond %{HTTP_USER_AGENT} webzip [NC,OR]

Note that I removed the ^.* and .*$, as they are unnecessary. My example will catch that combination of letters, case-insensitive, anywhere in the user agent string.

And with spaces:

-> RewriteCond %{HTTP_USER_AGENT} ^.*Program\ Shareware.*$ [OR]

Again, you have included more than is needed to block this UA. My version of this rule reads:

RewriteCond %{HTTP_USER_AGENT} ^Program.?Shareware [NC,OR]

but you can also write it as:

RewriteCond %{HTTP_USER_AGENT} ^Program\ Shareware [NC,OR]

The ^ marks the absolute beginning of a regexp string, while the $ sign marks the absolute end. By leaving these out of the expression you allow for a match anywhere within the user agent string. However, in my experience, Program Shareware is always at the beginning of the name, so I anchor the beginning with a ^ but leave off the $, because there may be version numbers appended to it. Notice that I replaced your escaped space (\ ) with a .? where the space occurs. The reason for this is to allow for creative obfuscation by the users of these programs, who might change the space to a dash or underscore, or even a forward slash, in the hope of breaking our rules. The .? matches zero or one of any character between Program and Shareware, including a space. [NC] means No Case, i.e. case-insensitive.
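To make the difference concrete, here is how the two styles behave against a few invented user agent strings:

RewriteCond %{HTTP_USER_AGENT} ^Program.?Shareware [NC,OR]
# matches "Program Shareware 5.0.1" and "program-shareware", because .? allows
# zero or one character between the words; it does NOT match "Foo Program Shareware",
# since the leading ^ requires the name to start the string

RewriteCond %{HTTP_USER_AGENT} webzip [NC,OR]
# no anchors at all, so it matches "WebZIP/4.0" as well as
# "Mozilla/4.0 (compatible; WebZIP 5.0)" -- the string may appear anywhere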

IMHO, Wiz

jcoronella

10:17 pm on Apr 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Lots of great ideas here.

Has anyone taken a look at what a 30-line .htaccess does to your server load? Just a thought. I know I wouldn't put 30 lines of regular expressions into my PHP code for every page on a heavily loaded site, and this seems much the same.

jdMorgan

12:49 am on Apr 22, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it all depends on your site's traffic levels. Putting the code in httpd.conf is best performance-wise, since it gets compiled on server restart. Next-best is .htaccess, with care taken to make your rules selective enough that they don't run for *every* HTTP request (because the code in .htaccess is interpreted for each request). And after that, scripting languages are the third choice, because they are interpreted and are not native to the server software itself.
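For anyone who does have access to the main server configuration, a minimal sketch of the httpd.conf placement described above, reusing the WebZIP example from earlier in the thread (the DocumentRoot path is invented):

# httpd.conf -- parsed once at server (re)start rather than on every request
<Directory "/var/www/example.com/htdocs">
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} webzip [NC]
    RewriteRule .* - [F]
</Directory>

The directives are the same ones you would put in .htaccess; the difference is only where they live and how often the server has to read them.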

I've got a couple of sites which have up to 800-line .htaccess files, but because the rules are carefully written, and because the sites get thousands of hits per day instead of tens or hundreds of thousands (or more), they do just fine. The bottom line is that each server and hosted site is different, and you have to test to find out how big is too big for your CPU and your traffic level.

Taking a wider view, the point was made earlier that most sites won't need such a large, comprehensive set of rules, and each Webmaster should use only those rules which provide a real benefit to offset the performance loss they cause.

Jim

max66

9:03 am on Apr 22, 2004 (gmt 0)

10+ Year Member



Many thanks for the help!

So using this template

RewriteCond %{HTTP_USER_AGENT} string [NC,OR]

will ensure that every U-A containing "string" anywhere in the U-A will be banned?

Again, thanks a lot!

Wizcrafts

2:46 pm on Apr 22, 2004 (gmt 0)

10+ Year Member



So using this template

RewriteCond %{HTTP_USER_AGENT} string [NC,OR]

will ensure that every U-A containing "string" anywhere in the U-A will be banned?

Correct-a-mundo, Max

Don't forget that if the User Agent contains non-alphabet characters or spaces you can put .? between the last letter of name one and the first letter of name two, with no space between the letters. For example: website.?extractor [NC,OR] (which will catch "website extractor 1.09"), which otherwise would have to be written longhand as: ^Website\ Extractor\ 1\.09$ [OR]. The long method would fail if somebody used version 1.10 instead of 1.09.

I personally group all common expressions in one long rule, separating each one with a vertical pipe symbol (which is displayed as a broken pipe on this forum, ala: ¦). Here is one such grouped condition from my .htaccess:

RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow¦Crescent¦Disco.?¦ExtractorPro¦HTML.?Works¦Franklin.?Locator¦
Green\ Research¦Harvest¦HLoader¦http.?generic¦Industry.?Program¦IUPUI.?Research.?Bot¦Mac.?Finder¦NetZIP¦
NICErsPRO¦NPBot¦PlantyNet_WebRobot¦Production.?Bot¦Program.?Shareware¦Teleport.?Pro¦TurnitinBot¦TE¦
VoidEYE¦WebBandit¦WebCopier¦Websnatcher¦Website\ Extractor¦WEP.?Search¦Wget¦Zeus) [NC,OR]

Notice that the board has changed my pipes into broken vertical pipes, so you would have to re-type them correctly to use this group rule. The line of User Agents is anchored at the beginning with a ^, because these UAs are known to display in logs as typed, but there is no ending $ anchor. This allows for other characters after the main name, such as version numbers. I have another group rule that is not anchored at the beginning to catch strings that may not be at the beginning of a UA.
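One detail worth adding for anyone assembling their own file: RewriteCond lines like the group above do nothing by themselves; they only qualify the RewriteRule that follows them, and the last condition in the chain must not carry [OR]. A minimal sketch of the complete shape (the agent names are placeholders for your own list, typed with real pipes):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|WebCopier|Zeus) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (extractor|siphon|collector) [NC]
RewriteRule .* - [F]

The [F] flag on the final rule is what actually returns the 403 Forbidden response when any of the conditions match.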

These represent my personal choice of which agents to block with a 403 message, and may not apply to other people.

Wiz

[edited by: jdMorgan at 3:08 pm (utc) on April 22, 2004]
[edit reason] Edited long line to fix horizontal scrolling [/edit]

max66

8:27 am on Apr 23, 2004 (gmt 0)

10+ Year Member



Thanks to all for your precious answers!