How (and Where) best to control access

What method of access control has the least effect on the server/optimizati

         

carfac

5:46 pm on Sep 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi!

I have about 7 hosted domains on my server. Basically, after looking at my logs one day, I noticed I was getting scanned a lot by 'bots and such, so I did some research, and came up with a list of people/places/UA I want banned.

I did this in a three-fold way. My first line is a pretty long robots.txt. Sure, that only helps with polite robots, but better to start there.

My second line is a series of rewrite rules based on HTTP_USER_AGENT, HTTP_REFERER and REMOTE_ADDR variables- as I am sure all of you know, you have to go after all three, depending on the robot. All of this is put into my httpd.conf...

But there are still others that come from random IPs, mask their UA and have no referer - mainly web cachers, accelerators and the like (MS Front Page comes to mind!). I found a nifty script on this site that I installed under a filename excluded in my robots.txt... but these sorts of renegades will request the excluded file anyway. As soon as they do, their IP is logged in an .htaccess file, and they are excluded.

So...

I have some questions about this. My rewrite section is now quite long, and it is repeated 7 times, as it sits in the <Directory> of each virtual server. So I have a very swollen httpd.conf. Can I put something like this into one place (and where!) in my httpd.conf file so it will affect all virtual servers? Or, failing that, can it be a side file that all the virtual servers would call (as opposed to being in the httpd.conf proper)?

Also, is rewrite the best way to handle these (for speed), or should I change to a method that uses mod_access? Or is there another optimized way to do this? Or is the way I have it now about the best?

The third method nails 2-3 people a day... so I modded that script to add a date tag, and I flush it every week or so. I check the IPs; if they seem to just be some hack trying to steal my pages, I just drop them; if they are weird corporate things (like trademarksearch.com or iaea.org) they get added to the rewrite list... again, having one list rather than 7 would make administration easier.

Any thoughts would be welcome!

Thanks!

dave

jdMorgan

6:00 pm on Sep 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dave,

Welcome to WebmasterWorld!

I can't authoritatively answer your question about moving your access rules to a "side file" - as far as I know they need to be in httpd.conf, or in the .htaccess file in your top-level directory - or you can do per-directory rewrites using an .htaccess file in any subdirectory. I believe that httpd.conf is much more efficient, as processing occurs in a much earlier API phase, and this avoids recursion under several circumstances.
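On the "side file" question, one option worth testing is Apache's Include directive, which pulls another configuration file into httpd.conf at the point where it appears. The sketch below is only an illustration: the file name conf/ban_rules.conf, the host name and the document root are made up, and the Apache 2.0 and later docs list Include as valid inside <VirtualHost>, so check the behavior on an older 1.3 build before relying on it.

```apache
# httpd.conf: each virtual host pulls in the same shared rule file.
# The path is relative to ServerRoot; conf/ban_rules.conf is a hypothetical name.
<VirtualHost *>
    ServerName www.example.com
    DocumentRoot /home/www/example
    RewriteEngine On
    Include conf/ban_rules.conf
</VirtualHost>

# conf/ban_rules.conf: the single list that every host shares.
# RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf) [NC]
# RewriteRule .* - [F]
```

The rules are still evaluated once per virtual host, so this helps administration (one list instead of seven) rather than speed.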

But the main reason I'm replying is to raise a flag about iaea.org. Remember that this shows up as a faked {HTTP_REFERER} - not as a {REMOTE_HOST}.

Also, be careful to avoid banning caches - You must view them as acting as proxies for real users - some legitimate, and some not.

It looks like you are on the right track - you need a lot of different methods to catch most of them.

Jim

carfac

8:47 pm on Sep 6, 2002 (gmt 0)

jdMorgan:

Thanks for the reply- and the welcome. I have been lurking for a week or two, and I have learned a LOT here.

RE iaea.org- point well taken. Note that I did say my rewrite was based on HTTP_USER_AGENT, HTTP_REFERER and REMOTE_ADDR, depending on the problem. I believe that iaea.org is the only HTTP_REFERER I have banned in httpd.conf right now, so I did catch that one right!

RE banning caches- I think the third (script based) method would be the one that might catch those, and as I clean that weekly, I think I am ok with that. Unless I do not understand what you mean...

That script thing works off of .htaccess files, but should only have 10-25 IPs in it at any time. My understanding is that the .htaccess is where I will really see a slow-down, so I try to keep that trimmed. A regular IP will get out of .htaccess in a week, and only "graduate" to the httpd.conf if it looks like a bad site practice (like the trademark searches, or iaea.org).

Do you know if I can move the seven instances of rewrite rules out of the seven virtual hosts, and into the section two area? I know you can put <allow, deny> statements in that area, but I am not sure if that will apply to all the Virtual Servers. And would there be a performance penalty (like for .htaccess) for doing that?

Thanks!

dave

jdMorgan

9:34 pm on Sep 6, 2002 (gmt 0)

Dave,

I've never played with virtual servers, so I can't advise you there, except to say, "Well, ya can try it - late at night, when your traffic's minimal." ;)

The performance penalty generally goes down as you move up higher in the server code execution path. Try to make the ban files as efficient as possible, combining multiple bans on one line, for example. How much this overhead affects your server's performance depends heavily on how much traffic you get. With a few thousand hits per day, you'll never notice it. Raise that to a few thousand hits per hour, and you might see a difference.

Caches sit between the user's browser and your server. There is usually one in the browser, maybe another one in a proxy server on the user's machine (I have one as part of my internet-access-sharing suite), there's one at the ISP, and there may be more in the network path from the user's ISP to your server.

Any of these can drop or modify the USER_AGENT or HTTP_REFERER, so you have to watch out. Banning a caching proxy is essentially banning a user - or perhaps a whole bunch of users. If for example, you ban an AOL cache, then you may cut off hundreds or thousands of AOL subscribers.

These caches work by keeping a copy of your pages "closer" to the user. This speeds up access, and reduces the load on the 'net and on your server. By modifying the Expires headers sent with your server's responses, you can control how long the cached copies are kept - from seconds to months.
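As a rough illustration of the Expires control mentioned above, here is a minimal sketch using mod_expires. It assumes that module is compiled in or loaded on your server, and the lifetimes are arbitrary examples, not recommendations.

```apache
# Requires mod_expires. All lifetimes below are illustrative values only.
ExpiresActive On

# Fallback for anything without a more specific rule
ExpiresDefault "access plus 1 hour"

# Pages change often, images rarely
ExpiresByType text/html "access plus 10 minutes"
ExpiresByType image/gif "access plus 1 month"
```

Short lifetimes on HTML keep cached copies fresh, while long lifetimes on images let the caches take that load off the server.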

But, it is best to make sure that you DO NOT block most caches.

Jim

carfac

10:05 pm on Sep 6, 2002 (gmt 0)

jdMorgan:

Thanks again!

OK, I understand what a cache does (thanks!)...

Is there any clue to tell when something is a proxy? I would NEVER ban anything that resolved back to AOL (that is easy), and I do not ban (for more than a week) what I can tell is an ISP... any other clues? (I check them all at samspade...)

Also, is this what you mean:

Current Code:

RewriteCond %{HTTP_USER_AGENT} ^DIIbot.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download.*Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf.* [NC,OR]

Optimized:

RewriteCond %{HTTP_USER_AGENT} ^(DIIbot.*|DISCo|Download.*Demon|eCatch|EirGrabber|EmailCollector.*|EmailSiphon.*|EmailWolf.*)[NC,OR]

(I do not know how that will display on the forum, but that is intended to be one long line)

If I have that right, is there a line length (256 maybe) that I should be under?

Thanks again!

dave

carfac

10:06 pm on Sep 6, 2002 (gmt 0)

Oh, and I would intend some white space between EmailWolf.*) and [NC,OR] ...

(I have made that mistake ONCE!)

dave

andreasfriedrich

10:11 pm on Sep 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I believe that you cannot put RewriteRules common to all VirtualHosts into the main section of your httpd.conf, though the docs do seem to suggest that any setting in the main section will be inherited by the VirtualHosts.

I have a SSL enabled server running as a VirtualHost. I put a

RewriteRule /test/ [server...]  [R,L]
into the main section and then requested
https://server/test/
and did not get redirected. Instead I got a 404 since there is nothing like /test/ on the server.

I´m being rather vague here since I did not expect Apache/mod_rewrite to work that way. But that might very well be due to my limited technical knowledge of it.

Instead of using mod_rewrite you might consider using mod_perl and a variant of the Apache::BlockAgent module explained in Writing Apache Modules with Perl and C. But that´s only worth thinking about if you have mod_perl running anyway.

[edit]corrected some typos[/edit]

[edited by: andreasfriedrich at 10:19 pm (utc) on Sep. 6, 2002]

carfac

10:13 pm on Sep 6, 2002 (gmt 0)

And for a bonus!

Here is how I dealt with iaea.org:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?iaea.org.*$ [NC,OR]

That handles it nicely. Maybe overkill....
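For anyone copying that condition, here is a hedged sketch of how it could sit in a complete rule. It assumes this is the last (or only) condition, so the OR flag is dropped, and it escapes the dot in the domain and omits the trailing .*$, which adds nothing to a start-anchored match.

```apache
RewriteEngine On

# Match a referer that begins with http://iaea.org or http://www.iaea.org
RewriteCond %{HTTP_REFERER} ^http://(www\.)?iaea\.org [NC]

# "-" means no substitution; F answers with 403 Forbidden
RewriteRule .* - [F]
```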

andreasfriedrich

10:15 pm on Sep 6, 2002 (gmt 0)

Is there any clue to tell when something is a proxy?

A proxy will set the HTTP_VIA field of the request header (test using the Browser Header Checker [searchengineworld.com]). So you might want to check for that. However, any bad robot could set that field just as easily.
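One low-risk way to act on this is to log requests that carry a Via header instead of blocking them, and then review the log by hand. This is only a sketch: the variable name via_proxy, the format nickname proxycheck and the log file name are all made up for the example.

```apache
# Flag any request that arrives with a non-empty Via header
SetEnvIf Via ".+" via_proxy

# Log only the flagged requests, including what the Via header said
LogFormat "%h %t \"%r\" \"%{Via}i\" \"%{User-Agent}i\"" proxycheck
CustomLog logs/proxy_requests.log proxycheck env=via_proxy
```

Since the header can be forged or stripped, treat the log as a hint about which IPs are proxies, not as proof either way.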

carfac

10:22 pm on Sep 6, 2002 (gmt 0)

andreasfriedrich:

Thanks for that. I had played with mod_rewrite like that and NOT gotten it to work, either... but I am very novice at that. I thought I might be doing it wrong...

But I do know that allow,deny works in the main server section... so I was thinking that if I went from mod_rewrite to the allow/deny model I could put it up there. Something like this:

SetEnvIf Remote_Addr ^66\.74\.x\.xx$ ban
SetEnvIf Remote_Addr ^24\.30\.xxx\.xxx$ ban
SetEnvIf Remote_Addr ^24\.126\.xx\.xxx$ ban
SetEnvIf Remote_Addr ^200\.37\.xx\.xx$ ban
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>

(I purposely XXXed those IP's...)

My questions for this is:

Would it cover all access to the box (that is, would it be enforced on all Virtual Hosts?)

Can I do a "SetEnvIf User Agent"?

Would this be as efficient (or less or more so) than mod_rewrite?

I will look into Apache::BlockAgent, as I do run Mod_Perl (and love it!)

Thanks!

Dave

andreasfriedrich

10:57 pm on Sep 6, 2002 (gmt 0)

deny from will work on all VirtualHosts unless you overwrite that setting in the VirtualHosts section.

As for the performance I would suspect it to be slower than the mod_rewrite solution.
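On the "SetEnvIf User Agent" question: the attribute is written User-Agent, and mod_setenvif also offers the case-insensitive SetEnvIfNoCase. A minimal sketch in the same style as the block above follows; the directory path and the IP address (192.0.2.15, a reserved documentation address) are placeholders, and the UA names are simply borrowed from earlier in the thread.

```apache
# Main server section of httpd.conf, outside any <VirtualHost>
SetEnvIfNoCase User-Agent "^(EmailSiphon|EmailWolf|eCatch)" ban
SetEnvIf Remote_Addr "^192\.0\.2\.15$" ban

# /home/www is a hypothetical parent of all the document roots
<Directory "/home/www">
    Order Allow,Deny
    Allow from all
    Deny from env=ban
</Directory>
```

This uses the Order/Allow/Deny syntax of the Apache versions discussed in this thread.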

If you are running mod_perl and you love it and know it, then I would definitely do it that way. It´s more flexible, rather fast, allows your bad browsers to be stored in an external file, etc. When it comes to the RE you will probably want to use something like Jeffrey Friedl´s closure trick to speed up the matching.

jdMorgan

11:46 pm on Sep 6, 2002 (gmt 0)

Nobody mentioned this, but has everyone tested with a "RewriteEngine On" directive at the beginning of the rewrite stuff?

When combining User-agents on one line, be careful to pay attention to whether the patterns are anchored, and if so, whether at the beginning or end of the string (or both). Basically, I combine mine alphabetically, with separate groups for each:

^(a|b|c)$ [NC,OR]
^(d|e|f) [NC,OR]
(h|i|j)$ [NC,OR]
(k|l|m) [NC,OR]

Where a,b,c,d, etc. are UA patterns to match.

Some of the "ban lists" you'll find floating around here have errors in regard to how the UA string is anchored, and therefore some of the entries won't work as expected.

Also, don't forget to escape all periods, parens, etc. with a preceding backslash, especially in IP numbers!

Note that ^xyz.*$ is equivalent to ^xyz
^.*abc$ and abc$ act the same, and ^.*def.*$ can be shortened to just def

The point is that usually you don't need .* next to a beginning or end anchor, unless you want to enclose it in parentheses to create a backreference for later use...
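Putting the grouping, anchoring and escaping advice together, a ban block might look roughly like the sketch below. The UA names are borrowed from earlier in this thread, but which group each one belongs in is only assumed for illustration, and 192.0.2.15 is a reserved documentation address, not a real offender.

```apache
RewriteEngine On

# Starts-with patterns share one start-anchored group
RewriteCond %{HTTP_USER_AGENT} ^(DIIbot|DISCo|eCatch|EirGrabber) [NC,OR]

# Match-anywhere patterns need no anchors and no leading or trailing .*
RewriteCond %{HTTP_USER_AGENT} (EmailCollector|EmailSiphon|EmailWolf) [NC,OR]

# Literal dots in an IP address are escaped; the last condition carries no OR
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.15$

# No substitution; answer with 403 Forbidden
RewriteRule .* - [F]
```

Note the space before each flag block and the missing OR on the final condition.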

Look around here using the site search - there are several threads with links to good regular expressions pattern-matching tutorials.

Please post if you get the single rewrite file working for all your virtual servers. I can't do that now - and don't need to - but I'm very interested for future reference.

Thanks,
Jim

andreasfriedrich

11:50 pm on Sep 6, 2002 (gmt 0)

Nobody mentioned this, but has everyone tested with a "RewriteEngine On" directive at the beginning of the rewrite stuff?

It goes without saying that you need to turn the rewriting engine on. And in fact it was and still is turned on in my httpd.conf - both in the main section and in the VirtualHost section.

jdMorgan

11:58 pm on Sep 6, 2002 (gmt 0)

andreasfriedrich,

In my business, I've learned that it never "goes without saying" ... :)

I'm very interested in whether the mod_rewrite method can work in this configuration, since I agree with you that it is probably more efficient. I just want to make sure that the test didn't fail because that step was omitted.

Jim

andreasfriedrich

12:13 am on Sep 7, 2002 (gmt 0)

Yeah, sometimes it´s the small things that get overlooked.

I was so vague in my post saying that there is no inheritance because there is quite a lot of other rewriting and mod_perl stuff going on in my test server that I wasn´t sure whether there might be any side effects preventing the inheritance.

I guess to test whether any rewriting directives in the main section affect the VirtualHost sections it would be good to set up a new and fresh server. It´s just a question about who will do it? I won´t if there are a lot of others trying it.

Perhaps we could agree on that before going to work.

carfac

1:24 am on Sep 7, 2002 (gmt 0)

Hi:

Just a note: I, too, did NOT forget to turn it on... I just did not add it to the post for brevity...

I ran a test on a low-traffic VH, removing the rewrite rules from the <VirtualHost> tag, and putting them up in the main server configuration. I could NOT get rewrite to work that way at all.

Been looking into Apache::BlockAgent, and that looks VERY promising. You add 2-3 lines of code to each VH, and have one MASTER text file of all the bans. It ONLY works for UA, or so I am led to believe at this point.

OK, I am going to show how stupid I am here, but here goes: what did you mean by: "pay attention to whether the patterns are anchored"

I do not know what "anchored" means, and I guess that is why some of my expressions are so messy! I have been starting ALL match strings with a ^ no matter what, and adding a .* if I thought there was any possibility of "extra" stuff after the string (version number or whatever) and then a $ at the end - I guess assuming the ^ and $ were like quotes. My read of what you wrote is I do not have to do that.

So would ^web match "Webmaster" and "AWEB"? (Note, I do use NC for all these, just in case...)

I get so mixed up with all the different languages... "*" matches anything here, and ".*" does there...

dave

carfac

1:35 am on Sep 7, 2002 (gmt 0)

Hi:

I just found this

[webmasterworld.com...]

so I am reading up on Regular expressions...

I am still confused about what "anchor" means, though...

carfac

1:39 am on Sep 7, 2002 (gmt 0)

Nevermind- found anchor!

jdMorgan

1:47 am on Sep 7, 2002 (gmt 0)

^ start anchor
$ end anchor

^exact$ matches the word "exact" only.
^begin matches anything that starts with "begin".
finis$ matches anything that ends with "finis".

Therefore, ^anything.*$ can be replaced with ^anything, because omitting the end anchor is the same as ending the match pattern with ".*$".

And ^.*everything$ can be replaced with everything$, because omitting the start anchor does the same thing as "^.*".

So, unless you are creating a back-reference, you never need ".*" at the beginning or end of a pattern.

A back-reference, where you might have ^(.*)\.html$ for example, requires parentheses, so it really doesn't break the rule above...
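To show why the parenthesized .* earns its keep in that case, here is a small sketch in which the captured text is reused as $1 in the substitution. The .html to .php mapping is purely a made-up example, not something anyone in this thread is doing.

```apache
RewriteEngine On

# (.*) captures everything before ".html"; $1 reuses it in the target.
# /docs/page.html is redirected to /docs/page.php
RewriteRule ^(.*)\.html$ $1.php [R,L]
```

Here the .* cannot be dropped, because without it there would be nothing to capture.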

Some good info [etext.lib.virginia.edu] on regular expressions.

I went and dug around in the Apache docs, and it looks like mod_rewrite will not work for multiple virtual servers. From the second paragraph under Module mod_rewrite - API Phases:

"So, after a request comes in and Apache has determined the corresponding server (or virtual server) the rewriting engine starts processing of all mod_rewrite directives from the per-server configuration in the URL-to-filename phase."

As I read this, it won't do what you want because you want to rewrite before the virtual server is determined. :(

Jim

andreasfriedrich

1:54 am on Sep 7, 2002 (gmt 0)

It ONLY works for UA

No. The handler subroutine gets passed a reference to the request object. You can access the whole Apache API.

You can get any http header field from the request using

$any_http_header_field = $r->header_in('Http header field');
To get the referring url use
$referer = $r->header_in('Referer');

andreasfriedrich

2:05 am on Sep 7, 2002 (gmt 0)

Great or rather sad info Jim.

carfac

2:38 am on Sep 7, 2002 (gmt 0)

Well, sorry for making you repeat yourself jdMorgan... but thanks for the info.

I found a good write-up on Apache::BlockAgent here:

[gd.tuwien.ac.at...]

It explains it all there, but I do not get one thing...

I assume you take that hunk of code and save it to the server somewhere, and then you call it from each VH with:

<Location />
PerlAccessHandler Apache::BlockAgent
PerlSetVar BlockAgentFile /home/www/conf/bad_agents.txt
</Location>

So, how (and where!) would you save the file (or how do you install it?) so it can be called with: PerlAccessHandler Apache::BlockAgent ?

I tried CPAN to install it, and it was not there...

dave

jdMorgan

3:44 am on Sep 7, 2002 (gmt 0)

Don't get me to lyin' !

(Translation: I don't have any idea.)

But others here likely do know...

Good luck with this.

Jim

carfac

3:48 am on Sep 7, 2002 (gmt 0)

jdMorgan:

Well, your help with the Regular expressions helped cut 40K from my httpd.conf file, so I thank you for all that help! Sorry for making you repeat some of that- your other post was very insightful!

I got rid of a LOT of ".*$" at the end of lines... and I combined 6-8 UAs into one line, which saved a LOT of space, too... so thanks!

dave

PS, I assume it is not appropriate to post my rewrite stuff, but if you would like, I can sticky it to you, if you want!

andreasfriedrich

1:04 pm on Sep 7, 2002 (gmt 0)

So, how (and where!) would you save the file (or how do you install it?) so it can be called with

That´s easy. mod_perl is just a Perl interpreter running in the Apache process. Apache::BlockAgent is just a normal Perl module. So it goes wherever Perl modules go. They should reside in a directory that is accessible by the user the webserver is running as and that is in the Perl @INC array.

The canonical way is to have a

PerlRequire /path/to/startup.pl
in your httpd.conf.

Your startup.pl might look like this:


#!/usr/local/bin/perl -w

# make sure we are in a sane environment.
$ENV{GATEWAY_INTERFACE} =~ /^CGI-Perl/ or die "GATEWAY_INTERFACE not Perl!";

# add directories to the @INC array
use lib qw(
/var/www/html
/etc/httpd/lib/perl/EigeneModule/WebDev/lib
/etc/httpd/lib/perl/EigeneModule
/etc/httpd/lib/perl2
);

# preload and precompile some modules
use Apache::Registry;
use Apache::RegistryLoader;
use Apache::Status;
use Apache::DBI;
use Apache::Session::MySQL;
use Apache::AuthDBI;

use strict;
use CGI; CGI->compile(':all','escapeHTML');
use MLDBM qw(DB_File Storable);
use File::Find;
use Fcntl;
use DBI;
use DBD::mysql;

# start database connections
DBI->install_driver("mysql") or die "Couldn't install mysql driver";

Apache::DBI->connect_on_init("DBI:mysql:database=pension;host=localhost",
"apache",
"aaron/8",
{'RaiseError' => 1})
or die "Couldn't connect to database: $DBI::errstr";

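# preload your own modules but start the server anyway even if they fail loading
# (require runs at run time, so eval can trap a failure; use would run at compile time)
eval {
require MyPortal;
require WebDev::SiteStatistics;
};
warn "Preloading own modules failed: $@" if $@;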

1;

carfac

5:37 pm on Sep 15, 2002 (gmt 0)

Hi:

For what it is worth, the normal Apache::BlockAgent just blocks on the UA field. With Andreas' help (Thanks a TON, Andreas!), I modded Apache::BlockAgent into Apache::BlockIP, which will block based on IP rather than UA.

Don't search for it... if you want it, just sticky me, I am happy to pass it on. But, a word or two of advice - you MUST have root access to the server! Also, I would get Apache::BlockAgent working FIRST, and then adding Apache::BlockIP will fit right in.

Again, a big round of applause for Andreas!

Thanks!

dave

andreasfriedrich

10:46 pm on Oct 12, 2002 (gmt 0)

This is an update to the question of How (and Where) best to control access or How to centralize administration of things to block?

Can I put something like this [list of UAs, IPs, Referrers, etc.] into one place (and where!) in my httpd.conf file so it will affect all virtual servers?

The problem with using mod_rewrite is that the substitution is done after Apache determined the (virtual) server to use and RewriteRules are not inherited by virtual server sections.

You would need to specify the rules each and every time for each (virtual) server. To solve this problem efficiently I suggested [webmasterworld.com] using mod_perl and a variant of the Apache::BlockAgent module explained in Writing Apache Modules with Perl and C.

However, if you have root access there is a different solution using mod_rewrite [webmasterworld.com].

Andreas

andreasfriedrich

1:25 pm on Oct 16, 2002 (gmt 0)

It´s amazing what reading the documentation can do for you. In the last update to this thread I wrote:

The problem with using mod_rewrite is that the substitution is done after Apache determined the (virtual) server to use and RewriteRules are not inherited by virtual server sections.

This should read: are not inherited by virtual server sections unless told so. Setting RewriteOptions [httpd.apache.org] to inherit in the per-virtual-server context means that the maps, conditions and rules of the main server are inherited by the virtual server.
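A minimal sketch of what that could look like in httpd.conf follows. The host name and the banned UA names are placeholders, and the mod_rewrite docs note that RewriteEngine On is not inherited, so it still has to appear in every virtual host.

```apache
# Main server section: the one shared ban list
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf) [NC]
RewriteRule .* - [F]

# Each virtual host switches the engine on and asks for the parent rules
<VirtualHost *>
    ServerName www.example.com
    RewriteEngine On
    RewriteOptions inherit
</VirtualHost>
```

With this in place the list lives in one spot, and each of the seven hosts needs only the two rewrite lines shown inside the <VirtualHost> block.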

Why didn´t somebody tell me this?

Andreas