
Apache Web Server Forum

    
Mod_rewrite and MSIE 5.0 ):
All was going so well until...
bsd4me




msg:1499390
 4:28 am on Nov 26, 2002 (gmt 0)

Hi all. Excellent forum, and one of the few that has got me glued. Here's the problem: I've read through the posting A Close to perfect .htaccess ban list [webmasterworld.com...], and I've had great success with killing off the site rippers and bandwidth hogs, with one exception.

When I test the site from my Win98 machine and run the MSIE "Make available offline" option, it seems that MSIE 5.0 presents itself as a normal browser as it rips its way through the site: it does not identify itself as Microsoft URL Control or MSIECrawler. Instead, it identifies itself like this: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt).

Am I missing something here? I tried all of those mod_rewrite recipes, such as these two:

RewriteCond %{HTTP_USER_AGENT} ^.*microsoft.url.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*MSIECrawler.*$ [OR]

But nothing will stop the Win98 MSIE 5.0 offline-reader feature from ripping through my 21,000-page website. Last month, this browser alone sucked up 4 gigs of bandwidth.

I'm blocking by IP, but that's becoming a real pain. Anyhow, is there some weird reason why the offline parser in MSIE 5.0 spoofs itself as a normal browser? Is there a mod_rewrite recipe to block it? Thanks to the way Microsoft has handled this one, it appears as if this browser cannot be stopped unless you block it completely, but hey, it never hurts to ask. Any help would be welcome.
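The only token in that string that looks even slightly unusual is DigExt, so I suppose I could try something along these lines:

RewriteCond %{HTTP_USER_AGENT} DigExt [NC]
RewriteRule .* - [F]

But that's just a guess on my part, and as far as I can tell DigExt shows up on every request from an IE 5.0 install that has the offline-browsing feature, not just during an offline rip, so a rule like that would probably lock out perfectly legitimate visitors as well.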

Thanks

Dave

 

jdMorgan




msg:1499391
 4:57 am on Nov 26, 2002 (gmt 0)

bsd4me,

Welcome to WebmasterWorld!

I haven't seen that variant of the "view offline" feature's User-agent. In most cases, blocking MSIECrawler, or disallowing it in robots.txt, is sufficient. This "normal-looking User-agent" may be an IE 5.0 or Win98 thing, though.

You might want to consider a spider-trapping approach. This can be implemented with .htaccess, a simple Perl script [webmasterworld.com], and a few "spider trap" links salted around your site. You link to a URL that is disallowed in robots.txt, using a link (such as a 1x1 transparent .gif) that is not visible to a human visitor. Any request for that disallowed file is redirected to the script. The script grabs the requester's IP address and adds it to a "deny from" directive at the top of your .htaccess file, then outputs a small, nondescript HTML page to complete the request with a 200-OK response.

On subsequent requests, that user-agent receives a 403-Forbidden response from your server and can no longer access your site. Once a day/week/month, you go through the accumulated IP addresses in your .htaccess file, identify blocks of related addresses that recur, and "compress" the list - perhaps using mod_rewrite to make the blocking more efficient for a particularly large or non-contiguous block of addresses. This takes a lot of the work out of blocking by IP.
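In rough outline, the trap script boils down to something like this. (This is only a sketch of the idea, not the linked script - that one is Perl - and the path is a placeholder; the simplified version here also just appends the deny line rather than inserting it at the top.)

<?php
// trap.php - reached only through the hidden link that robots.txt disallows.
// Records the visitor's IP as a "deny from" line in .htaccess, then answers
// with a harmless page so this first request still completes with a 200-OK.
$ip       = $_SERVER['REMOTE_ADDR'];
$htaccess = '/path/to/your/site/.htaccess';   // placeholder path

$fp = fopen($htaccess, 'a');
if ($fp && flock($fp, LOCK_EX)) {   // lock so two trapped hits can't collide
    fwrite($fp, "deny from $ip\n");
    flock($fp, LOCK_UN);
}
if ($fp) {
    fclose($fp);
}

echo "<html><body><p>Nothing here.</p></body></html>";
?>

Everything else - the hidden link, the Disallow line in robots.txt, and the redirect of the trapped URL to the script - is plain .htaccess and HTML.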

Jim

bsd4me




msg:1499392
 7:16 am on Nov 26, 2002 (gmt 0)

Heh, I started using something similar to that about a year ago, except without the added luxury of the script. I created a number of directories with pages like this in them:

<!--#config timefmt="%y%j%H%M%S" --> <!-- format of the date string -->
<!--#exec cmd="sleep 20" --> <!-- make this page sloooow to load -->

<a href=i<!--#echo var="DATE_GMT" -->.html>access codes</a>
remote admin <a href=j<!--#echo var="DATE_GMT" -->.html>information is here.</a>
email addresses <a href=k<!--#echo var="DATE_GMT" -->.html>are here</a>

<p><big><big><strong>&nbsp;</strong></big></big></p>

<p><big><big><strong>This is a private area.&nbsp; Unless you're an administrator or have
some other special reason to be here, please leave now!</strong></big></big></p>

<p><a href=l<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=m<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=n<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=o<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=p<!--#echo var="DATE_GMT" -->.html>.</a><br>

I listed all the directories containing this code in my robots.txt file. This would clog them for hours as they hung there suspended in time :) Oh yeah, and the links should be hidden (black on black, for example).

In any event, thanks for the reply; however, the problem is not site grabbers dishonoring the robots.txt file. In fact, since it's Microsoft-based, it obeys those rules. The problem is the amount of bandwidth they can draw when someone sets one loose on the site and then goes off to bed.

They probably don't realize the size of the site, which I can understand, but hell... that's costing me plenty of money, not to mention the beating it puts on the server. Someone doing this last week drove the load up to 4.10 until I blackholed them. I guess on a high-speed connection, your request rate is limited only by your throughput.

That script would be killer if there were a way to:

- Cache the IPs of current users via an SSI call to the script
- Set a threshold on how many pages they can load per minute - maybe have the script deny access if it detects X accesses within X seconds
- Make legitimate spiders such as Google exempt from the rule set
- Block an offending IP for an hour, a day, or indefinitely

Even old Matt's wwwboard had a module like this; you would install it if you wanted to limit the time between posts from a given IP. I wish I knew more about building my own scripts. I wonder, though: would this even be possible, or would it put too much of a load on the server?

I mean, we run all sorts of other SSI-based calls, and they don't seem to slow things down. I suppose it would depend on how the script was written. At any rate, this continues to be a problem, and the rippers are becoming more aggressive all the time. As I see it, the only way to shut this down is to take control over their access rate.

I watch legitimate spiders all the time, and I can't recall any of them ever hitting even one page per minute. Most of them fetch a page every five minutes, or in some cases every few hours, so I can't see a system like this tripping them up. Anyway, I'm just babbling away here.

Dave

jdMorgan




msg:1499393
 2:53 pm on Nov 26, 2002 (gmt 0)

bsd4me,

Your thoughts on recording IPs and limiting access echo some of my own. A module called SpeedLimit [modperl.com] is available for the Apache server, but I don't have access to it on my hosts.

Maybe when I get better at scripting, I'll be able to do something like that myself. The main problem is one of scaling: an efficient way to search the previously-recorded IPs is needed, so that the system won't break down under heavy load. The critical path is the part of the code that runs while the recorded-IP file is open. In a multi-process environment such as a web server, the file must be locked while it is open, which constrains the design of the IP-database access code and requires close attention to efficiency.
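Even without writing the real thing, the shape of that critical section is easy to sketch. Something like this (PHP purely for illustration; the file location, window, and threshold are invented, and it's untested):

<?php
// ratecheck.php - illustration only. Imagine every page pulling this in
// (via an SSI or PHP include). It keeps one "timestamp ip" line per recent
// hit in a small flat file, and holds the lock only for the read-prune-rewrite step.
$logfile   = '/tmp/recent-hits.txt';   // placeholder location
$window    = 60;                       // look back 60 seconds
$threshold = 30;                       // more than 30 hits/minute draws a 403
$ip   = $_SERVER['REMOTE_ADDR'];
$now  = time();
$hits = 0;

$fp = fopen($logfile, 'a+');
if ($fp && flock($fp, LOCK_EX)) {      // ---- critical section begins ----
    rewind($fp);
    $keep = array();
    while (($line = fgets($fp)) !== false) {
        $parts = explode(' ', trim($line));
        if (count($parts) != 2) continue;
        list($when, $who) = $parts;
        if ($now - (int)$when > $window) continue;   // expire old entries
        $keep[] = trim($line);
        if ($who == $ip) $hits++;
    }
    $keep[] = "$now $ip";              // record the current hit
    ftruncate($fp, 0);
    fwrite($fp, implode("\n", $keep) . "\n");
    fflush($fp);
    flock($fp, LOCK_UN);               // ---- critical section ends ----
}
if ($fp) fclose($fp);

if ($hits >= $threshold) {
    header('HTTP/1.0 403 Forbidden');
    exit('Too many requests.');
}
?>

The point is only that everything between the lock and the unlock has to stay tiny, because every request on the site serializes on that file.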

I suppose a "poor man's" method might be to include your SSI "sleep 20" - or, more likely, "sleep 2" - at the end of often-accessed pages. This would slow down a robot, but not a human.

I laughed at the "fake links page" idea... Hey, why not serve a bogus "Blue Screen of Death" page to MS user-agents? ;)

Jim

bsd4me




msg:1499394
 7:49 pm on Nov 26, 2002 (gmt 0)

Yes, exactly. I was looking at bwshare and the newer mod_throttle, both of which offer an impressive solution. The problem, however, is that they require modifications to Apache, and that requires root. The vast majority of us do not have root access, since we're in a shared-server environment, which makes something like this difficult to implement. It also appears that installing modules like these affects the 'global' environment and cannot be configured on a 'per user' basis. Obviously, that's no good, since not everyone may want these limits on their sites.

As for creating a script, in theory it all sounds quite simple, but after numerous searches and reading through the web and Usenet, I have found nothing that remotely resembles this. Hmm... could it be that no one's thought of it, or are there programming issues that cannot be overcome? You're correct that you'd need a way to efficiently cache numerous IPs, have the script monitor their usage, and then expire them after a certain amount of inactivity. Indeed, on a busy site, this could cause problems if not implemented properly.

I wonder... I'm not a PHP guru, but perhaps a PHP/MySQL solution could also be a possibility here. I saw something like this last night, although the way the author explains it is somewhat confusing. It may be possible to create a PHP script that logs all IPs to a database, and it may also be easier to manipulate, cache, and expire those IPs under the MySQL architecture. The recipe could consist of the main programming.php script, with an <!--#include virtual --> tag placed in your HTML pages to call it. Of course, if you're running PHP pages, it's even easier.
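To make that concrete, here's the sort of skeleton I'm imagining (pure speculation on my part - the table, columns, credentials, and threshold are all made up, and I haven't tried any of this):

<?php
// programming.php - hypothetical per-hit logger / limiter.
// SSI pages would call it with:  <!--#include virtual="/programming.php" -->
// PHP pages could simply include() it at the top.
$db = mysqli_connect('localhost', 'user', 'pass', 'traffic');  // placeholder credentials
if (!$db) exit;                      // never take the site down over the limiter

$ip = $_SERVER['REMOTE_ADDR'];

// Record this hit.
$stmt = mysqli_prepare($db, "INSERT INTO hits (ip, stamp) VALUES (?, NOW())");
mysqli_stmt_bind_param($stmt, 's', $ip);
mysqli_stmt_execute($stmt);

// How many hits from this IP in the last 60 seconds?
$stmt = mysqli_prepare($db,
    "SELECT COUNT(*) FROM hits WHERE ip = ? AND stamp > NOW() - INTERVAL 60 SECOND");
mysqli_stmt_bind_param($stmt, 's', $ip);
mysqli_stmt_execute($stmt);
mysqli_stmt_bind_result($stmt, $count);
mysqli_stmt_fetch($stmt);

if ($count > 30) {                   // hypothetical threshold: 30 pages/minute
    header('HTTP/1.0 403 Forbidden');
    exit('Slow down, please.');
}
?>

Old rows could then be expired by a cron job, much the way I clear out the password database in the setup below.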

Queries? I'm currently experimenting with this on a smaller scale. I'm using a PHP, MySQL, and CGI solution to kill password hurlers. This script is incredibly simple, yet powerful. When someone logs into the member section, they get 3 tries before being blocked completely. Here's how it works:

1. A user makes a failed login attempt
2. The error document is pointed to 403.php
3. 403.php sends the failed IP to a MySQL database
4. The user tries two more times, and fails
5. 403.php checks this IP (now in the db) against the failure threshold, set to 3 failures
6. 403.php then writes out a new .htaccess file blocking this IP from additional attempts
7. An email is sent notifying me of the blocked IP
8. A cron job runs once every 24 hours to clear the DB
9. A second cron job, also every 24 hours, writes out a fresh .htaccess file
10. Optionally, I can clear it manually by running the script at the shell
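For anyone who wants to roll their own, a bare-bones sketch of those steps (placeholder table name, credentials, and paths - adapt to taste) would look something like this:

<?php
// 403.php - simplified sketch of the failed-login blocker described above.
$db = mysqli_connect('localhost', 'user', 'pass', 'security');  // placeholder credentials
$ip = $_SERVER['REMOTE_ADDR'];

// Step 3: record the failed attempt.
$stmt = mysqli_prepare($db, "INSERT INTO failed_logins (ip, stamp) VALUES (?, NOW())");
mysqli_stmt_bind_param($stmt, 's', $ip);
mysqli_stmt_execute($stmt);

// Step 5: how many failures does this IP have now?
$stmt = mysqli_prepare($db, "SELECT COUNT(*) FROM failed_logins WHERE ip = ?");
mysqli_stmt_bind_param($stmt, 's', $ip);
mysqli_stmt_execute($stmt);
mysqli_stmt_bind_result($stmt, $failures);
mysqli_stmt_fetch($stmt);

if ($failures >= 3) {
    // Step 6: rebuild the deny list from every IP over the threshold.
    // (A real version would merge these lines into the full .htaccess, not overwrite it.)
    $result = mysqli_query($db,
        "SELECT ip FROM failed_logins GROUP BY ip HAVING COUNT(*) >= 3");
    $lines = "";
    while ($row = mysqli_fetch_row($result)) {
        $lines .= "deny from " . $row[0] . "\n";
    }
    file_put_contents('/path/to/members/.htaccess', $lines, LOCK_EX);  // placeholder path

    // Step 7: send myself a note.
    mail('me@example.com', 'IP blocked', "Blocked $ip after $failures failed logins");
}

// Finally, show the visitor a plain 403 page.
header('HTTP/1.0 403 Forbidden');
echo "<html><body><p>Access denied.</p></body></html>";
?>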

Stress test on the DB?

It's been pounded with a password hurler using "rotating" IPs; after all, no one tries to hack a password from a single IP. Here's what I saw: MySQL and 403.php successfully blocked all the proxies after 3 unsuccessful attempts. Load on the db server? Negligible. Actually, it was just sitting there smoking a cigar. Even at full force, the server load hardly spiked. The Apache load can start to climb over time, simply because of the processes that pile up in the queue when someone is firing 100 requests per second at it.

For this reason, I have it set to 3 failed attempts. It takes fewer resources to serve up a plain 403 error than it does to keep responding with authentication challenges to a vicious password hurler. OK, OK... I'm supposed to have a point to all this.

My point is that, from what I've seen, MySQL appears to handle IP processing quite nicely. Mind you, what we're planning to do with speed limits is a somewhat different application, but from what I've seen so far, this could be a real possibility.

Dave
