Forum Moderators: phranque
When I test the site from my Win98 machine and run MSIE's "Make available offline" option, MSIE 5.0 presents itself as a normal browser while it rips its way through the site. It does not identify itself as Microsoft URL Control or MSIECrawler; instead, it identifies itself like this: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt).
Am I missing something here? I tried all of those mod_rewrite recipes, such as these two:
RewriteCond %{HTTP_USER_AGENT} ^.*microsoft.url.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*MSIECrawler.*$ [OR]
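For reference, a complete ruleset built from those conditions would look something like this (a sketch - the final condition's flags and the RewriteRule line are my guess at the usual recipe; it assumes mod_rewrite is enabled):

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} microsoft.url [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC]
RewriteRule .* - [F]
```

But since this offline reader never sends either string, the conditions never match. The DigExt token it does send is reportedly also present during ordinary interactive browsing once the offline-favorites component is installed, so matching on DigExt risks locking out real visitors.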
But nothing will stop the Win98 MSIE 5.0 offline reader feature from ripping through my 21,000-page website. Last month, this browser alone sucked up 4 gigs of bandwidth.
I’m blocking by IP, but that’s becoming a real pain. Anyhow, is there some weird reason why the offline parser in MSIE 5.0 spoofs itself as a normal browser? Is there a mod_rewrite recipe to block it? Thanks to the way Microsoft has managed this one, it appears this browser can’t be stopped unless you block it completely, but hey.. it never hurts to ask. Any help would be welcome.
Thanks
Dave
Welcome to WebmasterWorld!
I haven't seen that variant of the "view offline" User-agent. In most cases, blocking MSIECrawler or disallowing it in robots.txt is sufficient. This normal-looking User-agent may be an IE 5.0 or Win98 thing, though.
You might want to consider a spider-trapping approach. This can be implemented with .htaccess, a simple Perl script [webmasterworld.com], and a few "spider trap" links salted around your site.

You link to a URL that is disallowed in robots.txt, using a link (such as a 1x1 transparent .gif) that is not visible to a human visitor. Any request for that disallowed file is redirected to the script. The script grabs the requestor's IP address and adds it to a "deny from" directive at the top of your .htaccess file, then outputs a small, nondescript html page to complete the http request with a 200-OK response. On subsequent requests, however, the user-agent receives a 403-Forbidden response from your server and can no longer access your site.

Once a day/week/month, you go through the accumulated IP addresses in your .htaccess file, identifying blocks of related addresses that recur and "compressing" the list - maybe using mod_rewrite to make the blocking more efficient for a particularly large or non-contiguous block of addresses. This takes a lot of the work out of blocking by IP.
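The bookkeeping that script does could be sketched like this - a Python illustration only (the original is a short Perl CGI), and the marker comment is my own assumption:

```python
# Sketch of the spider-trap script's .htaccess update: given the current
# .htaccess body and an offending IP, add a "deny from" line under a
# marker comment at the top of the file. Idempotent for repeat offenders.

def add_deny(htaccess, ip, marker="# --- spider trap ---"):
    """Insert 'deny from <ip>' under the marker, skipping IPs
    that are already listed."""
    lines = htaccess.splitlines()
    if f"deny from {ip}" in lines:
        return htaccess            # already blocked
    if marker in lines:
        pos = lines.index(marker) + 1
    else:                          # first trip: create the block up top
        lines.insert(0, marker)
        pos = 1
    lines.insert(pos, f"deny from {ip}")
    return "\n".join(lines)
```

A CGI wrapper around this would read the requestor's address from REMOTE_ADDR, rewrite the file, and return the small 200-OK page.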
Jim
<!--#config timefmt="%y%j%H%M%S" --> <!-- format of the date string -->
<!--#exec cmd="sleep 20" --> <!-- make this page sloooow to load -->
<a href=i<!--#echo var="DATE_GMT" -->.html>access codes</a>
remote admin <a href=j<!--#echo var="DATE_GMT" -->.html>information is here.</a>
email addresses <a href=k<!--#echo var="DATE_GMT" -->.html>are here</a>
<p><big><big><strong> </strong></big></big></p>
<p><big><big><strong>This is a private area. Unless you're an administrator or have
some other special reason to be here, please leave now!</strong></big></big></p>
<p><a href=l<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=m<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=n<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=o<!--#echo var="DATE_GMT" -->.html>.</a><br>
<a href=p<!--#echo var="DATE_GMT" -->.html>.</a><br>
I listed all directories carrying this code in my robots.txt file. This would clog them for hours as they hung there, suspended in time :) Oh yeah.. and the links should be hidden - black on black, for example.
In any event, thanks for the reply; however, the problem is not site grabbers dishonoring the robots.txt file. In fact, since this one is Microsoft-based, it obeys those rules. The problem is the amount of bandwidth it can draw when someone sets it loose on the site and then goes off to bed.
They probably don’t realize the size of the site, which I can understand, but hell… that’s costing me plenty of money, not to mention the beating it puts on the server. Someone doing this last week drove the load up to 4.10 before I blackholed them. I guess on a high-speed connection, your request rate is limited only by your throughput.
That script would be killer if there were a way to:
- Cache the IPs of current users via an SSI call to the script
- Set a threshold on how many pages they can load per minute - maybe have the script deny access if it detects X accesses within X seconds
- Exempt legitimate spiders such as Google from the rule set
- Block an offending IP for an hour, a day, or indefinitely
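That wish list could be sketched roughly like this - everything here is an assumption (window size, limit, ban length, and the exempt prefix are placeholders, and a real version would persist state rather than hold it in memory):

```python
# A sliding-window rate limiter: count each IP's requests over the last
# WINDOW seconds, ban IPs that exceed LIMIT, exempt trusted spiders.
import time

WINDOW = 60            # sliding window, in seconds
LIMIT = 30             # pages allowed per window
BAN_SECONDS = 3600     # block for an hour; use None to block indefinitely
EXEMPT_PREFIXES = ("64.68.",)   # placeholder for trusted spider ranges

hits = {}              # ip -> recent request timestamps
banned = {}            # ip -> expiry time, or None for indefinite

def allow(ip, now=None):
    """Return True if this request should be served."""
    now = time.time() if now is None else now
    if ip.startswith(EXEMPT_PREFIXES):
        return True                      # legitimate spiders are exempt
    if ip in banned:
        expiry = banned[ip]
        if expiry is None or now < expiry:
            return False
        del banned[ip]                   # the ban has expired
    recent = [t for t in hits.get(ip, []) if now - t < WINDOW]
    recent.append(now)
    hits[ip] = recent
    if len(recent) > LIMIT:
        banned[ip] = None if BAN_SECONDS is None else now + BAN_SECONDS
        return False
    return True
```

An SSI call (or a PHP include) at the top of each page would run the check and serve a 403 when allow() says no.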
Even the old Matt’s wwwboard had a module like this; you would install it if you wanted to limit the time between posts from a given IP. I wish I knew more about building my own scripts. I wonder, though: would this even be possible, or would it put too much of a load on the server?
I mean.. we run all sorts of other SSI-based calls, and they don’t seem to slow things down. I suppose it would depend on how the script was written. At any rate, this continues to be a problem, and the rippers are becoming more aggressive all the time. As I see it, the only way to shut this down is to take control over their access times.
I watch legitimate spiders all the time, and I can’t recall any of them exceeding one page per minute. Most of them fetch a page every 5 minutes, or in some cases every few hours, so I can’t see a system like this tripping them up. Anyway, I’m just babbling away here.
Dave
Your thoughts on recording IPs and limiting access echo some of my own. A module called SpeedLimit [modperl.com] is available for the Apache server, but I don't have access to it on my hosts.
Maybe when I get better at scripting, I'll be able to do something like that myself. The main problem is one of scaling - an efficient way to search the previously-recorded IPs is needed, so that the system won't break down under heavy load. The critical path is the part of the code that runs while the recorded-IP file is open. In a multi-process environment such as a server, the file must be locked while it is open. This constrains the design of the IP-database access code, requiring intense attention to efficiency.
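That locked critical path might look something like this - a sketch only, with the file format and function name as my own assumptions:

```python
# Keep the exclusive lock only for the short read-update-write of the
# shared IP file, so concurrent server processes queue briefly at most.
import fcntl
import time

def record_hit(path, ip, now=None):
    """Append a timestamped hit for `ip` and return how many hits that
    IP has made so far, doing all file work under an exclusive lock."""
    now = time.time() if now is None else now
    with open(path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)    # other processes block here
        f.seek(0)
        count = sum(1 for line in f if line.split()[0] == ip)
        f.write(f"{ip} {now:.0f}\n")
        # lock is released when the file is closed
    return count + 1
```

The caller would compare the returned count (over some time window) against a threshold and deny the request if it is exceeded.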
I suppose a "poor man's" method might be to include that SSI sleep call - "sleep 20", or more likely "sleep 2" - at the end of often-accessed pages. This would slow down a robot, but not a human.
I laughed at the "fake links page" idea... Hey, why not serve a bogus "Blue Screen of Death" page to MS user-agents? ;)
Jim
As for creating a script: in theory it all sounds quite simple, but after numerous searches through the web and Usenet, I have found nothing that remotely resembles this. Hmm.. could it be that no one’s thought of it, or are there programming issues that cannot be overcome? You’re correct that you’d need a way to efficiently cache numerous IPs, have the script monitor their usage, and then expire them after a certain amount of inactivity. Indeed, on a busy site, this could cause problems if not implemented properly.
I wonder.. I’m not a PHP guru, but perhaps a PHP/MySQL solution is also a possibility here. I saw something like this last night, though the way the author explains it is somewhat confusing. It should be possible to create a PHP script that logs all IPs to a database; it may also be easier to manipulate, cache, and expire those IPs under the MySQL architecture. The recipe could consist of the main programming.php script, with an <!--#include virtual="..." --> tag placed in your html pages to call it. Of course, if you’re already running PHP pages, it’s even easier.
Queries? I’m currently experimenting with this on a smaller scale: I’m using a PHP, MySQL, and CGI solution to kill password hurlers. The script is incredibly simple, yet powerful. When someone tries to log into the member section, they get three tries before being blocked completely. Here’s how it works:
1. A user makes a failed login attempt
2. The 403 error document is pointed at 403.php
3. 403.php sends the failed IP to a MySQL database
4. The user tries twice more, and fails
5. 403.php checks that IP (now in the db) against the failure threshold - set to 3 failures
6. 403.php outputs a new .htaccess file blocking the IP from additional attempts
7. An email is sent notifying me of the blocked IP
8. A cron job runs once every 24 hours to clear the DB
9. A second cron job, also every 24 hours, outputs a fresh .htaccess file
10. Optionally, I can clear it manually by running the script from the shell
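In outline, the database side of those steps looks like this - a sketch with SQLite standing in for the thread's PHP/MySQL pair, and with the table name and function names as illustrative assumptions (the real thing is 403.php talking to MySQL):

```python
# Three-strikes blocker: log each failed IP, check it against a
# threshold, and regenerate the .htaccess deny list from the database.
import sqlite3

THRESHOLD = 3

def init_db(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS failures (ip TEXT)")

def record_failure(conn, ip):
    """Steps 3 and 5: log the failed IP, then test it against the
    threshold. Returns True once the IP should be blocked."""
    conn.execute("INSERT INTO failures VALUES (?)", (ip,))
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM failures WHERE ip = ?", (ip,)).fetchone()
    return count >= THRESHOLD

def htaccess_for(conn):
    """Step 6: rebuild the deny list from every IP at the threshold."""
    rows = conn.execute(
        "SELECT ip FROM failures GROUP BY ip HAVING COUNT(*) >= ?",
        (THRESHOLD,))
    return "\n".join(f"deny from {ip}" for (ip,) in rows)

def clear(conn):
    """Steps 8-10: the daily cron (or a manual run) empties the table."""
    conn.execute("DELETE FROM failures")
```

Regenerating the whole deny list from the database, rather than appending to .htaccess, is what lets the daily cron jobs reset everything cleanly.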
Stress test on the DB?
It’s been pounded by a password hurler with “rotating” IPs - after all, no one tries to crack a password from a single IP. Here’s what I saw: MySQL and 403.php successfully blocked every proxy after three unsuccessful attempts. Load on the db server? Negligible - it was just sitting there smoking a cigar. Even at full force, the server load hardly spiked. The Apache load can creep up over time, though, simply because of the processes that pile up in the queue when someone is firing 100 requests per second at it.
For this reason, I have it set to three failed attempts. It takes fewer resources to serve a plain 403 error than to keep answering a vicious password hurler with additional authentication challenges. Ok ok... I’m supposed to have a point to all this.
My point is that, from what I’ve seen, MySQL handles IP processing quite nicely. Mind you, a speed limiter is a somewhat different application, but from what I’ve seen thus far, this could be a real possibility.
Dave