On 2 occasions today I've been visited by someone (or something) who came in blind (no referrer), with the same IP and UA both times.
it's even trying to download bookmark links, and pulls all pages within 2 minutes
snippet from logs:
2003-09-09 17:22:12 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /default.asp - 200 0 0 130 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1) - -
2003-09-09 17:22:46 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /default.asp - 200 0 0 161 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)
2003-09-09 20:14:57 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /dir/dir/foo3.htm - 404 2 4203 206 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)
2003-09-09 20:15:04 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /dir/dir/foo5.htm - 404 2 4203 206 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)
The IP is 67.40.183.12
could it be a human using browser accelerator software?
Puzzled is all. The reason I noticed it is that it requested more pages than I actually have on my site, which isn't a problem just yet, but I'm trying to imagine the bandwidth issue if it were a large site...
Suzy
copy paste from logs..
so this is what it should look like (from another entry)
compatible;+MSIE+6.0;+Windows+NT+5.1
and this is what it says
compatible+;+MSIE+6.0;+Windows+NT+5.1
So why would someone do this?
Should I ban it?
Suzy
OrgName: U S WEST Internet Services
OrgID: USW
Address: 950 17th Street
Address: Suite 1900
City: Denver
StateProv: CO
NetRange: 67.40.0.0 - 67.42.255.255
- so it's some customer at this ISP. I recognize the name as i've seen their customers cause trouble before. I'm not in the US either (i guess from the "UK" that you're also not), so they have some bots running wild around the globe occasionally.
Just ban it, i'd say.
/claus
The plus signs (and the opening parenthesis) need to be escaped. i think this rewrite condition will catch it (you don't need the whole string, just the significant parts), and the rule will ban it:
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\+\(compatible\+;\+
RewriteRule (.*) - [F,L]
Here "(.*)" catches all requests, "-" means no rewrite of URLs, "F" is for forbidden, and "L" makes this the last rule to apply to this condition (=this User-Agent).
Just add it to your .htaccess file. If you don't have other rewrites already you may need to add this line before the other two:
RewriteEngine on
It just tells the server that it needs to accept rewrite conditions and rules. It's not even necessary in all cases, it depends on the server configuration, afaik ;)
/claus
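For anyone more at home in a scripting language than in .htaccess, here is a rough Python sketch of what the condition and rule above amount to. It assumes the User-Agent really arrives with literal plus signs, exactly as shown in the log snippet. The helper name `status_for` is made up for illustration, and Python's regex engine stands in for Apache's, so treat it as an approximation only.

```python
import re

# Same pattern as the RewriteCond, with the plus signs and the opening
# parenthesis escaped so that they are matched literally.
BAD_UA = re.compile(r"^Mozilla/4\.0\+\(compatible\+;\+")

def status_for(user_agent: str) -> int:
    """Return 403 for the suspicious User-Agent, 200 otherwise (hypothetical helper)."""
    return 403 if BAD_UA.match(user_agent) else 200

# User-Agent copied from the log snippet (assumed here to be what was sent).
logged = "Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)"
print(status_for(logged))                                                # 403
print(status_for("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))  # 200
```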
"^Mozilla/\d\.\d\s\(compatible\s;\sMSIE\s\d\.\d;\sWindows\sNT\s\d\.\d\)$"
I'm not sure as this is working off copy/pasted function (I'm learning ;)). This function already has pattern matches in it.. but they are all using ' \s ', a "space" rather than the actual "+" sign.. is that because it's not htaccess?
I'm also not sure if the code I have is able to pick out part of a string.. but there are other rules which just search for one word so I suppose it does..
so let's see if I get it.. if it starts with '^' that tells it to match at the start of a string and if it ends with $ that tells it to match at the end of a string...
\s - is whitespace
\d - matches any digit?
(where would I find a good resource for other "special characters")
so I'm thinking I could do this:
"^Mozilla/\d\.\d\s\(compatible\s;\sMSIE"
or
"\(compatible\s;\sMSIE"
would even do?
It hasn't been back today yet.. so I don't know how I'm getting on..
These regexes are fun ;)
Now I think I actually could read those htaccess "ban lists" and understand them for a change!
Suzy
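A quick way to check the three patterns from this post is to run them against a test string. The sketch below uses Python's `re` module, which treats `^`, `$`, `\s` and `\d` the same way for this purpose. It assumes the script sees the User-Agent with spaces rather than plus signs, and the variable names are only for illustration.

```python
import re

# The odd User-Agent as a script might see it, assuming spaces rather than "+".
ua = "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"

full = r"^Mozilla/\d\.\d\s\(compatible\s;\sMSIE\s\d\.\d;\sWindows\sNT\s\d\.\d\)$"
prefix = r"^Mozilla/\d\.\d\s\(compatible\s;\sMSIE"   # anchored at the start only
fragment = r"\(compatible\s;\sMSIE"                  # no anchors, matches anywhere

for name, pattern in [("full", full), ("prefix", prefix), ("fragment", fragment)]:
    # re.search looks for the pattern anywhere in the string; "^" and "$"
    # pin it to the start and end when they are present.
    print(name, bool(re.search(pattern, ua)))
```

All three report a match for this string, so the shorter patterns do the same job here as the full one.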
The plus sign is a special character that means "match the preceding 'whatsit' 1 or more times", so it needs to be escaped with a backslash first in order to be recognized as a plus: \+
Jim has corrected me on this issue before, i recall: it seems that not all regular expression syntax is accepted in rewrite conditions, so i'm not sure whether the "digits" shortcut (\d) will be.
Here's a great page for you with all those regexps on: [perl.com...]
And this one is also a must-have (regexp tester): [regexlib.com...]
>> would even do?
Yes it's good to focus on the important part, just replace the "\s" with "\+" - i suppose you could even do this:
\+;\+
The start and end anchors (^ and $) are useful as they speed up the evaluation of the expression. That's why i included one (the start) and not the other: i figured the speed gained from omitting part of the string would compensate for not having the full string and the end anchor.
/claus
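To see the difference between a bare plus and an escaped one, here is a small Python sketch. It only illustrates the quantifier behaviour described above. The test strings are fragments of the logged User-Agent, and Python's `re` is assumed to behave like the other engines on this point.

```python
import re

logged = "compatible+;+MSIE"  # fragment as written in the raw log

unescaped = re.compile(r"compatible+;+MSIE")   # "+" acts as "one or more"
escaped = re.compile(r"compatible\+;\+MSIE")   # "\+" is a literal plus sign

print(bool(unescaped.search(logged)))             # False: no literal "+" is matched
print(bool(unescaped.search("compatible;MSIE")))  # True: "e+" and ";+" each match once
print(bool(escaped.search(logged)))               # True
```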
[msdn.microsoft.com...]
This page in particular, seems like a TOC for the language (like the first link i posted, only structured differently):
[msdn.microsoft.com...]
I browsed three sections to find that \s \d and + were identical in behavior to the Perl version.
those links are gonna be well used from now on ;)
Right I think I've got the (IIS) difference though.. the pattern matching is taking place after a (VB) HTTP request (for the user agent, but I suppose I could add in IP's eventually! ;)) so there is no '+' sign in the returned result.. that must only be in the Raw log files?
So that would explain the \s.. part ;)
Thanks again for the crash course and links.. I'm off to update the list now.. the code I got yesterday is probably not updated..
Suzy :)
It would be interesting to know though, if i ever have to interpret an IIS log file.
/claus
Well here's my understanding, and please someone CMIIW as I don't understand htaccess, between us both we're bound to get there ;)
Those snippets of code I posted are direct from my host's raw IIS log files, and they include the '+' sign..
However, the script that I'm using is in an include file (to bypass my host?) that is called on all pages, so it's not reading the log files (does htaccess?). Instead it's issuing a direct Request.ServerVariables call. When this call is made and I test with a response.write, you don't get the '+', you get a space, so I'm presuming that this is what I have to work with.
Like I say I'm fairly new to this too, but the script seems to be working like a dream just now..
Oh and btw I read that 20+ page thread earlier, so have gone with the trimmed version as opposed to trying to add too much ;)
I could post the script link here if interested, if allowed?
Suzy
>> Request.ServerVariables
I know a little bit of VBA and VBScript so i've got a feeling for how the MS syntax works. This command simply reads the "ServerVariables" that come with the "Request".
It's just like .htaccess in that respect, as .htaccess reads the "Environment Variables" that come with the requests. I'm pretty sure that SV and EV are just two names for the same set of information. The .htaccess is not a script but a separate file (that can hold some script-like conditions). It is invoked for every request to the Apache server, but it does not need to be included in any file as a script call (as an example you can't do that with images, so that's just good) - the typical Apache server configuration makes sure that this file is always read if it's there.
Log files - both on Apache and IIS - are written after the request has been handled. These files record the who, what, and when, and they also record the server response in bytes and a status code (mostly 200 for OK, or another one starting with the number two i hope ; )
So, when you test your Request.ServerVariables with your own browser and response.write, you get the environment/server variables that your browser sends to the server - before they are written to the log file. These variables can be manipulated in any way (including inserting plus signs for spaces) before they are written to the log files, but usually they aren't (at least not on Apache - i'd think it was the same on IIS as this is a matter of using server resources)
Your browser - being a real browser and not a bot - will send off spaces in the User-Agent string. These spaces will be included in the ServerVariables. Spaces will also be what is recorded in the log files, unless IIS replaces all spaces in all user-agent strings with plus signs before writing them to the logs. Apache does not do this unless it's specifically programmed to do so, which i find hard to believe that anyone would want to do.
It should be easy to test this. Just find any user-agent string in your log files that contains spaces, and you know that IIS does not do this. So, if the bad bot is the only one having plus signs instead of spaces, this is also what it sends out in the user-agent string (which is, in turn, caught by your script and the request.ServerVariables). If this is true, you should use the plus sign (\+) instead of the space (\s).
On the other hand, if all entries in your log files have plus signs instead of spaces, then IIS does replace all spaces with plus signs before writing them to the logs.
Then, this bot may send off spaces, and it may send off plus signs. Most likely it will send off spaces, as the string is clearly made to imitate a real browser. The log-rewrite, however, makes it hard to rule out that it in fact sends out plus signs instead (as you would not know the difference when viewing the logs).
If all User-Agent entries in your log files have plus signs instead of spaces, you should clearly use the space (\s) instead of the plus (\+). Then, if the bot comes back and your rule does not catch it, you will know that it uses plus signs instead of spaces.
>> that 20+ page thread
I'm impressed that you read it, as you are on IIS, but it's definitely good, as there are many good posts and a lot of bot knowledge that can be applied to IIS as well :)
>> post the script link here if interested, if allowed?
I'd be interested in seeing it, just out of curiosity, but i guess it's not a good idea to post the link due to the TOS. Is it possible to post just the essential part of the code itself (the part that reads the server variables, matches this one bot, and then bans it by serving a 403 forbidden)?
/claus
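The two possibilities discussed above (the bot sends plus signs, or the logger swaps spaces for plus signs) can be laid out side by side in a small Python sketch. The helper `assume_logger_swapped_spaces` is hypothetical and simply undoes a space-to-plus substitution. It would also alter any genuine plus sign in a User-Agent, so it is only meant for reasoning about this one string.

```python
import re

def assume_logger_swapped_spaces(logged_ua: str) -> str:
    """Undo a space-to-plus substitution, assuming the logger made one (hypothetical helper)."""
    return logged_ua.replace("+", " ")

logged = "Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)"

# The two candidate forms the bot could actually have sent.
candidates = {
    "sent as logged": logged,
    "sent with spaces": assume_logger_swapped_spaces(logged),
}
patterns = {
    "plus": r"\(compatible\+;\+MSIE",
    "space": r"\(compatible\s;\sMSIE",
}

for case, ua in candidates.items():
    for label, pattern in patterns.items():
        print(case, "/", label, "pattern:", bool(re.search(pattern, ua)))
```

Each pattern catches only one of the two forms, which is why it matters to find out what the logger does before choosing between \+ and \s.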
the script is a bit long to post so I'll just post the path to the original text file (not mine!) (de-linked):
http:*//evolvedcode.net/content/code_crawlerfilter/code.txt
and here is the line I added you'll see where..although I removed some of the others just now ;)
UA_Add "^Mozilla/\d\.\d\s\(compatible\s;\s", sUserAgentList
I thought it better this way, as you can see that all the other UAs are listed with \s.
Ahhh.. I see your point that a UA could possibly spoof the '+'.. so I checked up on the UA in my stats programme, and it has spaces in it. So thanks for pointing that out!
Suzy
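As a rough illustration of the pattern-list idea (this is not a port of the linked script, whose internals are not shown here), the Python sketch below checks a User-Agent against a list of patterns. The names `UA_PATTERNS` and `is_blocked` are hypothetical, and only the one pattern quoted above comes from this thread. It also shows that an ordinary MSIE 6 string, which has no space before the semicolon, is not caught.

```python
import re

# Hypothetical pattern list; the single entry is the pattern quoted in the thread.
UA_PATTERNS = [
    r"^Mozilla/\d\.\d\s\(compatible\s;\s",
]

def is_blocked(user_agent: str) -> bool:
    """True if any listed pattern matches the User-Agent (illustrative helper)."""
    return any(re.search(pattern, user_agent) for pattern in UA_PATTERNS)

odd = "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"    # extra space before ";"
normal = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"  # ordinary MSIE 6 form

print(is_blocked(odd), is_blocked(normal))  # True False
```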