What's this?

not a spider but it's "pulling" everything

         

SuzyUK

8:40 pm on Sep 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



can anyone explain this to me..

on 2 occasions today I've been visited by someone(thing) who came in blind (no referrer) same IP and UA both times..

it's even trying to download bookmark links, and pulls all pages within 2 minutes

snippet from logs:
2003-09-09 17:22:12 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /default.asp - 200 0 0 130 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1) - -

2003-09-09 17:22:46 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /default.asp - 200 0 0 161 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)

2003-09-09 20:14:57 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /dir/dir/foo3.htm - 404 2 4203 206 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)

2003-09-09 20:15:04 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /dir/dir/foo5.htm - 404 2 4203 206 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)

The IP is 67.40.183.12

could it be a human using browser accelerator software?

Puzzled is all, the reason I noticed it is because it requested more pages than I actually have on my site, which isn't a problem just yet, but I'm trying to imagine the bandwidth issue if it were a large site...

Suzy

bull

9:18 pm on Sep 9, 2003 (gmt 0)

10+ Year Member



Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)

Seems to me to be some homemade bot, surely not MSIE 6. Are the "+" in your raw logfile? Additionally, MSIE does not put a space or '+' character before the first ';' (the one right after "compatible").

SuzyUK

6:50 am on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well spotted.. the additional plus sign is there.. you know I looked forever at that and didn't see it;)

copy paste from logs..
so this is what it should look like (another entry)
compatible;+MSIE+6.0;+Windows+NT+5.1

and this is what it says
compatible+;+MSIE+6.0;+Windows+NT+5.1

So why would someone do this?
Should I ban it?

Suzy

ukgimp

7:46 am on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is it a site ripper like Blackwidow?

wild wild guess ...

claus

7:58 am on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The IP says:

OrgName: U S WEST Internet Services
OrgID: USW
Address: 950 17th Street
Address: Suite 1900
City: Denver
StateProv: CO
NetRange: 67.40.0.0 - 67.42.255.255

- so it's some customer at this ISP. I recognize the name as i've seen their customers cause trouble before. I'm not in the US either (i guess from the "UK" that you're also not), so they have some bots running wild around the globe occasionally.

Just ban it, i'd say.

/claus

SuzyUK

8:22 am on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi guys, thanks muchly

>>ban it.. ;)
Well I'd better learn some regular expressions then. I'll give it a go..

>>UK, yep you guess right!

Suzy

bull

9:29 am on Sep 10, 2003 (gmt 0)

10+ Year Member



So why would someone do this?

Perhaps it wants your email addresses, but to me the fact that it spoofs a UA is already reason enough to ban it.

claus

10:14 am on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> regexps

The plus signs need to be escaped, i think this rewrite condition will catch it (you don't need the whole string, just the significant parts), the rule will ban it:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\+(compatible\+;\+
RewriteRule (.*) - [F,L]

"(.*)" to catch all requests, "-" for no rewrite of URLs, "F" for forbidden, "L" for last rule to apply to this condition (=this User-Agent)

Just add it to your .htaccess file. If you don't have other rewrites already you may need to add this line before the other two:

RewriteEngine on

- it just tells the server that it needs to accept rewrite conditions and -rules. It's not even necessary in all cases, it depends on the server configuration, afaik ;)

/claus

jdMorgan

12:40 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



SuzyUK,

Parentheses need to be escaped, too. No backreference is needed in the RewriteRule pattern, and [L] is redundant when used with [F]:


RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\+\(compatible\+\;\+
RewriteRule .* - [F]

Jim
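A rough way to see why the parentheses matter is to try both patterns in Python's `re` module. This is only a stand-in: Apache uses its own regex library, so the error handling will differ, but the escaping rules for `(`, `+` and `.` are the same. The two user-agent strings are taken from the raw logs quoted earlier in the thread; the helper names `compiles` and `is_banned` are made up for this sketch.

```python
import re

# User-Agent strings as they appear in the raw logs quoted earlier in the thread.
BOT_UA = "Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)"
REAL_UA = "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)"

# First attempt: the "(" is unescaped, so it opens a group that never closes.
UNESCAPED = r"^Mozilla/4\.0\+(compatible\+;\+"
# Corrected pattern: the literal "(" is escaped as well.
ESCAPED = r"^Mozilla/4\.0\+\(compatible\+\;\+"


def compiles(pattern: str) -> bool:
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def is_banned(user_agent: str) -> bool:
    """Return True if the user agent matches the corrected pattern."""
    return re.search(ESCAPED, user_agent) is not None


if __name__ == "__main__":
    print("unescaped pattern compiles:", compiles(UNESCAPED))
    print("escaped pattern compiles:  ", compiles(ESCAPED))
    print("bot banned:                ", is_banned(BOT_UA))
    print("real MSIE banned:          ", is_banned(REAL_UA))
```

The corrected pattern only catches the string with the stray "+" before the first ";", so a normal MSIE 6 entry in the same log format is left alone.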

SuzyUK

1:53 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks claus and JD, I'm on IIS (no htaccess) though but I have this

"^Mozilla/\d\.\d\s\(compatible\s;\sMSIE\s\d\.\d;\sWindows\sNT\s\d\.\d\)$"

I'm not sure, as this is working off a copy/pasted function (I'm learning ;)). This function already has pattern matches in it.. but they are all using ' \s ', a "space" rather than the actual "+" sign.. is that because it's not htaccess?

I'm also not sure if the code I have is able to pick out part of a string.. but there are other rules which just search for one word so I suppose it does..

so let's see if I get it.. if it starts with '^' that tells it to match at the start of a string and if it ends with $ that tells it to match at the end of a string...
\s - is whitespace
\d -? match any number?

(where would I find a good resource for other "special characters")

so I'm thinking I could do this:
"^Mozilla/\d\.\d\s\(compatible\s;\sMSIE"

or
"\(compatible\s;\sMSIE"

would even do?

It hasn't been back today yet.. so I don't know how I'm getting on..

These regex's are fun ;)
Now I think I actually could read those htaccess "ban lists" and understand them for a change!

Suzy

claus

2:26 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



wow, that's fast :) ..the \d is for digits all right, but the space and the plus are not the same (\s is space)

The plus sign is a special character that means "match the preceding 'whatsit' 1 or more times" so it needs to be escaped with a backslash first in order to be recognized as a plus: \+

Jim has corrected me on this issue before, i recall: It seems that not all regular expressions will be accepted in rewrite-conditions - so, will the "digits" shortcut?

Here's a great page for you with all those regexps on: [perl.com...]

And this one is also a must-have (regexp tester): [regexlib.com...]

>> would even do?

Yes it's good to focus on the important part, just replace the "\s" with "\+" - i suppose you could even do this:

\+;\+

The start and end anchors (^ and $) are useful as they speed the evaluation of the expression up, that's why i included one (the start) and not the other - i figured the speed gained from omitting part of the string would compensate for not having the full string and the end anchor.

/claus


Added:
I'm not familiar with the IIS environment so there may be some differences in the way Microsoft has implemented Regular Expressions. I only know the original Perl version. However, i found this Microsoft resource on .NET Regular Expressions, my best guess is that it's the same used in IIS:

[msdn.microsoft.com...]

This page in particular, seems like a TOC for the language (like the first link i posted, only structured differently):
[msdn.microsoft.com...]

I browsed three sections to find that \s \d and + were identical in behavior to the Perl version.
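The difference between a bare "+" and an escaped "\+" described above can be checked with a small sketch. It uses Python's `re` module purely for illustration, on the assumption that it treats `+` the same way Perl does for this simple case; the function names are invented for the example.

```python
import re

# Unescaped "+" is a quantifier: "e+" means "one or more of the letter e".
QUANTIFIER = re.compile(r"compatible+;")
# Escaped "\+" matches a literal plus sign.
LITERAL = re.compile(r"compatible\+;")


def matches_quantifier(text: str) -> bool:
    """True if the pattern with the unescaped plus is found in the text."""
    return QUANTIFIER.search(text) is not None


def matches_literal(text: str) -> bool:
    """True if the pattern with the escaped plus is found in the text."""
    return LITERAL.search(text) is not None


if __name__ == "__main__":
    for sample in ("compatible;", "compatibleee;", "compatible+;"):
        print(sample, matches_quantifier(sample), matches_literal(sample))
```

The unescaped version accepts "compatible;" and "compatibleee;" but never the string that actually contains a plus sign, which is the opposite of what the ban rule needs.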

SuzyUK

3:20 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow, thanks claus!

those links are gonna be well used from now on ;)

Right I think I've got the (IIS) difference though.. the pattern matching is taking place after a (VB) HTTP request (for the user agent, but I suppose I could add in IP's eventually! ;)) so there is no '+' sign in the returned result.. that must only be in the Raw log files?

So that would explain the \s.. part ;)

Thanks again for the crash course and links.. I'm off to update the list now.. the code I got yesterday is probably not updated..

Suzy :)

claus

7:18 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't quite understand, did the real user-agent have spaces or plus-signs? In an HTTP request a space will usually get translated to "%20" and not "+" so i'm a bit confused. Anyway, you sound convinced and i'm certain you know more about IIS than me, so i hope it's just something about the way IIS works.

It would be interesting to know though, if i ever have to interpret an IIS log file.

/claus

SuzyUK

8:22 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hi /claus

Well here's my understanding, and please someone CMIIW as I don't understand htaccess, between us both we're bound to get there ;)

Those snippets of code I posted are direct from my host's raw IIS log files, and they include the '+' sign..

However the script that I'm using is in an include file (to bypass my host?) and that is called on all pages so therefore it's not reading the log files (does htaccess?), instead it's issuing a direct Request.ServerVariables call, now when this call is made and I test with a response.write you don't get the '+' you get a space, so I'm presuming that this is then what I have to work with.

Like I say I'm fairly new to this too, but the script seems to be working like a dream just now..

Oh and btw I read that 20+ page thread earlier, so have gone with the trimmed version as opposed to trying to add too much ;)

I could post the script link here if interested, if allowed?

Suzy

claus

7:08 am on Sep 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm glad i get to know a little bit more about how the IIS works :) I'm still not 100% convinced if you should use spaces or plus signs though. Someone more experienced with IIS would probably know instantly, but as i'm only working with Apache i need to get the basics..hrm.. fundamentals right before i feel sure about it ;)

>> Request.ServerVariables

I know a little bit of VBA and VBScript so i've got a feeling for how the MS syntax works. This command simply reads the "ServerVariables" that come with the "Request".

It's just like .htaccess in that respect, as .htaccess reads the "Environment Variables" that come with the Requests. I'm pretty sure that SV and EV are just two names for the same set of information. The .htaccess is not a script but a separate file (that can hold some script-like conditions). It is invoked for every request to the Apache server, but it does not need to be included in any file as a script call (as an example you can't do that with images, so that's just good) - the typical Apache server configuration makes sure that this file is always read if it's there.

Log files - both on Apache and IIS - are written after the request has been handled. These files record the who, what, and when, and they also record the server response in bytes and a status code (mostly 200 for OK, or another one starting with the number two i hope ; )

So, when you test your Request.ServerVariables with your own browser and response.write, you get the environment/server variables that your browser sends to the server - before they are written to the log file. These variables can be manipulated in any way (including inserting plus signs for spaces) before they are written to the log files, but usually they aren't (at least not on Apache - i'd think it was the same on IIS as this is a matter of using server resources)

Your browser - being a real browser and not a bot - will send off spaces in the User-Agent string. These spaces will be included in the ServerVariables. Spaces will also be what is recorded in the log files, unless IIS replaces all spaces in all user-agent strings with plus signs before writing them to the logs. Apache does not do this unless it's specifically programmed to do so, which i find hard to believe that anyone would want to do.

It should be easy to test this. Just find any user-agent string in your log files that contains spaces, and you know that IIS does not do this. So, if the bad bot is the only one having plus signs instead of spaces, this is also what it sends out in the user-agent string (which is, in turn, caught by your script and the request.ServerVariables). If this is true, you should use the plus sign (\+) instead of the space (\s).

On the other hand, if all entries in your log files have plus signs instead of spaces, then IIS does replace all spaces with plus signs before writing them to the logs.

Then, this bot may send off spaces, and it may send off plus signs. Most likely it will send off spaces, as the string is clearly made to imitate a real browser. The log-rewrite, however, makes it hard to rule out that it in fact sends out plus signs instead (as you would not know the difference when viewing the logs).

If all User-Agent entries in your log files have plus signs instead of spaces, you should clearly use the space (\s) instead of the plus (\+). Then, if the bot comes back and your rule does not catch it, you will know that it uses plus signs instead of spaces.
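The log check described above could be sketched roughly as follows. It assumes the user agent is the whitespace-separated token starting with "Mozilla/", which holds for the log lines posted in this thread but may not hold for other log layouts. The function name `classify_user_agents` and the third sample line are hypothetical and only there for illustration.

```python
from typing import Iterable, Tuple


def classify_user_agents(lines: Iterable[str]) -> Tuple[int, int]:
    """Count user agents logged with '+' versus those logged with real spaces.

    Returns (plus_style, space_style). The user agent is assumed to start at
    the first token beginning with "Mozilla/"; a token with no '+' in it
    suggests that the spaces were written to the log unchanged.
    """
    plus_style = 0
    space_style = 0
    for line in lines:
        for token in line.split():
            if token.startswith("Mozilla/"):
                if "+" in token:
                    plus_style += 1
                else:
                    space_style += 1
                break
    return plus_style, space_style


# Two lines as posted in the thread.
THREAD_LINES = [
    "2003-09-09 17:22:12 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /default.asp"
    " - 200 0 0 130 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1) - -",
    "2003-09-09 20:14:57 sttldslgw30poolJ12.sttl.uswest.net - 80 GET /dir/dir/foo3.htm"
    " - 404 2 4203 206 HTTP/1.1 Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)",
]

# Hypothetical line showing what an unmodified user agent would look like.
HYPOTHETICAL_LINE = (
    "2003-09-09 21:00:00 example.host - 80 GET /default.asp - 200 0 0 130"
    " HTTP/1.1 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
)

if __name__ == "__main__":
    print(classify_user_agents(THREAD_LINES))
    print(classify_user_agents(THREAD_LINES + [HYPOTHETICAL_LINE]))
```

If the second count stays at zero across a whole log file, the plus signs are most likely added when the log is written, and a pattern matched against the live request header should use \s.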

>> that 20+ page thread

I'm impressed that you read it, as you are on IIS, but it's definitely good, as there are many good posts and a lot of bot knowledge that can be applied to IIS as well :)

>> post the script link here if interested, if allowed?

I'd be interested in seeing it, just out of curiosity, but i guess it's not a good idea to post the link due to the TOS. Is it possible to post just the essential part of the code itself (the part that reads the server variables, matches this one bot, and then bans it by serving a 403 Forbidden)?

/claus

SuzyUK

2:37 pm on Sep 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi \claus

the script is a bit long to post so I'll just post the path to the original text file (not mine!) (de-linked):
http:*//evolvedcode.net/content/code_crawlerfilter/code.txt

and here is the line I added you'll see where..although I removed some of the others just now ;)

UA_Add "^Mozilla/\d\.\d\s\(compatible\s;\s", sUserAgentList

I thought it better this way as you can see that all the other UA's are listed with \s,

Ahhh.. I see your point that a UA could possibly spoof the '+'.. so I checked up on the UA in my stats programme, and it has spaces in it. So thanks for pointing that out!

Suzy
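The idea behind the filter, a list of patterns checked against the User-Agent header as the script receives it, can be sketched like this. It is not the linked script, which is VBScript and not reproduced here; the names `BLOCKED_PATTERNS` and `status_for` are invented, and Python is used only as a stand-in. The one pattern in the list is the line added above.

```python
import re

# Patterns are written against the header as the script sees it (with spaces),
# not against the raw log entry (where the spaces show up as "+").
BLOCKED_PATTERNS = [
    re.compile(r"^Mozilla/\d\.\d\s\(compatible\s;\s"),
]


def status_for(user_agent: str) -> int:
    """Return 403 if the user agent matches a blocked pattern, else 200."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(user_agent):
            return 403
    return 200


if __name__ == "__main__":
    # The bot's header, assuming it really sends spaces as the stats program suggests.
    print(status_for("Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"))
    # A genuine MSIE 6 header: no space before the first semicolon.
    print(status_for("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
    # The raw log form with plus signs is not what the script receives.
    print(status_for("Mozilla/4.0+(compatible+;+MSIE+6.0;+Windows+NT+5.1)"))
```

Because \s does not match a literal "+", the same pattern would miss the bot if it turned out to send plus signs in the header itself, which is the case discussed a few posts earlier.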

claus

3:32 pm on Sep 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, i did like this line, i will include it in my own 403's right away:

We suspect you are using an automated process to access this website - please use a normal browser.

/claus