Forum Moderators: open


Banning Microsoft URL Control

Best method?


bouncybunny

11:57 am on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi

I'm using this method to ban bad bots.

SetEnvIfNoCase User-Agent "^badbot1" bad_bot
SetEnvIfNoCase User-Agent "^badbot2" bad_bot
SetEnvIfNoCase User-Agent "^badbot3" bad_bot
SetEnvIfNoCase User-Agent "^badbot4" bad_bot

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
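One caveat worth noting, which the thread doesn't raise: <Limit GET POST> restricts only those two methods (restricting GET also restricts HEAD), so requests made with other methods are not covered by the denial. A sketch of the same rules without the <Limit> wrapper, using the same placeholder bot names, so they apply to every request method:

```apache
SetEnvIfNoCase User-Agent "^badbot1" bad_bot
SetEnvIfNoCase User-Agent "^badbot2" bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot
```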

But I'm a bit confused as to how best to ban a bot that is using Microsoft URL Control to hammer a script on my site. The more I research, the more confused I get.

Can I use;

SetEnvIfNoCase User-Agent "^Microsoft" bad_bot

Will this be enough, or am I likely to ban all things with Microsoft in the name?

Thanks

volatilegx

3:04 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Will this be enough, or am I likely to ban all things with Microsoft in the name?

The ^ anchors the pattern at the start of the string, so it will only ban bots whose user-agent begins with "Microsoft". It should work fine for your purpose.
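Not Apache itself, but for anyone who wants to sanity-check the scope of that pattern: SetEnvIfNoCase does a case-insensitive regex search against the header value, and Python's re module behaves the same way for a pattern this simple. A quick sketch, using UA strings from this thread:

```python
import re

# "^" anchors the pattern at the start of the string, and SetEnvIfNoCase
# matches case-insensitively, which re.IGNORECASE reproduces here.
pattern = re.compile(r"^Microsoft", re.IGNORECASE)

# Matches: the UA begins with "Microsoft" (any case)
assert pattern.search("Microsoft URL Control - 6.00.8862")
assert pattern.search("microsoft url control - 6.00.8169")

# No match: a browser UA that doesn't start with "Microsoft"
assert not pattern.search("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)")
```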

GaryK

6:57 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep, Dan is right. That'll work fine for anything that starts with Microsoft, and it'll cover most of the nasty stuff from MSFT. There are a few others that I ban as well:

MFC Foundation Class Library*
MFHttpScan
MSN Feed Manager
MSProxy/*

The MSN bots pretty much all start with msnbot. At least all the ones that are confirmed as being from MSN. I have seen the following two bots coming from MSFT IP Addresses however my contact at MSFT tells me they are not official:

lanshanbot/1.0*
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot, crawler)

bouncybunny

7:47 am on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks guys. So it seems safe to ban all things starting with Microsoft?

I always wondered what lanshanbot/1.0* was. I've been banning it because it didn't have any contact information.

wilderness

8:58 am on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I had to spend some time looking through my data to work out which method I use to keep out the UA named in the subject line.

I find comment statements just make things harder to read (my opinion) and don't generally use them.

The only references I was able to find were to four-digit numbers, matched with an "ends with" pattern.
These numbers were recorded in 2002 and still seem to be effective.
However, in all fairness, I don't believe we are seeing the volume of Microsoft URL Control in the UAs that we once did.

bouncybunny,
could you possibly provide the FULL UA which would include the numbers that accompany the words?

Don

bouncybunny

4:43 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Don

I'm sorry, I have to be honest, I'm not a big techie and I find your post quite hard to understand.

But the full user agent for two of these visitors is;

Microsoft URL Control - 6.00.8862

Microsoft URL Control - 6.00.8169

There are a couple of others too, I think. To be honest, they are not grabbing huge sections of my site any more. Just apparently unconnected pages, along with some incorrect and seemingly random, incomplete URLs.

For example they might grab something like http://www.webmasterworld.com/foru

Very odd.

[edited by: encyclo at 2:06 am (utc) on April 14, 2007]
[edit reason] delinked broken link [/edit]

GaryK

4:54 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you asking me to supply the full line from my logs where the Microsoft URL Control user agent is? I.E. IP number and so on?

The UA refers to the User Agent. What he wants to see is the full UA, like: Microsoft URL Control - 6.00.8877.

I let all this garbage in because I need to see what it's up to for my project.

I'm not sure about this UA being on the decline. The last time I saw it was April 8, 2007. It's visited close to 400 times since I first saw it several years ago.

EDIT: No problem BB. ;)

[edited by: GaryK at 5:04 pm (utc) on April 13, 2007]

bouncybunny

4:58 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Gary.

Sorry, I edited my post whilst you were posting yours. The light dawned on me. ;)

jdMorgan

5:08 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you're worried about making the scope of the denial too wide, just make the pattern more specific:

SetEnvIfNoCase User-Agent "^Microsoft\ URL\ Control" bad_bot

Here the UA must start with "Microsoft URL Control" but may be followed by a dash and revision numbers, etc. I'm not actually sure that the spaces have to be escaped as shown inside a quoted string, but it won't hurt.
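A quick check of both points, again using Python's re module as a stand-in for Apache's regex engine (the second UA below is hypothetical, just for illustration):

```python
import re

# Escaping a literal space ("\ ") matches the same thing as a plain space,
# so the two spellings of the pattern are equivalent.
escaped = re.compile(r"^Microsoft\ URL\ Control", re.IGNORECASE)
plain = re.compile(r"^Microsoft URL Control", re.IGNORECASE)

ua = "Microsoft URL Control - 6.00.8862"
assert escaped.search(ua) and plain.search(ua)

# The tighter pattern no longer catches everything starting with "Microsoft"
assert not escaped.search("Microsoft HypotheticalAgent/1.0")
```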

Jim

bouncybunny

5:14 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Jim.

I wasn't sure how to deal with the spaces between Microsoft + URL + Control. I'm still learning all this stuff.

wilderness

5:14 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Many thanks Gary.

The ending numbers were sufficient.

I have these old numbers denied

SetEnvIf User-Agent 30630$ keep_out
SetEnvIf User-Agent 0425$ keep_out
SetEnvIf User-Agent 47$ keep_out
SetEnvIf User-Agent 48$ keep_out
SetEnvIf User-Agent 51$ keep_out
SetEnvIf User-Agent 53$ keep_out
SetEnvIf User-Agent 63$ keep_out
SetEnvIf User-Agent 8862$ keep_out
SetEnvIf User-Agent 8877$ keep_out
SetEnvIf User-Agent "39\)$" keep_out
SetEnvIf User-Agent 4319$ keep_out

Although I'm sure not all of them are related to URL Control.
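For anyone copying this approach: "$" anchors the pattern at the end of the UA string, and the shorter the suffix, the broader the match. A sketch (the second UA is hypothetical, purely to show the risk of a two-digit suffix):

```python
import re

# "8862$" catches any UA ending in those four digits, which is fairly
# specific. A short suffix like "47$" will also deny unrelated agents
# whose version string merely happens to end in 47.
ends_8862 = re.compile(r"8862$")
ends_47 = re.compile(r"47$")

assert ends_8862.search("Microsoft URL Control - 6.00.8862")
assert ends_47.search("HarmlessHypotheticalAgent/1.0.47")  # accidental match
assert not ends_8862.search("Microsoft URL Control - 6.00.8169")
```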

Many thanks bouncybunny.

wilderness

5:17 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure about this UA being on the decline. The last time I saw it was April 8, 2007. It's visited close to 400 times since I first saw it several years ago.

Gary,
Perhaps the software only visits where access is allowed (after the initial visit)?

Don

incrediBILL

9:18 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't think it's stand-alone software. I think the user agent string "Microsoft URL control" comes from an HTTP connection class in a Windows library, kind of like the libwww-perl thing.

Looks like it's popular with scrapers and email harvesters.

I would block only the exact user agent string "Microsoft URL control" to avoid any accidental whacking of legit things from MS.

GaryK

12:37 am on Apr 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Many thanks Gary.

You're welcome. :)

Perhaps the software only visits where access is allowed (after the initial visit)?

Nope. I have other sites where it is banned using a similar approach (ISAPI_Rewrite, the .htaccess equivalent for Windows servers), and it gets turned away regularly. As best I can tell it seems to target files that are likely to have e-mail addresses in them. It regularly requests the contacts pages and member profile pages on my sites. It doesn't actually get them <evil_grin> but it does try. On the projects site it's mostly used to download the files I'm offering to the public for free, so I really don't care about it there, and it also gives me a chance to see what else it's up to.

kind of like the libwww-perl thing.

That's basically correct Bill. The UA is not part of a product from MSFT.

bouncybunny

1:36 am on Apr 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would only block the entire user agent "Microsoft URL control" to avoid any accidental whacking of legit things from MS.

Thanks. I've implemented jdMorgan's method;

SetEnvIfNoCase User-Agent "^Microsoft\ URL\ Control" bad_bot

and that seems to work a treat when I test it with Firefox and the User Agent Switcher plugin. It blocks "Microsoft URL control" but lets through "Microsoft".

Am I right in understanding that most of you agree this UA should be banned in most cases? It seems to be a bit of an unknown in some of the threads I've been reading.

As best I can tell it seems to target files that are likely to have e-mail addresses in them.

In my case, visits from most IPs only requested a few pages here and there. But one IP repeatedly requested the same two pages dozens of times. These two pages are connected (i.e. page 1 and 2 of the same article), but do not contain any email addresses. They do, however, contain the names of many people, so perhaps this was interpreted as possible contact information?

wilderness

1:57 am on Apr 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



After many pages of searching:

"Microsoft URL Control" is the user-agent given to applications that use the MSInet API under Visual C++ (source page not recorded, however the date was 1998).

Whether this is the same tool, someone else may be able to determine.

[microsoft.com...]

bouncybunny

2:02 am on Apr 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks.

thetrasher

10:13 am on Apr 14, 2007 (gmt 0)

10+ Year Member



kind of like the libwww-perl thing
ActiveX control - see post #3063810 from msndude:
[webmasterworld.com...]