MS Crawler Hiding as a Browser - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

MS Crawler Hiding as a Browser

incrediBILL

2:52 am on Dec 29, 2011 (gmt 0)

Here's something that quacks like a browser from the wonderful world of Microsoft IPs.

Once anything asks for robots.txt, the site bags and tags it as a robot and serves 403 Forbidden from then on until the visitor answers the Turing test. This one is obviously a bot: it doesn't seem to mind the 403s, never asks for images or CSS, just bangs its head on some 403s and leaves.

65.52.0.229 - - [08/Dec/2011:07:04:28 +0000] "GET / HTTP/1.1" 200 4047 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [08/Dec/2011:07:06:49 +0000] "GET /robots.txt HTTP/1.1" 200 731 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [08/Dec/2011:09:20:38 +0000] "GET /some_page.php HTTP/1.1" 403 804 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [08/Dec/2011:16:00:06 +0000] "GET / HTTP/1.1" 403 804 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [09/Dec/2011:05:23:59 +0000] "GET / HTTP/1.1" 403 804 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [24/Dec/2011:11:56:39 +0000] "GET /faq.htm HTTP/1.1" 403 1625 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
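
The trap described above can be sketched roughly as follows. This is only a minimal illustration in Python, not the actual mechanism running on the site: the class name `RobotTrap`, the in-memory sets, and the `passed_challenge` hook are all assumptions made up for the example.

```python
class RobotTrap:
    """Tag any client that fetches /robots.txt and refuse it afterwards."""

    def __init__(self):
        self.tagged = set()   # IPs that have requested robots.txt
        self.cleared = set()  # IPs that later passed a human challenge

    def passed_challenge(self, ip):
        """Hypothetical hook: call when a visitor answers the challenge."""
        self.cleared.add(ip)

    def status_for(self, ip, path):
        """Return the HTTP status code to serve for this request."""
        if path == "/robots.txt":
            # robots.txt itself is still served, but the asker gets tagged
            self.tagged.add(ip)
            return 200
        if ip in self.tagged and ip not in self.cleared:
            return 403
        return 200


trap = RobotTrap()
requests = [
    ("65.52.0.229", "/"),
    ("65.52.0.229", "/robots.txt"),
    ("65.52.0.229", "/some_page.php"),
    ("65.52.0.229", "/"),
]
statuses = [trap.status_for(ip, path) for ip, path in requests]
print(statuses)  # [200, 200, 403, 403]
```

The four statuses follow the same 200, 200, 403, 403 sequence as the first four log lines. A real setup would need persistent storage and some expiry policy, which this sketch leaves out.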

keyplyr

4:41 am on Dec 29, 2011 (gmt 0)

I've been seeing that and other covert activity from M$ ranges forever it seems, but lately even more.

DeeCee

6:43 am on Dec 29, 2011 (gmt 0)

Microsoft do this constantly but are not intelligent enough to add an MS identifier to their "pretend" agents.

Google does the same thing to test your output for various types of "agents", such as different mobile phones and others, but they at least almost always add some kind of 'Google' tag at the end to show it is them.

What it means overall is that each bot owner doesn't come by just once to scrape content; they come by 1, 2, 3 or more times to scrape the same content as different things, adding more server and network load. Testing iPhone, Blackberry, ..., ...

Add to that 'Google Preview', which I expect will have a Microsoft clone soon. Preview allows them to keep the average potential ad-clicking user on their own site longer, away from actual Internet content.

I get more and more annoyed with all these content and information scrapers every day.
Try watching what happens when a blog entry connected to Twitter is posted.
A whole swarm of bots arrives for many hours after each posting, feeding off Twitter. Most of them are anonymous scrapers. Just recently I added Vocus, Inc (also owner of PrWeb) to my block lists. They are still hacking away, retrying over and over every few minutes, despite nothing but 403s in response. All that work, and with NO agent-string at all to show their identity to the average site owner.

I guess we should be happy Microsoft even bother to have an agent-string.

Pfui

1:11 pm on Dec 29, 2011 (gmt 0)

I guess we should be happy Microsoft even bother to have an agent-string.

Ah, but MSN's bots routinely do not. Just yesterday:

IP: 65.52.32.151
UA: -

17:58:44 /
17:58:55 /
17:59:09 /
17:59:20 /
17:59:32 /
17:59:43 /
17:59:54 /
18:00:06 /
18:00:17 /
18:00:29 /
18:00:40 /

robots.txt? NO

The exact same 'no UA, no robots.txt, no referrer, 11-hits-to-same-file' pattern has been going on for years. [webmasterworld.com...]
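
For anyone who wants to spot that pattern in their own logs, here is a rough Python sketch. It assumes the log has already been parsed into `(ip, user_agent, referrer, path)` tuples; the function name `find_stealth_ips`, the `min_repeats` threshold, and the second sample visitor (a documentation-range IP) are inventions for the example, not anything MSN or this forum defines.

```python
from collections import Counter, defaultdict


def find_stealth_ips(hits, min_repeats=11):
    """Flag IPs with no UA, no referrer, no robots.txt request,
    and at least `min_repeats` hits to one and the same path.

    hits: iterable of (ip, user_agent, referrer, path) tuples.
    """
    by_ip = defaultdict(list)
    for ip, ua, ref, path in hits:
        by_ip[ip].append((ua, ref, path))

    flagged = []
    for ip, rows in by_ip.items():
        blank_ua = all(ua in ("", "-") for ua, _, _ in rows)
        blank_ref = all(ref in ("", "-") for _, ref, _ in rows)
        asked_robots = any(path == "/robots.txt" for _, _, path in rows)
        # How often the most-requested path was hit by this IP
        top_count = Counter(path for _, _, path in rows).most_common(1)[0][1]
        if blank_ua and blank_ref and not asked_robots and top_count >= min_repeats:
            flagged.append(ip)
    return flagged


# Eleven blank-UA hits on "/" as in the listing above, plus one made-up visitor
sample = [("65.52.32.151", "-", "-", "/")] * 11
sample.append(("192.0.2.10", "Mozilla/5.0 (X11; Linux x86_64)", "-", "/"))
print(find_stealth_ips(sample))  # ['65.52.32.151']
```

The threshold of 11 simply mirrors the hit count reported here; lowering it would catch shorter runs at the cost of more false positives.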

DeeCee

4:32 pm on Dec 29, 2011 (gmt 0)

You are correct.
I think website owners have been getting too complacent about protecting their content against scrapers, whether or not those scrapers claim to be search engines of various kinds.

Everyone and their mother (and sisters and 6-month-old baby brothers) today thinks they should create another data-collector site, a site selling competitor spying, a link-tracker site, a "monitor what is said about you" site, or some other. Whatever they "claim" to be at this particular time (usually a lie on their website, if it can be found), the problem is that we never really know what they will do now or in the future with all that content and/or information.

Anonymous agents DO NOT get in on my servers.

A simple

SetEnvIf User-Agent "^$" bad_bot="NoAgentString~impersonator"

with various types of blocks for the bad_bot environment variable kills off the blanks.

If you do not even want to tell me who you are, you do not get in.
Similarly, if you are too stupid to change the Agent-String from the default string in the public code library you used to create yet another scraper, you do not get in either.

All the "Jakarta Commons-HttpClient", "Zend_Http_Client", "RomeClient", and other anonymous bots get killed too. Anyone wanting to scrape without an accurate Agent String gets killed.
So do all the human impersonators (crawlers with only "human" type strings), when caught.
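
For sites that filter in application code rather than in Apache, the same two rules (blank agent, unchanged library default) might look something like this in Python. It is a sketch under assumptions: the function name `block_reason` and the reason labels are made up here, and the prefix list contains only the three library names mentioned above.

```python
# Default agent strings named in this post; extend as needed
DEFAULT_LIBRARY_AGENTS = (
    "Jakarta Commons-HttpClient",
    "Zend_Http_Client",
    "RomeClient",
)


def block_reason(user_agent):
    """Return a short reason string if the UA should be blocked, else None."""
    ua = (user_agent or "").strip()
    # Access logs usually record a missing User-Agent header as "-"
    if ua in ("", "-"):
        return "NoAgentString"
    for name in DEFAULT_LIBRARY_AGENTS:
        if ua.startswith(name):
            return "DefaultLibraryAgent"
    return None


print(block_reason(""))
print(block_reason("Jakarta Commons-HttpClient/3.1"))
print(block_reason("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"))
```

Note that the last call returns `None`: a crawler wearing a browser string passes this check, so the human impersonators still have to be caught by behavior or IP.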

If you want to steal other people's content, the least you can do is say who you are, and how letting you steal it will help the site owner.

The higher the percentage of site owners that block them, the less value their databases will have.
Missing information and missing sites dilute the value of those datasets and will cost them customers.

dstiles

7:37 pm on Dec 29, 2011 (gmt 0)

Bill - none of those UAs looks genuine to me. There are usually extra "fields" in them proclaiming (usually ad nauseam) what .NET etc version they are.

incrediBILL

6:55 am on Jan 9, 2012 (gmt 0)

Technically it's a plain vanilla MSIE 7 install on Vista, highly unlikely, but possible.

My guess, based on behavior, is that it's a screenshot tool, but it seems unlikely they'd use that MSIE version or Vista for that matter.

Don't forget MS has some cloud computer service they resell if I'm not mistaken, like AWS, so I'm completely suspicious about whatever this stuff is at this point.

keyplyr

8:21 am on Jan 9, 2012 (gmt 0)

Don't forget MS has some cloud computer service they resell if I'm not mistaken, like AWS

It's called Azure

70.37.0.0 - 70.37.191.255
70.37.0.0/17
70.37.128.0/18
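
As a quick check on how the start-to-end range maps to the two CIDR blocks, and how an address can be tested against them, here is a small Python sketch using the standard `ipaddress` module. The helper name `in_listed_range` and the test address 70.37.129.5 are made up for the example.

```python
import ipaddress

first = ipaddress.ip_address("70.37.0.0")
last = ipaddress.ip_address("70.37.191.255")

# Collapse the start-end range into the smallest set of CIDR blocks
blocks = list(ipaddress.summarize_address_range(first, last))
print([str(b) for b in blocks])  # ['70.37.0.0/17', '70.37.128.0/18']


def in_listed_range(ip):
    """True if `ip` falls inside any of the blocks listed above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in blocks)


print(in_listed_range("70.37.129.5"))  # True
print(in_listed_range("65.52.0.229"))  # False
```

The second check shows that the 65.52.0.229 address from the opening post lies outside this particular range, so blocking 70.37.x alone would not have caught it.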

lucy24

9:00 am on Jan 9, 2012 (gmt 0)

Technically it's a plain vanilla MSIE 7 install on Vista, highly unlikely, but possible.

As long as we're talking about vanilla:

Is there the smallest iota of possibility that the related form

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

could ever be a human? I've only just become aware of it, because it hangs out in places that are already blocked by IP. But I'm perfectly happy to throw in the UA itself as a backup ;)
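
One way to flag these "vanilla" strings, with no .NET or other extra fields, is a strict pattern match. The Python sketch below is a heuristic only: the function name `is_bare_msie` and the regular expression are assumptions for illustration, and a match is a reason for suspicion rather than proof of a bot, since a clean install is at least possible.

```python
import re

# Exactly three tokens inside the parentheses: compatible, MSIE x.y, Windows NT x.y
BARE_MSIE = re.compile(
    r"^Mozilla/4\.0 \(compatible; MSIE \d+\.\d+; Windows NT \d+\.\d+\)$"
)


def is_bare_msie(user_agent):
    """True if the UA is an MSIE string with no additional fields at all."""
    return BARE_MSIE.match(user_agent) is not None


print(is_bare_msie("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"))
print(is_bare_msie("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
print(is_bare_msie(
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
))
```

The first two calls match the bare form and the third does not, because of its extra fields.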

Staffa

1:41 pm on Jan 9, 2012 (gmt 0)

Lucy, I see this UA mainly used by non-humans from China (blocked, of course).

enigma1

2:12 pm on Jan 9, 2012 (gmt 0)

What I found interesting is the MSIE 7.0 tag mentioned. I was thinking it may have something to do with the recent news, IE6 dropped below 1%? Surely they can influence stats to some degree and make sensational news.

Pfui

3:13 pm on Jan 9, 2012 (gmt 0)

1.) Spotted the OP's bot-as-browser UA post-tweet:

65.52.0.229
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO

The tweeted link was to a plain.txt file denied to all SEs via robots.txt and is not included on MSN's sitemap.xml. The above hit 'browsed' straight to it.

2.) Here's yet another bot-as-browser from .search.msn.com:

msnbot-207-46-12-61.search.msn.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707)

robots.txt? NO

That's just one more in an ever-lengthening list of bots-as-browsers from .search.msn.com Hosts and IPs.

3.) Microsoft's Dynamic Hosting IPs now also spawn the long-familiar (to me, at least) 'No UA, no robots.txt, no referrer, 11-hits-to-same-file' visits: "MSN's Stealth Missions" [webmasterworld.com...]

Pfui

2:53 am on Jan 11, 2012 (gmt 0)

This just came in again, post-Tweet:

65.52.0.229
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO

That's seven sightings of the same IP+UA between Bill's OP and my replies. Bill, were your hits post-tweet? Might this be a Twitter-specific thing?

lucy24

4:19 am on Jan 11, 2012 (gmt 0)

There's an addendum to the robots.txt standard, but it's written in invisible ink-- or set to {display: none} or equivalent. Robots wearing street clothes don't count as robots and therefore don't need to read, let alone obey, robots.txt. My plainclothes MSNbots always pick up two things: the page itself, and its associated piwik.js file in a roboted-out directory. I mean of course "try to pick up the js file".

Food for thought: Since MS is openly waiting for the day MSIE 6 disappears from the face of the planet* does that mean they're forced to dress their robots in MSIE 7 for verisimilitude?

* Or at least that part of it south of the 60th parallel. What the bleepity bleepity do locutions like "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB7.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C)" mean anyway?