MSN's many cloaked bots. - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

MSN's many cloaked bots.

Mass undocumented activity in search.msn.com ranges

Pfui

5:46 pm on Sep 20, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

What with so many ongoing threads about MSN's msnbot-related crawlers and their various (mis)behaviors, I wasn't sure where to put yet another example of a cloaked UA. So here's a new thread containing a bunch of MSN's stealth UAs, including this one I just found prowling around, and in a CGI-related directory that's explicitly denied to all bots six ways to Sunday:

msnbot-65-55-165-15.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)

Are these actually deceptive "cloak detectors"? Hmm. Here are just some of the cloaked UAs mentioned in recent threads:

From: "MSN's cloak-crawling again: Twitter / Tweets [webmasterworld.com]"

70.37.13.98
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

From: "Mozilla/4.0: MSN strikes (out) again. [webmasterworld.com]"

65.55.234.160
Mozilla/4.0

From: "MSN fakes referrers [webmasterworld.com]" (see thread for loads more)

msnbot-65-55-104-70.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)

msnbot-65-55-104-60.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)

Last but not least...

Here's the Official Word on MSNBot: "Bing Webmaster Center Help [help.live.com]". As of this post, "The web crawler used by Bing is also known as MSNBot" -- a.k.a.:

msnbot
msnbot-media
msnbot-newsblogs
msnbot-products

There's nary a hint of the countless cloaked, bot-acting UAs hailing from bare MSN IPs and .search.msn.com. Looks like when it comes to our own sites, we're not supposed to fool them, but it's okay for them to fool us. Tsk.

Pfui

3:00 pm on Oct 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Might as well add another Microsoft mashup to the list. Yep, 65.55. Again. Cloaking. Again.

Note that on the site hit, the ONLY bot-okay files are html. ALL graphics, CSS and JS files AND directories are specifically disallowed in robots.txt. Also, before AND after the following visits, two versions of msnbot asked for, and heeded, robots.txt using different IPs (typical):

msnbot/1.1 (+http://search.msn.com/msnbot.htm)
msnbot/2.0b (+http://search.msn.com/msnbot.htm)

A. Hit as IP only. Bypassed root/home and went for one page where it ONLY took CSS and JS files, no graphics:

65.55.110.184
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)

robots.txt? NO

B. Forty minutes later, here we go again hitting the same page, but this time with rDNS, and using a search UA:

msnbot-65-55-106-184.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)

robots.txt? Yes

I am SO tired of recoding htaccess/mod_rewrite conditions to curb MSN's violations, only to have SCORES of disallowed files appear in MSN/Live/Bing SERPs again and again (and right now, dammit), and after having repeatedly requested by special form for those files to be removed.

jdMorgan

8:25 pm on Oct 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Don't complain too much, or they may ban you... They did me, and I still can't get any straight answers from them. I suggest that it would be wisest to have no contact with them complaining specifically about their handling of your site. I don't know if there is a 'policy' at work here or if I just annoyed one employee, but it doesn't really matter, because if it was just one employee, then that one employee was empowered/allowed to nuke the site in a way that causes the first-line tech support people to be unable to find any problem...

They keep blaming my robots.txt file, which uses multi-user-agent policy records and therefore causes their primitive "robot.txt tester" to fail. But the fact is that their real 'bots have no trouble with it; they crawl where allowed, and generally do not crawl where Disallowed. It's not a crawling problem, it's an indexing problem ("Some results have been removed" message and site not listed in the SERPs for its own name). Unfortunately, whenever I've called them, I've had to spend 30 minutes explaining this every time...

But enough about why "I am SO tired" of them... You can easily put a stop to your specific problem with something like:


RewriteCond %{REMOTE_ADDR}>%{HTTP_USER_AGENT} ^65\.55\.110\.[0-9]+>Mozilla/4\.0\ \(compatible;\ MSIE
RewriteRule ^ - [F]

if you feel that blocking these requests would be desirable and/or wise.

Jim
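For anyone who wants to dry-run that condition against logged hits before deploying it, here is a minimal Python sketch. It assumes the same "IP>User-Agent" test string as the RewriteCond above; the function name would_block is hypothetical and only for illustration.

```python
import re

# Same pattern as the RewriteCond above, applied to "REMOTE_ADDR>USER_AGENT".
# Spaces need no backslash escaping inside a Python raw string.
BLOCK_RE = re.compile(r"^65\.55\.110\.[0-9]+>Mozilla/4\.0 \(compatible; MSIE")


def would_block(remote_addr: str, user_agent: str) -> bool:
    """Return True if the rule above would answer this request with 403."""
    return BLOCK_RE.search(f"{remote_addr}>{user_agent}") is not None
```

Run against the two hits Pfui logged, the first (65.55.110.184 with the MSIE 6.0 UA) matches and the second (the msnbot/2.0b UA) does not. Note the rule only covers 65.55.110.*, so cloaked hits from other MSN ranges would pass.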

dstiles

8:38 pm on Oct 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Got an "anon" hit today from a "real" msnbot IP (and not the first this month) to a contact form specifically disallowed in robots.txt - all my forms are.

IP: 65.55.165.nnn
UA: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)

NOTE: Every space in that UA is a double space so obviously not genuine MSIE.

Of course, the hit was blocked.
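The double-space tell can be checked mechanically. A rough sketch, assuming the raw UA string is available exactly as received (forum and log viewers often collapse repeated spaces); the helper names are made up for illustration:

```python
import re


def space_runs(user_agent: str) -> list:
    """Return the length of every run of consecutive spaces in the UA."""
    return [len(run) for run in re.findall(r" +", user_agent)]


def looks_double_spaced(user_agent: str) -> bool:
    """True if the UA contains spaces and every run is two or more long."""
    runs = space_runs(user_agent)
    return bool(runs) and all(n >= 2 for n in runs)
```

A genuine MSIE string separates tokens with single spaces, so this returns False for it; it also returns False for a UA with no spaces at all, such as a bare "Mozilla/4.0".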

BillyS

1:50 am on Oct 12, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

jdMorgan -

We're in the same camp as you and we're thinking about blocking them altogether. They grab nearly every page from our site every day - that's 1,400 pages per day.

We're on a US based server, with a US based IP address and a dot com website name. For some reason MS believes we're located in a different country.

Pfui

5:44 pm on Oct 19, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

MSN still faking it. Below are hits to the exact same file ~20 seconds apart using two different UAs -- one marked bot, one not:

msnbot-65-55-104-162.search.msn.com
msnbot/1.1 (+http://search.msn.com/msnbot.htm)
10/19 08:36:33 /dirA/filenameB.html

msnbot-65-55-104-67.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
10/19 08:37:10 /dirA/filenameB.html

(I block the latter, to no ill effect. Yet.)

GaryK

4:23 pm on Oct 21, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

My log files from last week were cluttered with this stuff too, but only two unique UAs. It doesn't read robots.txt. I don't know if this is worthy of note or not, but it takes only disallowed files, as if it has robots.txt cached somewhere and is intentionally ignoring it.

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)
65.55.110.*
msnbot-65-55-110-*.search.msn.com

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707)
65.55.110.*
No rDNS

Pfui

7:38 pm on Nov 13, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

70.37.65.66
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO

As mentioned by "thetrasher" here [webmasterworld.com]: Azure, an AWS competitor [webmasterworld.com]

A.k.a.:

NetRange: 70.37.0.0 - 70.37.191.255
CIDR: 70.37.0.0/17, 70.37.128.0/18
NetName: MICROSOFT-DYNAMIC-HOSTING

A.k.a.:

Ugh.
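The NetRange and CIDR lines in the whois excerpt above can be cross-checked with Python's standard ipaddress module. This is only a sketch; the in_range helper is a hypothetical name, and the block list reflects the 2009 whois data quoted in this post, not necessarily current allocations.

```python
import ipaddress

# Start and end of the NetRange quoted above.
first = ipaddress.ip_address("70.37.0.0")
last = ipaddress.ip_address("70.37.191.255")

# Collapse the range into the smallest set of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))


def in_range(ip: str) -> bool:
    """True if the address falls inside any of the summarized blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in blocks)
```

The summary comes out as the same two blocks whois lists, 70.37.0.0/17 and 70.37.128.0/18, and both 70.37.13.98 and 70.37.65.66 from this thread fall inside them.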

incrediBILL

10:19 pm on Nov 14, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

They keep blaming my robots.txt file

The solution to your problem is both simple and elegant, use a dynamic robots.txt file.

When msnbot comes knocking, serve up just a robots.txt file only for msnbot.

Should clear up that indexing problem in no time unless they're just liars.
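A bare-bones sketch of the dynamic robots.txt idea, written as a plain WSGI callable using only the standard library. The rule text and paths are hypothetical placeholders, not anyone's real policy, and matching on the UA string alone is an assumption; it does nothing to verify that the requester really is msnbot.

```python
# Hypothetical rule sets; substitute your own records.
MSNBOT_RULES = "User-agent: msnbot\nDisallow: /cgi-bin/\n"
DEFAULT_RULES = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /forms/\n"


def robots_txt_for(user_agent: str) -> str:
    """Pick the robots.txt body based on the requesting User-Agent."""
    if "msnbot" in user_agent.lower():
        # Single-record file, so a tester that chokes on
        # multi-user-agent records has nothing to trip over.
        return MSNBOT_RULES
    return DEFAULT_RULES


def app(environ, start_response):
    """WSGI entry point meant to be mapped to /robots.txt."""
    body = robots_txt_for(environ.get("HTTP_USER_AGENT", "")).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Because the comparison is case-insensitive, variants such as MSNBOT-MOBILE/1.1 get the msnbot file too.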

Pfui

1:14 am on Nov 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

FWIW... I do that -- serve up a dynamic robots.txt file via CGI -- and msnbot and its kin get their own specific version.

But despite retrieving robots.txt seemingly a gazillion times a week, MSN's 'identified' bots routinely try to go where they're specifically disallowed. (Aside: Their cloaked bots don't request robots.txt at all.) And Bing's SERPs still contain disallowed links/info despite multiple special requests to remove same, despite the info being clearly disallowed.

Well, at least the fake referer thing seems to have died down... (crosses fingers)

smallcompany

3:37 am on Nov 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

MSN's msnbot-related crawlers

Why would MSN do cloaking? Why go to disallowed stuff? Why all that trouble?

Is it that they don't know? Or, what's the benefit?

tangor

5:05 am on Nov 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I think Bing is experiencing (exercising) the same growing pains G did a few years back. Might be unhappy now (instant), but will probably be welcomed next year or the year after that.

Pfui

6:25 pm on Nov 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

@smallcompany: I don't know why, just that they do. Every single day I see the multiple bare IPs (no rDNS) and multiple cloaked UAs referenced in my OP.

@tangor: Age-wise, Bing's not really a noob; it's just the newest iteration/incarnation of MSN's official engines: Live Search, Windows Live Search, and MSN Search. The latter even shares Google's birth year: 1998.

Regardless, when MSN (or any SE) IDs its bots and they read/heed my robots.txt, they're welcome. Alas, years of server logs compel my trust-but-verify POV that MSN's engines/bots will do whatever, wherever, however.

Pfui

9:41 pm on Nov 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

P.S. a.k.a. "45 Minutes in the Life of One Site"

Summary: Out of 13 MSN search-related hosts/hits, 7 (or 54%) were w/ cloaked UAs.

Details, Details:

msnbot-65-55-104-162.search.msn.com
msnbot/1.1 (+http://search.msn.com/msnbot.htm)
12:01:40 - OKAY

msnbot-65-55-207-131.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
12:02:38 - OKAY
12:25:12 - OKAY (robots.txt)
12:26:13 - OKAY

msnbot-65-55-104-53.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
12:12:41 - CLOAKED UA (& file disallowed by robots.txt, .htaccess X-Robots-Tag, & META)
12:12:42 - CLOAKED UA (ditto)

cosmos.cosmosblu.search.live.net
Mozilla/4.0
12:13:18 - CLOAKED UA (but okay because asked for robots.txt)

msnbot-65-55-104-67.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
12:18:44 - CLOAKED UA
12:42:47 - CLOAKED UA (& looked for robots.txt in wrong place: /subdir)

msnbot-65-55-207-22.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
12:34:37 - OKAY

msnbot-65-55-104-60.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
12:41:31 - CLOAKED UA
12:41:36 - CLOAKED UA

msnbot-65-55-106-134.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
12:45:00 - OKAY

(Okay! That's enough procrastinating for me for one day :)
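The tally in the summary above can be reproduced with a few lines of Python. This sketch treats a hit as "cloaked" when the host resolves under an MSN search domain but the UA never says msnbot. The UA strings are shortened here for readability, the hit counts are taken from the listing in this post, and the function names are hypothetical.

```python
MSN_SUFFIXES = (".search.msn.com", ".search.live.net")


def classify(host: str, user_agent: str) -> str:
    """Label a hit as 'identified', 'cloaked', or 'other'."""
    if not host.lower().endswith(MSN_SUFFIXES):
        return "other"
    return "identified" if "msnbot" in user_agent.lower() else "cloaked"


# (host, shortened UA, number of hits) from the 45-minute listing above.
HITS = [
    ("msnbot-65-55-104-162.search.msn.com", "msnbot/1.1", 1),
    ("msnbot-65-55-207-131.search.msn.com", "msnbot/2.0b", 3),
    ("msnbot-65-55-104-53.search.msn.com", "Mozilla/4.0 (compatible; MSIE 6.0)", 2),
    ("cosmos.cosmosblu.search.live.net", "Mozilla/4.0", 1),
    ("msnbot-65-55-104-67.search.msn.com", "Mozilla/4.0 (compatible; MSIE 6.0)", 2),
    ("msnbot-65-55-207-22.search.msn.com", "msnbot/2.0b", 1),
    ("msnbot-65-55-104-60.search.msn.com", "Mozilla/4.0 (compatible; MSIE 6.0)", 2),
    ("msnbot-65-55-106-134.search.msn.com", "msnbot/2.0b", 1),
]


def cloaked_share(hits):
    """Return (cloaked hit count, total hit count)."""
    total = sum(n for _, _, n in hits)
    cloaked = sum(n for host, ua, n in hits if classify(host, ua) == "cloaked")
    return cloaked, total
```

This only looks at the logged hostname, so bare-IP hits with no rDNS would land in "other" and need a separate IP-range check.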

Pfui

8:29 pm on Nov 19, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

MSN is really rattling my cage today. The times shown are hits to different dynamic files, ALL of which are off-limits to all bots and have been for years. In fact, there were no robots.txt-allowed hits from MSN at all, just not-okay URIs. Also, all hits were ostensibly made using:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)

msnbot-65-55-110-219.search.msn.com
09:17:37

msnbot-65-55-110-63.search.msn.com
10:37:05

65.55.109.148
11:22:14
11:22:15

msnbot-65-55-109-35.search.msn.com
11:38:20

Zero requests for robots.txt -- not that MSN bothers to heed it nowadays.

I'm *this* close to finally rewriting everything MSN to home, only allowing that page and robots.txt, regardless of UA. (Currently only msnbot-related UAs from confirmed MSN servers are allowed to go further.) Bing's results are full of the site's do-not-hit/index/cache/follow URLs anyway. And the 'new, improved' Webmaster tools are (still) abysmal. (sighs)

dstiles

10:39 pm on Nov 19, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Couple of days ago I went through my rather generous list of MS IPs checking for bot/nobot. Those that came up nobot are now vulnerable to blocking if the activity is at all suspect.

One day I'll find time to discover how to check rDNS using ASP. :(
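Not ASP, but the usual rDNS check (reverse lookup, then a forward lookup to confirm the name maps back to the same IP) can be sketched in Python with the standard socket module. The function name and the default suffix are assumptions for illustration; the resolver functions are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket


def verified_msn_host(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex,
                      suffix=".search.msn.com"):
    """Return the hostname if ip passes forward-confirmed rDNS, else None."""
    try:
        host = reverse(ip)[0]          # PTR lookup: IP -> hostname
    except OSError:
        return None                    # no rDNS at all (a "bare" IP)
    if not host.lower().endswith(suffix):
        return None                    # resolves, but not under the suffix
    try:
        addresses = forward(host)[2]   # A lookup: hostname -> list of IPs
    except OSError:
        return None
    # Only trust the name if it maps back to the address that hit us.
    return host if ip in addresses else None
```

The forward step matters because whoever controls an IP block controls its PTR records and could publish any name there. Passing this check says nothing about the UA, though: the cloaked hits in this thread come from hosts that would verify fine.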

Pfui

10:49 pm on Nov 19, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I wish I could check rIP with Perl:)

keyplyr

8:57 pm on Nov 25, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Shortly after M$ decided to "seriously" gain market share in the search theater with Bing, msnbot et al, they started requesting most every file residing on my server, some 2x or 3x each and every day. Up until this point, Y! was the only one ignoring my file expiration settings.

MSN just does anything they want: cloaking, wget, UA spoofing, research UAs, no UA, various HTTP versions, referer spoofing, ad infinitum.

They're playing hard-ball w/ Google and we're the ball.

dstiles

10:49 pm on Nov 25, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

It may have just been me looking at very old DNS info but in checking out MS IP ranges I've found a whole load more than before that have correct msnbot rDNS.

I tried using robtex /24 to check ranges but a lot of them didn't appear, presumably because it wasn't hitting authoritative servers.

Does anyone have a good dig command that retrieves rDNS (only!) for whole /24 blocks? Linux novice as far as a lot of it goes, especially dig.
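Not a dig one-liner, but as a rough alternative, a few lines of Python can walk a whole /24 and collect the rDNS name for each address. This is a sketch under assumptions: ptr_names is a made-up helper name, and the lookups go through whatever resolver the system is configured with, which may or may not be querying the authoritative servers.

```python
import ipaddress
import socket


def ptr_names(cidr, reverse=socket.gethostbyaddr):
    """Map every address in the block to its rDNS name, or None if absent."""
    results = {}
    for ip in ipaddress.ip_network(cidr):
        try:
            results[str(ip)] = reverse(str(ip))[0]
        except OSError:
            results[str(ip)] = None   # no PTR record or lookup failed
    return results


if __name__ == "__main__":
    # Example block taken from this thread; 256 sequential lookups can be slow.
    for ip, name in ptr_names("65.55.110.0/24").items():
        print(ip, name or "(no rDNS)")
```

Lookups run one at a time, so a block full of unresolvable addresses may take a while to finish.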

Pfui

5:32 am on Nov 26, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

@keyplyr, et al: Too bad we're Jai Alai, not Smurf balls:)

@All: Here's one more BIG set of MSN domains that's suddenly heavy-hitting. Here's a very limited listing of players:

1.) UA? NO. robots.txt? NO --

tide16.microsoft.com [205.248.102.81]
tide501.microsoft.com [131.107.0.71]
tide531.microsoft.com [131.107.0.101]
tide536.microsoft.com [131.107.0.106]
(etc.,etc.,etc.)

2.) UA? Yes. robots.txt? NO --

tide613.microsoft.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)

3.) UA? Yes. robots.txt? Yes --

tide533.microsoft.com
Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)

-----
The "tide" servers used to be known as MSN employee-only, and I'd see one of them maybe, oh, once a month. Then this week, wham! Scores of them all over us like any other MSN spawn. Here's an example of a tide hitting within ~90 seconds of msnbot. Coincidence? Nah.

msnbot-65-55-110-221.search.msn.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
11/23 07:10:54 /very-specific-filename.html

tide504.microsoft.com
(no UA)
11/23 07:12:34 /very-specific-filename.html

-----
Tides also joined the ranks of Twitter fellow travelers.

Pfui

4:50 am on Nov 27, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

We've thought MSN's bots exchanged robots.txt instructions amongst themselves. No longer. Cloaked MSN Hosts/UAs are ignoring robots.txt and directly accessing Disallowed directories and file types, e.g., PDFs:

65.55.110.210
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
11/26 17:10:30 /dir-Allowed

msnbot-65-55-108-185.search.msn.com
Mozilla/4.0
11/26 17:13:20 /robots.txt

msnbot-65-55-110-217.search.msn.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
11/26 17:32:46 /dir-Disallowed/filetype-Disallowed

(Those three were cloaked one way or another.)

msnbot-65-55-207-94.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
11/26 18:35:47 /robots.txt

(In terms of correct UA and/or Host ID and conduct, that last one was the only well-behaved bot.)

Pfui

7:00 pm on Dec 28, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

This isn't cloaked per se, but the UA is a new (to me) variation on a same-version, formerly mixed-case [webmasterworld.com] theme. Emphasis mine:

msnbot-65-55-4-150.search.msn.com
T-Mobile Dash Mozilla/4.0 (compatible; MSIE 4.01; Windows CE; Smartphone; 320x240; MSNBOT-MOBILE/1.1; +http://search.msn.com/msnbot.htm)

robots.txt? Yes

Pfui

9:17 am on Feb 12, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Wow. Look who/what ran amok tonight on one site for even longer than the following three-hour period. So much for rDNS ID. (Hyphens added for layout; no IPs obfuscated.)

207.46.199.44 - - [11/Feb/2010:20:31:23]
207.46.204.231 - [11/Feb/2010:20:53:02]
207.46.195.240 - [11/Feb/2010:21:03:51]
207.46.204.197 - [11/Feb/2010:21:21:40]
207.46.204.179 - [11/Feb/2010:21:40:52]
207.46.204.227 - [11/Feb/2010:21:42:32]
207.46.195.234 - [11/Feb/2010:21:44:51]
207.46.204.185 - [11/Feb/2010:21:48:39]
207.46.204.189 - [11/Feb/2010:22:15:35]
207.46.204.239 - [11/Feb/2010:22:22:55]
207.46.204.183 - [11/Feb/2010:22:31:06]
207.46.199.49 - - [11/Feb/2010:22:45:18]
207.46.204.195 - [11/Feb/2010:23:10:00]
207.46.199.49 - - [11/Feb/2010:23:23:52]
207.46.204.238 - [11/Feb/2010:23:52:20]

UA: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
robots.txt? Yes

tangor

9:25 am on Feb 12, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I like Bing. At this moment. Not going to worry about it. Willing to grant access as I'm getting a lot of benefit. 12 hits is not that much.

Pfui

1:04 am on Feb 13, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Even if Bing was a better mousetrap (which I don't think it is), 15 robots.txt hits in ~3 hours from any major SE, let alone MSN using essentially cloaked IPs, is totally unnecessary.

Note, too, that the list I provided does not include concurrent .search.msn.com-hosted hits.

Additionally, legit msnbot hits for the past 12 days total 2,917 or almost five times the hits from SE bot runner-up G's total.

(Aside: Considering how often msnbot hits sites, it's amazing Bing's results are significantly more limited than G's, and even with more junk in their SERPs -- they've NEVER responded to multiple, special form-completed, removal requests; and their Webmaster Tools are abysmal.)

Pfui

4:13 am on Mar 2, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Now, no UA at all. And no robots.txt --

70.37.164.92
-

03/01 19:42:56 /dirA/fileA.html
03/01 19:43:06 /dirA/fileA.html
03/01 19:43:17 /dirA/fileA.html
03/01 19:43:28 /dirA/fileA.html

Ironically, had it/they used an msnbot-legit UA and asked for robots.txt, the file they 403'd on x4 would've been A-OK the first time.

dstiles

9:47 pm on Mar 2, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Is this the MS version of AWS?

70.37.0.0 - 70.37.191.255
MICROSOFT-DYNAMIC-HOSTING

keyplyr

11:28 pm on Mar 2, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Yup, that's Microsoft Azure
70.37.0.0/17

dstiles

3:59 am on Mar 3, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

What about 70.37.128.0 - 70.37.191.255 - is that the same or some other purpose?

keyplyr

7:05 pm on Mar 3, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Not sure what this range encompasses.

Microsoft Online Services
70.37.128.0/23

This 42 message thread spans 2 pages.