msnbot-65-55-165-15.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)
Are these actually deceptive "cloak detectors"? Hmm. Here are just some of the cloaked UAs mentioned in recent threads:
From: "MSN's cloak-crawling again: Twitter / Tweets [webmasterworld.com]"
70.37.13.98
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
From: "Mozilla/4.0: MSN strikes (out) again. [webmasterworld.com]"
65.55.234.160
Mozilla/4.0
From: "MSN fakes referrers [webmasterworld.com]" (see thread for loads more)
msnbot-65-55-104-70.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)
msnbot-65-55-104-60.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
Last but not least...
Here's the Official Word on MSNBot: "Bing Webmaster Center Help [help.live.com]". As of this post, "The web crawler used by Bing is also known as MSNBot" -- a.k.a.:
msnbot
msnbot-media
msnbot-newsblogs
msnbot-products
There's nary a hint of the countless cloaked, bot-acting UAs hailing from bare MSN IPs and .search.msn.com. Looks like when it comes to our own sites, we're not supposed to fool them, but it's okay for them to fool us. Tsk.
Note that on the site hit, the ONLY bot-okay files are html. ALL graphics, CSS and JS files AND directories are specifically disallowed in robots.txt. Also, before AND after the following visits, two versions of msnbot asked for, and heeded, robots.txt using different IPs (typical):
msnbot/1.1 (+http://search.msn.com/msnbot.htm)
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
A. Hit as IP only. Bypassed root/home and went for one page where it ONLY took CSS and JS files, no graphics:
65.55.110.184
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
robots.txt? NO
B. Forty minutes later, here we go again hitting the same page, but this time with rDNS, and using a search UA:
msnbot-65-55-106-184.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
robots.txt? Yes
I am SO tired of recoding htaccess/mod_rewrite conditions to curb MSN's violations, only to have SCORES of disallowed files appear in MSN/Live/Bing SERPs again and again (and right now, dammit), even after repeatedly requesting via their special form that those files be removed.
They keep blaming my robots.txt file, which uses multi-user-agent policy records and therefore causes their primitive "robot.txt tester" to fail. But the fact is that their real 'bots have no trouble with it: they crawl where allowed, and generally do not crawl where Disallowed. It's not a crawling problem, it's an indexing problem ("Some results have been removed" message and site not listed in the SERPs for its own name). Unfortunately, every time I've called them, I've had to spend 30 minutes explaining this...
But enough about why "I am SO tired" of them... You can easily put a stop to your specific problem with something like:
RewriteCond %{REMOTE_ADDR}>%{HTTP_USER_AGENT} ^65\.55\.110\.[0-9]+>Mozilla/4\.0\ \(compatible;\ MSIE
RewriteRule ^ - [F]
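To sanity-check that pattern against logged hits before deploying it, here's a rough Python translation. It's only a sketch: `would_block` is a made-up helper name, the sample UAs in the usage note are shortened for illustration, and it assumes Python's regex engine treats a pattern this simple the same way mod_rewrite's does.

```python
import re

# Python translation of the RewriteCond pattern above. In .htaccess the
# spaces must be backslash-escaped; in a Python regex a plain space works.
BLOCK_PATTERN = re.compile(
    r"^65\.55\.110\.[0-9]+>Mozilla/4\.0 \(compatible; MSIE"
)


def would_block(remote_addr: str, user_agent: str) -> bool:
    """Mimic the %{REMOTE_ADDR}>%{HTTP_USER_AGENT} test string (hypothetical helper)."""
    test_string = f"{remote_addr}>{user_agent}"
    return BLOCK_PATTERN.search(test_string) is not None
```

Feeding it hit A's IP with an MSIE-style UA (e.g. `would_block("65.55.110.184", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)")`) should come back True, while hit B's msnbot/2.0b UA should come back False, so the declared bot is left alone.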
Jim
IP: 65.55.165.nnn
UA: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
NOTE: Every space in that UA is a double space so obviously not genuine MSIE.
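For anyone who wants to flag that doubled-space tell automatically when scanning logs, a minimal sketch in Python follows. The function name is made up for illustration, and it assumes your log keeps the UA exactly as sent (some viewers, and this forum, collapse repeated spaces).

```python
import re


def has_doubled_spaces(user_agent: str) -> bool:
    """Return True if the UA contains a run of two or more spaces."""
    return re.search(r" {2,}", user_agent) is not None
```

A genuine MSIE string uses single spaces throughout, so a True result is a cheap hint that the UA was assembled by a script rather than sent by a browser. It's a hint only, not proof.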
Of course, the hit was blocked.
We're in the same camp as you and we're thinking about blocking them altogether. They grab nearly every page from our site every day - that's 1,400 pages per day.
We're on a US based server, with a US based IP address and a dot com website name. For some reason MS believes we're located in a different country.
msnbot-65-55-104-162.search.msn.com
msnbot/1.1 (+http://search.msn.com/msnbot.htm)
10/19 08:36:33 /dirA/filenameB.html
msnbot-65-55-104-67.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
10/19 08:37:10 /dirA/filenameB.html
(I block the latter, to no ill effect. Yet.)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)
65.55.110.*
msnbot-65-55-110-*.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707)
65.55.110.*
No rDNS
robots.txt? NO
As mentioned by "thetrasher" here [webmasterworld.com]: Azure, an AWS competitor [webmasterworld.com]
A.k.a.:
NetRange: 70.37.0.0 - 70.37.191.255
CIDR: 70.37.0.0/17, 70.37.128.0/18
NetName: MICROSOFT-DYNAMIC-HOSTING
A.k.a.:
Ugh.
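If you'd rather test an IP against that NetRange than eyeball it, Python's standard `ipaddress` module can derive the CIDR blocks from the start and end addresses and do the membership check. This is a sketch; `in_range` is a hypothetical helper name.

```python
import ipaddress

# Start and end of the NetRange quoted above.
first = ipaddress.ip_address("70.37.0.0")
last = ipaddress.ip_address("70.37.191.255")

# Collapse the range into the smallest set of CIDR blocks covering it.
blocks = list(ipaddress.summarize_address_range(first, last))


def in_range(ip: str) -> bool:
    """Return True if ip falls inside any block of the NetRange (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocks)
```

The derived blocks match the two CIDR entries in the whois output, and the 70.37.13.98 address from the Twitter thread quoted earlier falls inside the first one.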
But despite retrieving robots.txt seemingly a gazillion times a week, MSN's 'identified' bots routinely try to go where they're specifically disallowed. (Aside: Their cloaked bots don't request robots.txt at all.) And Bing's SERPs still contain disallowed links/info despite multiple special requests to remove same, despite the info being clearly disallowed.
Well, at least the fake referer thing seems to have died down... (crosses fingers)
@tangor: Age-wise, Bing's not really a noob; it's just the newest iteration/incarnation of MSN's official engines: Live Search, Windows Live Search, and MSN Search. The latter even shares Google's birth year: 1998.
Regardless, when MSN (or any SE) IDs its bots and they read/heed my robots.txt, they're welcome. Alas, years of server logs compel my trust-but-verify POV that MSN's engines/bots will do whatever, wherever, however.
Summary: Out of 13 MSN search-related hosts/hits, 7 (or 54%) were w/ cloaked UAs.
Details, Details:
msnbot-65-55-104-162.search.msn.com
msnbot/1.1 (+http://search.msn.com/msnbot.htm)
12:01:40 - OKAY
msnbot-65-55-207-131.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
12:02:38 - OKAY
12:25:12 - OKAY (robots.txt)
12:26:13 - OKAY
msnbot-65-55-104-53.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
12:12:41 - CLOAKED UA (& file disallowed by robots.txt, .htaccess X-Robots-Tag, & META)
12:12:42 - CLOAKED UA (ditto)
cosmos.cosmosblu.search.live.net
Mozilla/4.0
12:13:18 - CLOAKED UA (but okay because asked for robots.txt)
msnbot-65-55-104-67.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
12:18:44 - CLOAKED UA
12:42:47 - CLOAKED UA (& looked for robots.txt in wrong place: /subdir)
msnbot-65-55-207-22.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
12:34:37 - OKAY
msnbot-65-55-104-60.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
12:41:31 - CLOAKED UA
12:41:36 - CLOAKED UA
msnbot-65-55-106-134.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
12:45:00 - OKAY
(Okay! That's enough procrastinating for me for one day :)
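For anyone checking the 54% figure, here's the tally redone in Python from the log lines above. It's a sketch: the host labels are shortened by hand, and each tuple is one logged hit, so the 13 counts hits rather than distinct hosts.

```python
from collections import Counter

# One (host, status) tuple per logged hit, transcribed from the details above.
hits = [
    ("msnbot-65-55-104-162", "OKAY"),
    ("msnbot-65-55-207-131", "OKAY"),
    ("msnbot-65-55-207-131", "OKAY"),   # the robots.txt request
    ("msnbot-65-55-207-131", "OKAY"),
    ("msnbot-65-55-104-53", "CLOAKED"),
    ("msnbot-65-55-104-53", "CLOAKED"),
    ("cosmos.cosmosblu", "CLOAKED"),
    ("msnbot-65-55-104-67", "CLOAKED"),
    ("msnbot-65-55-104-67", "CLOAKED"),
    ("msnbot-65-55-207-22", "OKAY"),
    ("msnbot-65-55-104-60", "CLOAKED"),
    ("msnbot-65-55-104-60", "CLOAKED"),
    ("msnbot-65-55-106-134", "OKAY"),
]

counts = Counter(status for _, status in hits)
total_hits = len(hits)
distinct_hosts = len({host for host, _ in hits})
cloaked_share = counts["CLOAKED"] / total_hits

print(f"{counts['CLOAKED']} of {total_hits} hits cloaked "
      f"({cloaked_share:.0%}), from {distinct_hosts} distinct hosts")
```

That works out to 7 cloaked hits out of 13, which rounds to 54%, spread over 8 distinct hosts.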
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
msnbot-65-55-110-219.search.msn.com
09:17:37
msnbot-65-55-110-63.search.msn.com
10:37:05
65.55.109.148
11:22:14
11:22:15
msnbot-65-55-109-35.search.msn.com
11:38:20
Zero requests for robots.txt -- not that MSN bothers to heed it nowadays.
I'm *this* close to finally rewriting everything MSN to home, only allowing that page and robots.txt, regardless of UA. (Currently only msnbot-related UAs from confirmed MSN servers are allowed to go further.) Bing's results are full of the site's do-not-hit/index/cache/follow URLs anyway. And the 'new, improved' Webmaster tools are (still) abysmal. (sighs)
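For reference, "confirmed MSN servers" can be checked with a forward-confirmed reverse DNS lookup: reverse-resolve the IP, check the hostname suffix, then resolve that hostname forward and make sure it maps back to the same IP. Below is a minimal Python sketch of that idea, not the actual setup described above. The function name, the `reverse`/`forward` parameters, and the choice of `.search.msn.com` as the only trusted suffix are all assumptions for illustration.

```python
import socket

# Suffix seen on the rDNS names in this thread; treat it as an assumption.
TRUSTED_SUFFIX = ".search.msn.com"


def is_confirmed_msn(ip, reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for one IPv4 address.

    `reverse` and `forward` are injectable so the logic can be tested
    without touching the network.
    """
    try:
        hostname = reverse(ip)[0]          # PTR lookup
    except OSError:
        return False                       # no rDNS at all
    if not hostname.lower().endswith(TRUSTED_SUFFIX):
        return False                       # rDNS exists but is not MSN's
    try:
        addresses = forward(hostname)[2]   # A lookup on the claimed name
    except OSError:
        return False
    return ip in addresses                 # must map back to the same IP
```

The forward step matters because anyone who controls the reverse zone for an IP can set its PTR record to any name they like. Bare-IP hits with no rDNS, like hit A earlier, fail at the first step.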
MSN just does anything they want: cloaking, wget, UA spoofing, research UAs, no UA, various HTTP versions, referer spoofing, ad infinitum.
They're playing hard-ball w/ Google and we're the ball.
I tried using robtex /24 to check ranges but a lot of them didn't appear, presumably because it wasn't hitting authoritative servers.
Does anyone have a good dig command that retrieves rDNS (only!) for whole /24 blocks? I'm a Linux novice in a lot of areas, especially dig.
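Not a dig one-liner, but one way to do the same sweep is a short Python script that walks the block and asks the system resolver for each PTR record. This is a sketch: `ptr_sweep` and its `reverse` parameter are made-up names, and the results depend on whatever resolver your machine uses, which may or may not be querying authoritative servers. (The per-address dig equivalent is `dig -x <ip> +short`, looped over the block.)

```python
import ipaddress
import socket


def ptr_sweep(cidr, reverse=socket.gethostbyaddr):
    """Map every address in `cidr` to its rDNS name, or None if it has none.

    `reverse` is injectable so the loop can be tested offline.
    """
    results = {}
    for addr in ipaddress.ip_network(cidr):   # all 256 addresses of a /24
        ip = str(addr)
        try:
            results[ip] = reverse(ip)[0]      # PTR name only
        except OSError:
            results[ip] = None                # no rDNS for this address
    return results


if __name__ == "__main__":
    for ip, name in ptr_sweep("65.55.110.0/24").items():
        if name:
            print(ip, name)
```

Lookups run one at a time here, so a /24 with many unanswered addresses can take a while to finish.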
@All: Here's one more BIG set of MSN domains that's suddenly heavy-hitting. Here's a very limited listing of players:
1.) UA? NO. robots.txt? NO --
tide16.microsoft.com [205.248.102.81]
tide501.microsoft.com [131.107.0.71]
tide531.microsoft.com [131.107.0.101]
tide536.microsoft.com [131.107.0.106]
(etc., etc., etc.)
2.) UA? Yes. robots.txt? NO --
tide613.microsoft.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)
3.) UA? Yes. robots.txt? Yes --
tide533.microsoft.com
Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)
-----
The "tide" servers used to be known as MSN employee-only, and I'd see one of them maybe, oh, once a month. Then this week, wham! Scores of them all over us like any other MSN spawn. Here's an example of a tide hitting within ~90 seconds of msnbot. Coincidence? Nah.
msnbot-65-55-110-221.search.msn.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
11/23 07:10:54 /very-specific-filename.html
tide504.microsoft.com
(no UA)
11/23 07:12:34 /very-specific-filename.html
-----
Tides also joined the ranks of Twitter fellow travelers.
65.55.110.210
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
11/26 17:10:30 /dir-Allowed
msnbot-65-55-108-185.search.msn.com
Mozilla/4.0
11/26 17:13:20 /robots.txt
msnbot-65-55-110-217.search.msn.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
11/26 17:32:46 /dir-Disallowed/filetype-Disallowed
(Those three were cloaked one way or another.)
msnbot-65-55-207-94.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
11/26 18:35:47 /robots.txt
(In terms of correct UA and/or Host ID and conduct, that last one was the only well-behaved bot.)
msnbot-65-55-4-150.search.msn.com
T-Mobile Dash Mozilla/4.0 (compatible; MSIE 4.01; Windows CE; Smartphone; 320x240; MSNBOT-MOBILE/1.1; +http://search.msn.com/msnbot.htm)
robots.txt? Yes