msnbot/2.0b

New crawler from MS in the works?

GaryK

5:04 pm on Feb 1, 2009 (gmt 0)

msnbot/2.0b ( [search.msn.com...]
131.107.0.95
tide525.microsoft.com

Read robots.txt and then left, so I don't know if it actually obeys it or not.

Is this really a new msnbot in beta?

wilderness

9:13 pm on Feb 1, 2009 (gmt 0)

Gary,
In the event that MS has their entire bot services coming from that Class B, then I'll simply be missing the benefits of MS.

That IP range has a history as bad as or worse than many of the major colos that many of us have denied.

Don

wilderness

12:24 am on Feb 2, 2009 (gmt 0)

Coincidence?

131.107.151.157 - - [01/Feb/2009:19:46:41 +0000] "GET / HTTP/1.0" 403 998 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
131.107.151.157 - - [01/Feb/2009:19:49:40 +0000] "GET / HTTP/1.0" 403 998 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

GaryK

4:12 am on Feb 2, 2009 (gmt 0)

That's good enough for me, Don. Thanks.

Pfui

6:03 pm on Feb 4, 2009 (gmt 0)

1.) Before I blocked it by version, msnbot/2.0b properly asked for robots.txt (where msnbot is allowed in many dirs) but then only requested a single file multiple times in numerous sessions yesterday:

filename.html/t/t/t/

No one/nothing else has ever looked for any file with that screwy 'suffix' (and Goo catches everyone's wonky links-to).

2.) I'm seeing different hosts than the one mentioned in the OP, and with a bare UA:

msnbot-65-55-115-175.msn.com
msnbot/2.0b

msnbot-65-55-115-151.msn.com
msnbot/2.0b

[edited by: Pfui at 6:09 pm (utc) on Feb. 4, 2009]

caribguy

6:11 pm on Feb 4, 2009 (gmt 0)

Apparently /t/t/t/ is the signature of a successfully 'sploited server...

[webmasterworld.com...]

GaryK

6:30 pm on Feb 4, 2009 (gmt 0)

For the sake of accuracy, that thread mentioned \t\t\t\t as the sign of an exploited server. Not sure if \t\t\t suggests the same thing. :)

caribguy

6:54 pm on Feb 4, 2009 (gmt 0)

Probably \t then...

Gary, I see that ip with

www.example.com 131.107.0.95 - - [28/Jan/2009:18:26:45 -0600] "GET /tools/widgets HTTP/1.1" 200 42598 "http://search.live.com/results.aspx?q=air+fare" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"

There's no way I'd rank on those search terms.

Also saw the ip/ua wilderness mentioned.

jdMorgan

9:24 pm on Feb 4, 2009 (gmt 0)

Some time ago, MSNdude posted here and asked us (Webmasters) not to block the "tide" hostnames at MSN. I suspect that these hosts are used for search-results and site-quality checking (and probably cloaking detection in some cases) -- at least, that's the impression I was left with at the time. That would explain the "air+fare" referrer mentioned in the previous post.

Those "bogus search queries" from MSN are what started the hubbub here, as a matter of fact.

I allow msnbots from that IP address range, as long as all aspects of their requests comport with past msnbot behavior.

YMMV,
Jim

GaryK

10:35 pm on Feb 4, 2009 (gmt 0)

Thanks, Jim. In this case though I have to wonder if the user agent was legit or not. Seems to me we'd hear if MSN was beta testing a new bot. Then again, maybe not. I'll keep an eye on it and see how it behaves during future visits.

jdMorgan

3:15 pm on Feb 8, 2009 (gmt 0)

OK, to add to the confusion, some bright boy at MS has apparently modified msnbot/2.0b to send different HTTP headers than previous versions.

It appears that the Accept header has changed to "*/*" and the historically-included Accept-Encoding and From headers are now omitted.

I should note that rDNS was valid on the requests I'm basing these observations on. A typical hostname was msnbot-65-55-106-139.search.msn.com

Jim
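For anyone who wants to automate the rDNS check mentioned above, here is a minimal Python sketch of a forward-confirmed reverse DNS test: look up the hostname for the IP, check that it ends in a trusted suffix, then resolve that hostname forward and confirm it maps back to the same IP. The function name, the injectable resolver arguments and the suffix list are assumptions made for illustration (hostnames ending in plain .msn.com also appear in this thread), so adjust them to whatever you see in your own logs.

```python
import socket

# Assumption: suffix taken from the hostname quoted above; not an official list.
TRUSTED_SUFFIXES = (".search.msn.com",)


def verify_crawler_ip(ip, reverse=None, forward=None, suffixes=TRUSTED_SUFFIXES):
    """Return True only if ip passes a forward-confirmed reverse DNS check.

    reverse and forward are optional resolver callables (hypothetical hooks
    so the logic can be exercised without touching the network).
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    # Step 1: reverse lookup (IP -> hostname).
    try:
        hostname = reverse(ip)
    except OSError:
        return False

    # Step 2: the hostname must sit under a trusted domain suffix.
    if not hostname.lower().endswith(tuple(suffixes)):
        return False

    # Step 3: forward lookup (hostname -> IPs) must include the original IP.
    try:
        addresses = forward(hostname)
    except OSError:
        return False
    return ip in addresses
```

Calling verify_crawler_ip("65.55.106.139") with no extra arguments performs live DNS lookups. The reverse lookup alone is not enough, since whoever controls the PTR record for an IP can make it say anything; the forward confirmation is what ties the name back to the address.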

GaryK

9:30 pm on Feb 8, 2009 (gmt 0)

Despite being asked to trust "tide" I remain unconvinced this is a legit msnbot. It was back again over the last few days. Not only did it fail to obey robots.txt despite reading it, it also fell into a bad bot trap and got itself banned. So until I see something official from MSN, and until it learns some manners, I'll be suggesting this bot is spoofed and the "tide" net ranges are insecure.

GaryK

10:37 pm on Feb 8, 2009 (gmt 0)

Sorry for the double-post. It was too late to edit my previous reply. I'm back home now and scanning my log file analysis.

Is this another reason not to assume that anything from tide*.microsoft.com is automatically a legit bot/crawler?

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET CLR 3.0.30618; .NET CLR 3.5.30428; InfoPath.2; MS-RTC LM 8; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; SLCC1; WWTClient2; SPC 3.1 P1 Tc)
131.107.0.106
tide536.microsoft.com

The pattern of files taken is that of a human, not a bot:

/main.css
/index.asp
/page-background.gif
/favicon.ico

And then a very similar user agent:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; GTB5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; Tablet PC 2.0; .NET CLR 1.1.4322)
122.38.247.3
No rDNS but belongs to Xpeed in Seoul, Korea.

PS: msnbot/2.0b+(+http://search.msn.com/msnbot.htm) returned yesterday and took robots.txt before leaving.

[edited by: GaryK at 10:50 pm (utc) on Feb. 8, 2009]

caribguy

11:28 pm on Feb 9, 2009 (gmt 0)

File under more Mickeysoft weirdness:

207.46.92.nn from 3 ip's looking for images and getting 403'd for the UA: "LWP::Simple/5.814" of all things :)

NetName: MICROSOFT-GLOBAL-NET
CIDR: 207.46.0.0/16

Pfui

8:19 am on Feb 10, 2009 (gmt 0)

I don't know if this is related to the v2.0b bot or not but --

If you restrict or wrangle MSN's bots in any way and it's been a while since you used Live.com's Webmaster Tools, you might want to log in to their Webmasters section ASAP because you might discover your instructions are being ignored.

About 45 minutes ago, after seeing a search referer that should never-ever have been a referer, I investigated and found cached pages galore -- when ALL caching is blocked by per-page META -- plus 144 links to a directory where ALL pages and dynamic files are blocked by per-page/post META AND robots.txt, by directory and file type. In a word: Dammit.

The removal process is a far, far cry from Google's handy tool: "In the future, we may provide an automated tool for these requests...". Currently, you have to fill out a form and include X, Y, and Z bits of info. Then you're given a Support Ticket Number and:

"Once we have received your request, we will process the request to remove the URL within 48 hours of the request being accepted."

One can but hope they (re)start heeding the instructions they tell us to give them, and keep heeding same until we instruct otherwise. If not, I see no reason to allow any version of msnbot and its ilk to access my sites because the time and trouble I spend tending to/after their bots simply ain't worth the slim traffic.

GaryK

8:25 pm on Feb 10, 2009 (gmt 0)

What's the point in disallowing files if they're gonna get included anyway? I expect that from some bots, but not the big three.

Samizdata

9:41 pm on Feb 10, 2009 (gmt 0)

What's the point in disallowing files if they're gonna get included anyway?

Most botmasters really hate to be told "no" - it's pathological.

At the lower end of the market you see it in outright disobedience.

The middle ground is occupied by those who come up with all manner of excuses as to why robots.txt restrictions do not apply to them, or who interpret the exclusion protocol to suit themselves (as in "we will always take the home page because that is not crawling").

In my experience, Google is top of the range for compliance - but even their bots' (general) adherence to the rules is really only for public relations, and they will still use automated processes with other UAs to fetch files that are disallowed, not least because if they didn't then gaming their algorithm would be too easy.

Like many things in life, it's a mixture of charade and farce.

In a Utopian cyberspace there would be an enforceable robots exclusion protocol.

But we operate in a jungle, and must accept the realities.

...

dstiles

11:22 pm on Feb 10, 2009 (gmt 0)

It's reasonably simple to detect the big three. If one of them decides to disobey robots.txt then the simple answer is to terminate the page before anything is displayed and return an error code of choice. Is there a code that says, "*!%$* off MS"? :)
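A rough Python sketch of that idea, using the standard library's urllib.robotparser: if the User-Agent claims to be one of the big crawlers and the requested path is disallowed for it in robots.txt, answer with an error status instead of the page. The token list, the function names and the choice of 403 are assumptions for illustration, not anything the search engines document.

```python
from urllib.robotparser import RobotFileParser

# Assumption: substrings that identify the "big three" crawlers of the day.
CRAWLER_TOKENS = ("msnbot", "googlebot", "slurp")


def crawler_token(user_agent):
    """Return the crawler token found in the User-Agent, or None."""
    ua = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token in ua:
            return token
    return None


def status_for(user_agent, path, robots_lines):
    """Return 403 if a known crawler requests a path robots.txt disallows."""
    token = crawler_token(user_agent)
    if token is None:
        return 200  # not claiming to be a big crawler; handle as usual

    parser = RobotFileParser()
    parser.parse(robots_lines)  # robots_lines: the robots.txt file, line by line
    # Pass the bare token so the matching User-agent section is applied.
    return 200 if parser.can_fetch(token, path) else 403
```

For example, with a robots.txt containing "User-agent: msnbot" and "Disallow: /private/", a request from "msnbot/2.0b" for /private/page.html would get 403 while /index.html would get 200. This only checks what the User-Agent claims; it says nothing about whether the visitor really is that crawler.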

Pfui

10:13 pm on Feb 12, 2009 (gmt 0)

Because "msnbot/2.0b" continued to crawl numerous pages and directories that are officially off limits via META tags, robots.txt rules and X-Robots-Tag directives, I just officially blocked it.

I used to whitelist, simply, !^msnbot, but no longer. Now I'm only allowing the more mindful variations --

RewriteCond %{HTTP_USER_AGENT} !^(msnbot/1\.1|msnbot-media|msnbot-webmaster) [NC]

-- and reminding the rude newcomer of our Terms of Use for Robots:

RewriteCond %{HTTP_USER_AGENT} ^msnbot/2\.0b [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [L,R=301]

(Example code only. Jim corrects mine on a regular basis:)

Blocking "msnbot/2.0b" may kick us out of their SERPs altogether, as blocking msnbot-media stops all MSN crawling. But I'd rather bid MSN's meager search traffic goodbye than discover its bots' significant wrongdoing after the fact. Again.
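To sanity-check the three-line redirect rule above against sample requests before trying it on a live server, the same logic can be mirrored in a few lines of Python. This is only a sketch of what the rule is expected to do, not a substitute for testing in Apache itself; the function name and the returned tuples are made up for illustration.

```python
import re

# [NC] on the RewriteCond corresponds to re.IGNORECASE here.
BLOCKED_UA = re.compile(r"^msnbot/2\.0b", re.IGNORECASE)
# The second RewriteCond has no [NC], so this match is case-sensitive.
ROBOTS_URI = re.compile(r"^/robots\.txt$")


def rewrite_action(user_agent, request_uri):
    """Mimic the rule: send msnbot/2.0b to /robots.txt unless already there."""
    if BLOCKED_UA.search(user_agent) and not ROBOTS_URI.search(request_uri):
        return ("redirect", 301, "/robots.txt")
    return ("pass", None, request_uri)
```

The exclusion for /robots.txt itself matters: without it, the redirected request would match the rule again and loop.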

GaryK

10:19 pm on Feb 12, 2009 (gmt 0)

RewriteCond %{HTTP_USER_AGENT} ^msnbot/2\.0b [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [L,R=301]

You're tough. Remind me never to cross you! ;) But yeah, I sort of agree with you. I've got the newcomer flagged as ban-worthy in my files for now, and maybe for good. I am still not convinced it's really from MSN. I asked my MS contact about it, but so far no reply. He doesn't usually reply until he's got an answer for me.

wilderness

10:54 pm on Feb 12, 2009 (gmt 0)

msnbot-media stops all MSN crawling

Perhaps you have some other issues?

I've had the msnbot-media denied for a long while and the other MSN bots continue to both crawl and add new pages.

dstiles

11:23 pm on Feb 12, 2009 (gmt 0)

I've had tide blocked for yonks and I'm still getting msnbot/1.1 crawling regularly.

I've also just found msnbot/2.0b, coming from 65.55.107.nnn (rDNS msnbot-65-55-107-181.search.msn.com).

Samizdata

11:24 pm on Feb 12, 2009 (gmt 0)

blocking msnbot-media stops all MSN crawling

Not so in my experience, I too blocked it long ago and regular msnbot crawling is unaffected.

Before taking the decision to block it I tried to find official information about msnbot-media on Microsoft websites, but failed. I assumed it dealt with non-HTML files (images, video and Flash, possibly Word, Excel and PDF).

I don't want those types of files indexed, hence the block.

[edited by: Samizdata at 11:31 pm (utc) on Feb. 12, 2009]

Pfui

2:57 am on Feb 13, 2009 (gmt 0)

When I blocked msnbot-media for, given its name, the same reasons, the only file any of the msn bots ever took for days afterwards was an otherwise unchanged robots.txt. The day I allowed msnbot-media, msnbot was right back doing its usual 24/7 crawls. Clearly our mileage varies:)

Btw, msnbot/2.0b just fell into one of my projecthoneypot.org traps.

[edited by: Pfui at 2:59 am (utc) on Feb. 13, 2009]

wilderness

3:10 am on Feb 13, 2009 (gmt 0)

Implemented the following three at same time.

SetEnvIfNoCase User-Agent msnbot\-MM keep_out
SetEnvIfNoCase User-Agent msnbot\-products keep_out
SetEnvIfNoCase User-Agent msnbot\-media keep_out

At that time and as I recall (without checking; somebody may recall the date or be interested in locating the reference; not me) there was some kind of MSN announcement concerning the inconsistency in their own use of bot names as applied in UA's.

MS made an official announcement, "these are our new bot names".
I added these three to robots.txt, and within a short while MSN changed their names again (or at least their conformity to these names in robots.txt) and began crawling outside the boundaries of robots.txt.

Thus I implemented the denials.

edited by wilderness:

BTW, they still crawl because I don't make robots.txt available to them; the result, however, is simply 403s.

GaryK

3:26 am on Feb 13, 2009 (gmt 0)

msndude posted this back in 2006:
[webmasterworld.com...]

GaryK

3:32 am on Feb 13, 2009 (gmt 0)

Sorry for double-posting. It's official. It'll eventually replace 1.1:

[blogs.msdn.com...]

wilderness

3:38 am on Feb 13, 2009 (gmt 0)

Thanks Gary.

From that page:

The new crawler user agent string will appear as:
msnbot/2.0b (+http://search.msn.com/msnbot.htm)

Seems they're already deviating from their own policy. WHAT'S NEW!

65.55.106.230 - - [12/Feb/2009:10:04:16 +0000] "GET /mypage.html HTTP/1.0" 200 4375 "-" "msnbot/2.0b"
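To spot this kind of deviation across a whole log file rather than by eye, something along these lines might help. It is a minimal Python sketch that assumes combined-format log lines like the one above; the pattern, the function name and the three labels are my own, and the "documented" string is the one quoted from the announcement.

```python
import re

# Assumption: Apache "combined" log format, as in the line quoted above.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# User-agent string as quoted from the announcement.
DOCUMENTED_UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"


def classify_agent(line):
    """Label a log line 'documented', 'deviates' or 'other' (None if unparsable)."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    agent = match.group("agent")
    if not agent.lower().startswith("msnbot/2.0b"):
        return "other"
    return "documented" if agent == DOCUMENTED_UA else "deviates"
```

Run over the log line above, this reports "deviates" because the UA is the bare "msnbot/2.0b" with no info URL. Any variant spelling of the documented string is also flagged, so treat the label as a prompt to look closer rather than proof of spoofing.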

GaryK

5:00 am on Feb 13, 2009 (gmt 0)

Their own policy stated it's supposed to be coming from a tide domain, but it's not. I guess it still needs some lessons in manners before MSN removes the b token. :)

jdMorgan

12:31 pm on Feb 13, 2009 (gmt 0)

Just seen:


65.55.25.142 - - [13/Feb/2009:05:19:26 -0500] "GET /robots.txt HTTP/1.1" 200 3544 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.25.142 - - [13/Feb/2009:05:19:27 -0500] "GET / HTTP/1.1" 403 666 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)" "From: If-Modified-Since: Thu, 29 Jan 2009 23:47:57 GMT"

Yep, that's right, they used a "From" header to send an "If-Modified-Since" header...

However, although rDNS *does* resolve to Microsoft, it *does not* resolve to any particular host within Microsoft such as "tide" or "crawl" or "msnbot", so this might be a proxied msnbot spoof or someone's pet project.

Jim

This 41 message thread spans 2 pages.
