Forum Moderators: Ocean10000 & keyplyr

msnbot/2.0b

New crawler from MS in the works?

     
5:04 pm on Feb 1, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts: 2251
votes: 0


msnbot/2.0b ( [search.msn.com...]
131.107.0.95
tide525.microsoft.com

Read robots.txt and then left, so I don't know if it actually obeys it or not.

Is this really a new msnbot in beta?

9:13 pm on Feb 1, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5496
votes: 3


Gary,
In the event that MS has their entire bot services coming from that Class B, then I'll simply be missing out on the benefits of MS.

That IP range has a history as bad as or worse than many of the major colos that many of us have denied.

Don

12:24 am on Feb 2, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5496
votes: 3


Coincidence?

131.107.151.157 - - [01/Feb/2009:19:46:41 +0000] "GET / HTTP/1.0" 403 998 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
131.107.151.157 - - [01/Feb/2009:19:49:40 +0000] "GET / HTTP/1.0" 403 998 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

4:12 am on Feb 2, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


That's good enough for me, Don. Thanks.

6:03 pm on Feb 4, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2040
votes: 1


1.) Before I blocked it by version, msnbot/2.0b properly asked for robots.txt (where msnbot is allowed in many dirs) but then only requested a single file multiple times in numerous sessions yesterday:

filename.html/t/t/t/

No one/nothing else has ever looked for any file with that screwy 'suffix' (and Goo catches everyone's wonky links-to).

2.) I'm seeing different hosts than the one mentioned in the OP, and with a bare UA:

msnbot-65-55-115-175.msn.com
msnbot/2.0b

msnbot-65-55-115-151.msn.com
msnbot/2.0b

[edited by: Pfui at 6:09 pm (utc) on Feb. 4, 2009]

6:11 pm on Feb 4, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 16, 2007
posts:846
votes: 0


Apparently /t/t/t/ is the signature of a successfully 'sploited server...

[webmasterworld.com...]

6:30 pm on Feb 4, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


For the sake of accuracy, that thread mentioned \t\t\t\t as the sign of an exploited server. Not sure if \t\t\t suggests the same thing. :)

6:54 pm on Feb 4, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 16, 2007
posts:846
votes: 0


Probably \t then...

Gary, I see that ip with

www.example.com 131.107.0.95 - - [28/Jan/2009:18:26:45 -0600] "GET /tools/widgets HTTP/1.1" 200 42598 "http://search.live.com/results.aspx?q=air+fare" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"

There's no way I'd rank on those search terms.

Also saw the ip/ua wilderness mentioned.

9:24 pm on Feb 4, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Some time ago, MSNdude posted here, and asked us (Webmasters) not to block the "tide" hostnames at MSN. I suspect that these hosts are used for search-results- and site- quality-checking (and probably cloaking detection in some cases) -- at least, that's the impression I was left with at the time. This explains the "air+fare" referrer mentioned in the previous post.

Those "bogus search queries" from MSN are what started the hubbub here, as a matter of fact.

I allow msnbots from that IP address range, as long as all aspects of their requests comport with past msnbot behavior.

YMMV,
Jim

10:35 pm on Feb 4, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Thanks, Jim. In this case though I have to wonder if the user agent was legit or not. Seems to me we'd hear if MSN was beta testing a new bot. Then again, maybe not. I'll keep an eye on it and see how it behaves during future visits.

3:15 pm on Feb 8, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


OK, to add to the confusion, some bright boy at MS has apparently modified msnbot/2.0b to send different HTTP headers than previous versions.

It appears that the Accept header has changed to "*/*" and the historically included Accept-Encoding and From headers are now omitted.

I should note that rDNS was valid on the requests I'm basing these observations on. A typical hostname was msnbot-65-55-106-139.search.msn.com

Jim
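
The rDNS check Jim mentions can be tightened with forward confirmation: look up the PTR name, check that it ends in an expected suffix, then resolve that name forward and confirm the original IP is among the answers. Here is a minimal Python sketch of that idea. The suffix list is an assumption based only on hostnames quoted in this thread, not an official Microsoft list, and the function and parameter names are hypothetical.

```python
import socket

# Assumption: suffix taken from hostnames quoted in this thread,
# not from any official Microsoft documentation.
TRUSTED_SUFFIXES = (".search.msn.com",)


def default_reverse(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def default_forward(hostname):
    """Return the set of IPv4 addresses that hostname resolves to."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except OSError:
        return set()


def is_verified_crawler(ip, reverse=default_reverse, forward=default_forward,
                        suffixes=TRUSTED_SUFFIXES):
    """Forward-confirmed reverse DNS check.

    The PTR name must end in a trusted suffix AND resolve back to the
    same IP; a matching PTR record alone is not enough.
    """
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(suffixes):
        return False
    return ip in forward(hostname)
```

The two resolver arguments are injectable so the logic can be exercised without touching the network; in normal use you would call `is_verified_crawler(ip)` and let the socket-based defaults do the lookups.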

9:30 pm on Feb 8, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Despite being asked to trust "tide", I remain unconvinced this is a legit msnbot. It was back again over the last few days. Not only did it fail to obey robots.txt despite reading it, it also fell into a bad-bot trap and got itself banned. So until I see something official from MSN, and until it learns some manners, I'll be suggesting this bot is spoofed and the "tide" net ranges are insecure.

10:37 pm on Feb 8, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Sorry for the double-post. It was too late to edit my previous reply. I'm back home now and scanning my log file analysis.

Is this another reason not to assume that anything from tide*.microsoft.com is automatically a legit bot/crawler?

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET CLR 3.0.30618; .NET CLR 3.5.30428; InfoPath.2; MS-RTC LM 8; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; SLCC1; WWTClient2; SPC 3.1 P1 Tc)
131.107.0.106
tide536.microsoft.com

The pattern of files taken is that of a human, not a bot:

/main.css
/index.asp
/page-background.gif
/favicon.ico

And then a very similar user agent:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; GTB5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; Tablet PC 2.0; .NET CLR 1.1.4322)
122.38.247.3
No rDNS but belongs to Xpeed in Seoul, Korea.

PS: msnbot/2.0b+(+http://search.msn.com/msnbot.htm) returned yesterday and took robots.txt before leaving.

[edited by: GaryK at 10:50 pm (utc) on Feb. 8, 2009]

11:28 pm on Feb 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 16, 2007
posts:846
votes: 0


File under more Mickeysoft weirdness:

207.46.92.nn from 3 ip's looking for images and getting 403'd for the UA: "LWP::Simple/5.814" of all things :)

NetName: MICROSOFT-GLOBAL-NET
CIDR: 207.46.0.0/16

8:19 am on Feb 10, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2040
votes: 1


I don't know if this is related to the v2.0b bot or not but --

If you restrict or wrangle MSN's bots in any way and it's been a while since you used Live.com's Webmaster Tools, you might want to log in to their Webmasters section ASAP because you might discover your instructions ignored.

About 45 minutes ago, after seeing a search referer that should never-ever have been a referer, I investigated and found cached pages galore -- when ALL caching is blocked by per-page META -- plus 144 links to a directory where ALL pages and dynamic files are blocked by per-page/post META AND robots.txt, by directory and file type. In a word: Dammit.

The removal process is a far, far cry from Google's handy tool: "In the future, we may provide an automated tool for these requests...". Currently, you have to fill out a form and include X, Y, and Z bits of info. Then you're given a Support Ticket Number and:

"Once we have received your request, we will process the request to remove the URL within 48 hours of the request being accepted."

One can but hope they (re)start heeding the instructions they tell us to give them, and keep heeding same until we instruct otherwise. If not, I see no reason to allow any version of msnbot and its ilk to access my sites because the time and trouble I spend tending to/after their bots simply ain't worth the slim traffic.

8:25 pm on Feb 10, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


What's the point in disallowing files if they're gonna get included anyway? I expect that from some bots, but not the big three.

9:41 pm on Feb 10, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1356
votes: 16


What's the point in disallowing files if they're gonna get included anyway.

Most botmasters really hate to be told "no" - it's pathological.

At the lower end of the market you see it in outright disobedience.

The middle ground is occupied by those who come up with all manner of excuses as to why robots.txt restrictions do not apply to them, or who interpret the exclusion protocol to suit themselves (as in "we will always take the home page because that is not crawling").

In my experience, Google is top of the range for compliance - but even their bots' (general) adherence to the rules is really only for public relations, and they will still use automated processes with other UAs to fetch files that are disallowed, not least because if they didn't then gaming their algorithm would be too easy.

Like many things in life, it's a mixture of charade and farce.

In a Utopian cyberspace there would be an enforceable robots exclusion protocol.

But we operate in a jungle, and must accept the realities.

...

11:22 pm on Feb 10, 2009 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts: 3209
votes: 17


It's reasonably simple to detect the big three. If one of them decides to disobey robots.txt then the simple answer is to terminate the page before anything is displayed and return an error code of choice. Is there a code that says, "*!%$* off MS"? :)

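
One way to sketch that "detect and terminate" idea in Python is to parse the site's own robots.txt with the standard library's `urllib.robotparser` and pick a status code before any page content is produced. The rules, the paths, and the `status_for` helper below are made up for illustration, and 403 is just one possible "error code of choice".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def status_for(user_agent, path):
    """Return 403 when a request breaks our own robots.txt rules, else 200."""
    if parser.can_fetch(user_agent, path):
        return 200
    return 403
```

A request handler would call `status_for(ua, path)` first and send the error response immediately when it returns 403. Note that the parser matches on the product token before the slash, so any `msnbot/x.y` version falls under the `msnbot` rules.
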
10:13 pm on Feb 12, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2040
votes: 1


Because "msnbot/2.0b" continued to crawl numerous pages and directories that are officially off limits via META tags, robots.txt rules and X-Robots-Tag directives, I just officially blocked it.

I used to whitelist, simply, !^msnbot, but no longer. Now I'm only allowing the more mindful variations --

RewriteCond %{HTTP_USER_AGENT} !^(msnbot/1\.1|msnbot-media|msnbot-webmaster) [NC]

-- and reminding the rude newcomer of our Terms of Use for Robots:

RewriteCond %{HTTP_USER_AGENT} ^msnbot/2\.0b [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [L,R=301]

(Example code only. Jim corrects mine on a regular basis:)

Blocking "msnbot/2.0b" may kick us out of their SERPs altogether, as blocking msnbot-media stops all MSN crawling. But I'd rather bid MSN's meager search traffic goodbye than discover its bots' significant wrongdoing after the fact. Again.
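
The user-agent patterns in those conditions can be checked offline before they go anywhere near .htaccess. Below is a rough Python equivalent, with `re.IGNORECASE` standing in for the `[NC]` flag and the alternation written with a solid pipe character. The `classify` helper and its labels are hypothetical; this only mimics the pattern matching, not the rest of mod_rewrite's behavior.

```python
import re

# Transcribed from the RewriteCond patterns in the post; [NC] -> re.IGNORECASE.
ALLOWED = re.compile(r"^(msnbot/1\.1|msnbot-media|msnbot-webmaster)", re.IGNORECASE)
BLOCKED = re.compile(r"^msnbot/2\.0b", re.IGNORECASE)


def classify(user_agent):
    """Label a User-Agent string the way the two conditions would treat it."""
    if BLOCKED.search(user_agent):
        return "redirect-to-robots"
    if ALLOWED.search(user_agent):
        return "allowed"
    return "other"
```

Running a handful of real user-agent strings from the logs through `classify` is a cheap way to catch an unescaped dot or a missing anchor before a rule goes live.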

10:19 pm on Feb 12, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


RewriteCond %{HTTP_USER_AGENT} ^msnbot/2\.0b [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [L,R=301]

You're tough. Remind me never to cross you! ;) But yeah, I sort of agree with you. I've got the newcomer flagged as ban-worthy in my files for now, and maybe for good. I am still not convinced it's really from MSN. I asked my MS contact about it, but so far no reply. He doesn't usually reply until he's got an answer for me.

10:54 pm on Feb 12, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5496
votes: 3


msnbot-media stops all MSN crawling

Perhaps you have some other issues?

I've had the msnbot-media denied for a long while and the other MSN bots continue to both crawl and add new pages.

11:23 pm on Feb 12, 2009 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts: 3209
votes: 17


I've had tide blocked for yonks and I'm still getting msnbot/1.1 crawling regularly.

I've also just found msnbot/2.0b, coming from 65.55.107.nnn (rDNS msnbot-65-55-107-181.search.msn.com).

11:24 pm on Feb 12, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1356
votes: 16


blocking msnbot-media stops all MSN crawling

Not so in my experience, I too blocked it long ago and regular msnbot crawling is unaffected.

Before taking the decision to block it I tried to find official information about msnbot-media on Microsoft websites, but failed. I assumed it dealt with non-HTML files (images, video and Flash, possibly Word, Excel and PDF).

I don't want those types of files indexed, hence the block.

..

[edited by: Samizdata at 11:31 pm (utc) on Feb. 12, 2009]

2:57 am on Feb 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2040
votes: 1


When I blocked msnbot-media for, given its name, the same reasons, the only file any of the msn bots ever took for days afterwards was an otherwise unchanged robots.txt. The day I allowed msnbot-media, msnbot was right back doing its usual 24/7 crawls. Clearly our mileage varies:)

Btw, msnbot/2.0b just fell into one of my projecthoneypot.org traps.

[edited by: Pfui at 2:59 am (utc) on Feb. 13, 2009]

3:10 am on Feb 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5496
votes: 3


Implemented the following three at same time.

SetEnvIfNoCase User-Agent msnbot\-MM keep_out
SetEnvIfNoCase User-Agent msnbot\-products keep_out
SetEnvIfNoCase User-Agent msnbot\-media keep_out

At that time and as I recall (without checking; somebody may recall the date or be interested in locating the reference; not me) there was some kind of MSN announcement concerning the inconsistency in their own use of bot names as applied in UA's.

MS made an official announcement, "these are our new bot names".
I added these three to robots.txt and within a short while, MSN changed their names again (or at least their conformity to these names in robots.txt) and began crawling outside the boundaries of robots.txt.

Thus I implemented the denials.

edited by wilderness:

BTW, they still crawl because I don't make robots.txt available to them; however, the result is simply 403s.

3:26 am on Feb 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


msndude posted this back in 2006:
[webmasterworld.com...]

3:32 am on Feb 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Sorry for double-posting. It's official. It'll eventually replace 1.1:

[blogs.msdn.com...]

3:38 am on Feb 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5496
votes: 3


Thanks Gary.

From that page:

"The new crawler user agent string will appear as:"

msnbot/2.0b (+http://search.msn.com/msnbot.htm)

Seems they're already deviating from their own policy. WHAT'S NEW!

65.55.106.230 - - [12/Feb/2009:10:04:16 +0000] "GET /mypage.html HTTP/1.0" 200 4375 "-" "msnbot/2.0b"
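
Spotting that kind of mismatch by eye gets tedious. Here is a small Python sketch that pulls the user agent out of a log line and compares it with the string quoted from the announcement. The regex assumes the standard Apache "combined" log layout, and `parse_line` and `matches_announced` are hypothetical helper names.

```python
import re

# Assumes Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# User-agent string as quoted from the announcement in this thread.
ANNOUNCED_UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def matches_announced(line):
    """True only if the line's user agent is exactly the announced string."""
    fields = parse_line(line)
    return bool(fields) and fields["agent"] == ANNOUNCED_UA
```

Fed the log line above, `matches_announced` returns False, because the agent field is the bare `msnbot/2.0b` without the URL part.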

5:00 am on Feb 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Their own policy stated it's supposed to be coming from a tide domain, but it's not. I guess it still needs some lessons in manners before MSN removes the b token. :)

12:31 pm on Feb 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Just seen:

65.55.25.142 - - [13/Feb/2009:05:19:26 -0500] "GET /robots.txt HTTP/1.1" 200 3544 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.25.142 - - [13/Feb/2009:05:19:27 -0500] "GET / HTTP/1.1" 403 666 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)" "From: If-Modified-Since: Thu, 29 Jan 2009 23:47:57 GMT"

Yep, that's right, they used a "From" header to send an "If-Modified-Since" header...

However, although rDNS *does* resolve to Microsoft, it *does not* resolve to any particular host within Microsoft such as "tide" or "crawl" or "msnbot", so this might be a proxied msnbot spoof or someone's pet project.

Jim

This 41-message thread spans 2 pages.
 
