Read robots.txt and then left, so I don't know if it actually obeys it or not.
Is this really a new msnbot in beta?
filename.html/t/t/t/
No one/nothing else has ever looked for any file with that screwy 'suffix' (and Goo catches everyone's wonky links-to).
2.) I'm seeing different hosts than the one mentioned in the OP, and with a bare UA:
msnbot-65-55-115-175.msn.com
msnbot/2.0b
msnbot-65-55-115-151.msn.com
msnbot/2.0b
[edited by: Pfui at 6:09 pm (utc) on Feb. 4, 2009]
[webmasterworld.com...]
Gary, I see that IP with:
www.example.com 131.107.0.95 - - [28/Jan/2009:18:26:45 -0600] "GET /tools/widgets HTTP/1.1" 200 42598 "http://search.live.com/results.aspx?q=air+fare" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
There's no way I'd rank on those search terms.
Also saw the IP/UA wilderness mentioned.
Those "bogus search queries" from MSN are what started the hubbub here, as a matter of fact.
I allow msnbots from that IP address range, as long as all aspects of their requests comport with past msnbot behavior.
YMMV,
Jim
It appears that the Accept header has changed to "*/*" and the historically included Accept-Encoding and From headers are now omitted.
I should note that rDNS was valid on the requests I'm basing these observations on. A typical hostname was msnbot-65-55-106-139.search.msn.com
Jim
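For anyone who wants to script the rDNS check Jim describes, here is a minimal Python sketch of a forward-confirmed reverse DNS test. The function name, the "msnbot-" prefix test, and the trusted suffixes are my assumptions, drawn only from the hostnames quoted in this thread; they are not an official Microsoft list. The resolvers are injectable so the logic can be tried without live DNS.

```python
import socket

# Assumption: suffixes taken from hostnames quoted in this thread only.
TRUSTED_SUFFIXES = (".search.msn.com", ".msn.com")


def is_verified_msnbot(ip, reverse=None, forward=None, suffixes=TRUSTED_SUFFIXES):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include IP."""
    # Default to real DNS lookups; callers may pass fakes for testing.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Hypothetical naming rule: msnbot-<ip-with-dashes>.<trusted suffix>
    if not host.startswith("msnbot-") or not host.endswith(suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

A hostname that merely looks right is not enough; the forward lookup has to return the original IP, which is what makes the check hard to spoof with a bogus PTR record.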
Is this another reason not to assume that traffic from tide*.microsoft.com is automatically a legit bot/crawler?
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET CLR 3.0.30618; .NET CLR 3.5.30428; InfoPath.2; MS-RTC LM 8; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; SLCC1; WWTClient2; SPC 3.1 P1 Tc)
131.107.0.106
tide536.microsoft.com
The pattern of files taken is that of a human, not a bot:
/main.css
/index.asp
/page-background.gif
/favicon.ico
And then a very similar user agent:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; GTB5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; Tablet PC 2.0; .NET CLR 1.1.4322)
122.38.247.3
No rDNS but belongs to Xpeed in Seoul, Korea.
PS: msnbot/2.0b+(+http://search.msn.com/msnbot.htm) returned yesterday and took robots.txt before leaving.
[edited by: GaryK at 10:50 pm (utc) on Feb. 8, 2009]
If you restrict or wrangle MSN's bots in any way and it's been a while since you used Live.com's Webmaster Tools, you might want to log in to their Webmasters section ASAP because you might discover your instructions are being ignored.
About 45 minutes ago, after seeing a search referer that should never-ever have been a referer, I investigated and found cached pages galore -- when ALL caching is blocked by per-page META -- plus 144 links to a directory where ALL pages and dynamic files are blocked by per-page/post META AND robots.txt, by directory and file type. In a word: Dammit.
The removal process is a far, far cry from Google's handy tool: "In the future, we may provide an automated tool for these requests...". Currently, you have to fill out a form and include X, Y, and Z bits of info. Then you're given a Support Ticket Number and:
"Once we have received your request, we will process the request to remove the URL within 48 hours of the request being accepted."
One can but hope they (re)start heeding the instructions they tell us to give them, and keep heeding same until we instruct otherwise. If not, I see no reason to allow any version of msnbot and its ilk to access my sites because the time and trouble I spend tending to/after their bots simply ain't worth the slim traffic.
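Before filing a removal request, it may be worth confirming that your robots.txt really does say what you think it says for msnbot. A rough Python sketch using the standard library's robots.txt parser follows; the robots.txt content, the /private/ directory, and the example URLs are hypothetical, and this only covers the robots.txt side, not per-page META tags.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the directory name is illustrative only.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

If `is_allowed(ROBOTS_TXT, "msnbot/2.0b", url)` comes back False for a URL that shows up in their index, the file is not the problem.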
What's the point in disallowing files if they're gonna get included anyway?
Most botmasters really hate to be told "no" - it's pathological.
At the lower end of the market you see it in outright disobedience.
The middle ground is occupied by those who come up with all manner of excuses as to why robots.txt restrictions do not apply to them, or who interpret the exclusion protocol to suit themselves (as in "we will always take the home page because that is not crawling").
In my experience, Google is top of the range for compliance - but even their bots' (general) adherence to the rules is really only for public relations, and they will still use automated processes with other UAs to fetch files that are disallowed, not least because if they didn't then gaming their algorithm would be too easy.
Like many things in life, it's a mixture of charade and farce.
In a Utopian cyberspace there would be an enforceable robots exclusion protocol.
But we operate in a jungle, and must accept the realities.
...
I used to whitelist, simply, !^msnbot, but no longer. Now I'm only allowing the more mindful variations --
RewriteCond %{HTTP_USER_AGENT} !^(msnbot/1\.1|msnbot-media|msnbot-webmaster) [NC]
-- and reminding the rude newcomer of our Terms of Use for Robots:
RewriteCond %{HTTP_USER_AGENT} ^msnbot/2\.0b [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [L,R=301]
(Example code only. Jim corrects mine on a regular basis:)
Blocking "msnbot/2.0b" may kick us out of their SERPs altogether, as blocking msnbot-media stops all MSN crawling. But I'd rather bid MSN's meager search traffic goodbye than discover its bots' significant wrongdoing after the fact. Again.
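One way to sanity-check the two UA patterns before they go live is to try them against user-agent strings from your logs. Here is a small Python sketch that mirrors the conditions; the function names are mine, the alternation uses solid pipe characters (the forum displays them as broken bars), and Python's regex engine is only an approximation of Apache's, so treat it as a rough test and not proof of what mod_rewrite will do.

```python
import re

# Same patterns as the RewriteCond lines; [NC] becomes re.IGNORECASE.
ALLOWED_VARIANTS = re.compile(
    r"^(msnbot/1\.1|msnbot-media|msnbot-webmaster)", re.IGNORECASE
)
BETA_BOT = re.compile(r"^msnbot/2\.0b", re.IGNORECASE)


def is_allowed_variant(user_agent):
    """True if the UA starts with one of the whitelisted msnbot names."""
    return bool(ALLOWED_VARIANTS.match(user_agent))


def redirects_to_robots(user_agent, path):
    """True if the beta bot asks for anything other than /robots.txt."""
    return bool(BETA_BOT.match(user_agent)) and path != "/robots.txt"
```

The `path != "/robots.txt"` test stands in for the second RewriteCond, which is what keeps the 301 from looping once the bot follows the redirect.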
blocking msnbot-media stops all MSN crawling
Not so in my experience, I too blocked it long ago and regular msnbot crawling is unaffected.
Before taking the decision to block it I tried to find official information about msnbot-media on Microsoft websites, but failed. I assumed it dealt with non-HTML files (images, video and Flash, possibly Word, Excel and PDF).
I don't want those types of files indexed, hence the block.
..
[edited by: Samizdata at 11:31 pm (utc) on Feb. 12, 2009]
Btw, msnbot/2.0b just fell into one of my projecthoneypot.org traps.
[edited by: Pfui at 2:59 am (utc) on Feb. 13, 2009]
SetEnvIfNoCase User-Agent msnbot\-MM keep_out
SetEnvIfNoCase User-Agent msnbot\-products keep_out
SetEnvIfNoCase User-Agent msnbot\-media keep_out
At that time, as I recall (I haven't checked; someone else may remember the date or want to locate the reference), MSN made some kind of announcement about the inconsistency in their own use of bot names in UAs.
MS made an official announcement, "these are our new bot names".
I added these three to robots.txt and within a short while, MSN changed their names again (or at least their conformity to these names in robots.txt) and began crawling outside the boundaries of robots.txt.
Thus I implemented the denials.
edited by wilderness:
BTW, they still crawl because I don't make robots.txt available to them; the result, however, is simply 403s.
[blogs.msdn.com...]
65.55.25.142 - - [13/Feb/2009:05:19:26 -0500] "GET /robots.txt HTTP/1.1" 200 3544 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.25.142 - - [13/Feb/2009:05:19:27 -0500] "GET / HTTP/1.1" 403 666 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)" "From: If-Modified-Since: Thu, 29 Jan 2009 23:47:57 GMT"
However, although rDNS *does* resolve to Microsoft, it *does not* resolve to any particular host within Microsoft such as "tide" or "crawl" or "msnbot", so this might be a proxied msnbot spoof or someone's pet project.
Jim
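For comparing entries like the two above across a whole log, a short Python sketch that splits a combined-format line into fields may help. The regex and field names are my own, and it assumes the line starts with the client IP (no leading vhost column); any extra trailing fields, such as custom-logged headers, are simply ignored.

```python
import re

# Assumed layout: Apache "combined" format, client IP first.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    return rec
```

Grouping the parsed records by `ip` and `agent` makes it easier to spot a UA that claims to be msnbot but arrives from a host that doesn't fit the usual naming.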