Return / continuation of msnbot

search.msn.com/msnbot.html

dstiles

1:51 pm on Oct 16, 2019 (gmt 0)

UA: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
IP: 207.46.13.nn (part of the MSFT range)

The URL redirects to the bing bots page (which I cannot load using either PaleMoon or Waterfox).

I've blocked this UA for a long time on my IIS server and it's not enabled on my Linux server.

From memory, this bot was a beta version many years ago. Seems an odd bot for these days, possibly an MSFT user?

lucy24

5:27 pm on Oct 16, 2019 (gmt 0)

:: detour to raw logs ::

Gosh, there it is, showing up suddenly just this past 5 October from all the usual crawl ranges (what a lot of them bing has always used!).

21% of requests are for robots.txt (in batches of up to 8 at once), which you will agree is diagnostic of bing

NO requests are for pages, while everything else is covered: images, stylesheets, fonts, scripts, even sitemap

On my personal site--which was once my comprehensive site, lo these many years ago--the only request is for a specific image that has been redirected since December 2013--and which only existed since September 2012, giving us a terminus post quem as well.

On my main site, the newest requested file was only created a few months ago.

:: further detour to relevant part of {athome} directory as I can't be bothered to spend time with archived logs ::

Have we ever seen "msnbot" by that name? Within the present geological era? The only thing I find is msnbot-media.

dstiles

6:02 pm on Oct 16, 2019 (gmt 0)

Glad I'm not alone in this! :)

That particular UA has, as I said, been blocked by me since almost forever (it seems) on IIS sites. I get so many good and bad robots on my couple of dozen sites that I seldom actually look at the entries now.

Reason I noticed them at all is that I have recently moved some half-dozen sites to linux/apache (you may recall my requests for help a while back). Those sites are very low in the hit parade, so low that I can monitor them several times a day and only get a couple of dozen page hits and rejections at a time.

Really, I was just curious as to why a beta bot had returned and was using non-bot IP ranges. And yes, I think you are correct in its targets, though so far I've only seen robots.txt - but then, the page names are no longer the same as they were, although most images are.

lucy24

7:03 pm on Oct 16, 2019 (gmt 0)

I realized I could get more information out of headers, so I spent some time poring over them, leading to discovery that on one occasion (just one) they did request a page file--an extremely old one, but with a current URL that dates back only to November 2015. Headers are identical to bingbot, except that msnbot sent the
From: bingbot(at)microsoft.com
header, while bingbot didn't. (Not a typo.) Overall, the bingbot sends this header about half the time, which is food for thought but I've never bothered to think about it.

2019-10-14:03:13:37
URL: /robots.txt
IP: 157.55.39.64
Content-Length: 0
User-Agent: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Host: example.com
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache

2019-10-14:00:19:51
URL: /robots.txt
IP: 157.55.39.30
Content-Length: 0
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Host: example.com
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache

The Content-Length header doesn't mean anything; it has been part of all logged requests ever since the server moved to Apache 2.4. The Pragma header is very characteristic of bing; apparently nobody ever told them it was supplanted by Cache-Control so you don't need both.

The point here is not so much which headers they send, as that they're identical to those sent by bingbot, right down to the order.
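For anyone who wants to repeat that comparison on their own logs, here is a minimal Python sketch. It assumes log blocks shaped like the two above; the function name `header_fingerprint` and the `LOG_FIELDS` set are made up for illustration, and the User-Agent value is masked because that is the one header expected to differ.

```python
import re

# Fields written by the logging script itself, not sent by the client (assumption)
LOG_FIELDS = {"URL", "IP", "HTTPS"}

def header_fingerprint(block):
    """Return (name, value) pairs in logged order, with the User-Agent value masked."""
    fingerprint = []
    for line in block.strip().splitlines():
        match = re.match(r"([A-Za-z][A-Za-z-]*):\s*(.*)", line)
        if not match or match.group(1) in LOG_FIELDS:
            continue  # skips the timestamp line and the logger's own fields
        name, value = match.group(1), match.group(2).strip()
        if name == "User-Agent":
            value = "*"
        fingerprint.append((name, value))
    return fingerprint

MSNBOT = """
2019-10-14:03:13:37
URL: /robots.txt
IP: 157.55.39.64
Content-Length: 0
User-Agent: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Host: example.com
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache
"""

BINGBOT = """
2019-10-14:00:19:51
URL: /robots.txt
IP: 157.55.39.30
Content-Length: 0
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Host: example.com
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache
"""

same = header_fingerprint(MSNBOT) == header_fingerprint(BINGBOT)
print("identical headers and order:", same)
```

Comparing lists rather than sets is deliberate: a list comparison fails if the same headers arrive in a different order, which is exactly the property being checked.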

I don't log headers on non-page requests other than robots.txt (which is rewritten to a robots.php), but requests to the old site now return a 410 (because after 5½ years I got tired of redirecting with attendant htaccess bloat), which does log headers:

2019-10-06:09:59:40
URL: /ebooks/images/norfolk.png
HTTPS: 
IP: 40.77.167.26
Content-Length: 0
User-Agent: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Host: example.old
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache

Again, utterly familiar. And yup, that means the bingbot-by-that-name has an outdated URL in its UA string. The page lists just three UA families: bingbot (vanilla, iPhone and Windows phone); AdIdxBot (same three variants); Bing Preview (vanilla and Windows Phone). No mention of msnbot at all.

Their own verification tool [bing.com] says
Yes - this IP address is a verified Bingbot IP address.

Hmmm.

dstiles

9:47 am on Oct 17, 2019 (gmt 0)

> Yes - this IP address is a verified Bingbot IP address.

In that case, why does DNS say it isn't?

I added a 0 in place of the nn and it still gave me Verified. I think, given it's an MSFT range and it's not bingbot, they are not being exactly truthful.

lucy24

4:17 pm on Oct 17, 2019 (gmt 0)

> In that case, why does DNS say it isn't?

Because bing wants to leave its options open? "We don't currently crawl from this range, but at some hypothetical date in the future we might want to, so we'll claim it just in case."

I cross-checked raw logs for the two IPs I picked out of headers (above). Both have been used many times by the bingbot-under-that-name, so perhaps the question is where DNS gets its information? Is bing simply sloppy about attaching the "bingbot crawl range" label everywhere it belongs?

dstiles

10:08 am on Oct 18, 2019 (gmt 0)

> Is bing simply sloppy

Could be. They have a very inefficient IP setup for the bot.

And what about webmasters who follow advice about checking DNS to see if bots are genuine? Though I doubt many do - I don't, certainly. I nominate a number of IP ranges that have shown bingbot usage and check UAs against those.
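The range-list approach described above could look roughly like this in Python. It is only a sketch: the three /24 prefixes are assumed from the IPs quoted earlier in this thread and are not an authoritative list of bing crawl ranges, and `classify` and its return labels are hypothetical names.

```python
import ipaddress

# Illustrative only: prefixes assumed from IPs quoted in this thread, not an official list
BING_RANGES = [
    ipaddress.ip_network(net)
    for net in ("157.55.39.0/24", "40.77.167.0/24", "207.46.13.0/24")
]
BING_UA_TOKENS = ("bingbot", "msnbot")

def classify(ip, user_agent):
    """Label a request by checking its claimed UA against the nominated ranges."""
    claims_bing = any(token in user_agent.lower() for token in BING_UA_TOKENS)
    address = ipaddress.ip_address(ip)
    in_range = any(address in net for net in BING_RANGES)
    if claims_bing and in_range:
        return "allow"
    if claims_bing:
        return "spoofed"  # bing UA from an address outside the nominated ranges
    return "other"

print(classify("157.55.39.64", "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"))
```

The trade-off is the one raised in this thread: the list has to be maintained by hand, but it does not depend on what DNS happens to say about a given address.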

A few years ago I actually ran a dig on as many MS ranges as I could find and singled out IPranges that had msnbot in the returns; that formed the basis of my MS bot ranges.
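For comparison, the DNS check that webmasters are usually advised to run (reverse lookup, then a forward lookup to confirm) can be sketched in Python as follows. The `.search.msn.com` suffix is an assumption based on the msnbot hostnames such lookups return, and `verify_bing_ip` is a hypothetical helper; the resolver functions are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket

def verify_bing_ip(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: PTR must end in the expected suffix and map back to ip."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.lower().rstrip(".").endswith(".search.msn.com"):
        return False  # assumed suffix; adjust if your lookups show otherwise
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses

# Offline demonstration with stand-in resolvers (hypothetical hostname)
def fake_reverse(ip):
    return ("msnbot-157-55-39-64.search.msn.com", [], [ip])

def fake_forward(hostname):
    return (hostname, [], ["157.55.39.64"])

print(verify_bing_ip("157.55.39.64", fake_reverse, fake_forward))
```

Calling `verify_bing_ip(ip)` with no extra arguments uses the real system resolver. As the thread notes, a `False` result may only mean the range has no PTR records, not that the bot is fake.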

lucy24

5:57 pm on Oct 18, 2019 (gmt 0)

> They have a very inefficient IP setup for the bot.

How true indeed. Consider that Google continues to do all its crawling from a single /20. (They have periodically said they might crawl from non-US, or non-ARIN, ranges for some sites, but I don't think anyone has ever seen concrete evidence.) And then by contrast bing is spread over half the world's IP space. I think Yandex has even more crawl ranges, though none is very big.

Granted, bing uses some of those same ranges for what I call the plainclothes bingbot, but that's a comparatively small proportion of their activity. We won't even talk about how much of their crawl budget is allocated, day in and day out, to URLs that have been redirected or even 410'd for five years or more...

dstiles

10:40 am on Oct 19, 2019 (gmt 0)

I tried 410 on old htm and accidental php pages. Neither of the "big 2" takes any notice. Not sure about the others.

From my own IP lists, bing has far more IP ranges (11 ranging from /24 to /15) than yandex (6 from /19 to /17). Not saying MS use anywhere near the whole IP set but bots are scattered across each range.

Bots from apple, clara, exabot etc manage with a mere /24 (as far as I can tell). Accepted they are not as big as bing but there is something wrong with the MS setup. Facebookhits seems to come from 7 ranges but they are limited to /22 to /18.

At least two annoying bots (duckduckgo and cliqz) run from amazon ranges, which makes their management by webmasters... fun?

And last night I again ran Mozilla Observatory on an HTTPS site and discovered I had to open up a complete /16 (at least!) of google IPs. For a so-called security test that is definitely a "bad thing".

Back to MS - how about forcing them to relinquish a few ranges? That would tighten up their thinking. Or maybe not.

lucy24

5:13 pm on Oct 19, 2019 (gmt 0)

> I tried 410 on old htm and accidental php pages. Neither of the "big 2" takes any notice. Not sure about the others.

My personal experience is that 410 makes G stop crawling a lot faster, while nobody else particularly cares. I recently threw in the towel and started returning 410 for directories that had been 301'd (to a different site) for five and a half years (literally) and saw G requests plummet. Meanwhile, bing keeps faithfully requesting one particular URL that has been returning a 410 since around 2012. I can't imagine why; it definitely isn't linked from anywhere.

Remember when any major corporation could pick up an /8 for the asking? Microsoft must have been about five minutes too late, unless--hahaha--they considered the possibility and said Nah, this internet thing isn't going anywhere. So while Apple enjoys sole possession of 17, and Merck and Eli Lilly are happily selling off vast IPv4 ranges to Amazon, Microsoft is left scrambling for any bit of real estate it can find.

I've got a whopping 18 (eighteen) Yandex ranges listed, including one IPv6. But they may not currently be using all of them; in fact there's a single aa.bb.cc.dd (down to the last digit) that's by far their favorite.

dstiles

9:58 am on Oct 20, 2019 (gmt 0)

> Nah, this internet thing isn't going anywhere.

Bill Gates is reputed to have said just that. :)

The yandex ranges - you're right. On looking at my IIS security db I find 16, but on apache I have just the 6 actually calling. Could be due to geoip blocking RU and my bots setup allowing just those I've allowed, which happen to be all RU. Should include three US ranges but haven't seen them yet on apache.