2019-10-14:03:13:37
URL: /robots.txt
IP: 157.55.39.64
Content-Length: 0
User-Agent: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Host: example.com
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache
2019-10-14:00:19:51
URL: /robots.txt
IP: 157.55.39.30
Content-Length: 0
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Host: example.com
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache
The Content-Length header doesn't mean anything; it has been part of every logged request ever since the server moved to 2.4. The Pragma header is very characteristic of bing; apparently nobody ever told them it was supplanted by Cache-Control, so you don't need both.
2019-10-06:09:59:40
URL: /ebooks/images/norfolk.png
HTTPS:
IP: 40.77.167.26
Content-Length: 0
User-Agent: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Host: example.old
Accept-Encoding: gzip, deflate
Accept: */*
Pragma: no-cache
Connection: close
Cache-Control: no-cache
Again, utterly familiar. And yup, that means the bingbot-by-that-name has an outdated URL in its UA string. The page lists just three UA families: bingbot (vanilla, iPhone and Windows Phone); AdIdxBot (same three variants); Bing Preview (vanilla and Windows Phone). No mention of msnbot at all.
In that case, why does DNS say it isn't?
Because bing wants to leave its options open? “We don’t currently crawl from this range, but at some hypothetical date in the future we might want to, so we’ll claim it just in case.”
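For anyone who wants to check the DNS side themselves, here is a rough sketch of a forward-confirmed reverse DNS test in Python. It assumes the hostname suffix `.search.msn.com`, which is the one bing documents for its crawlers; the function name `is_verified_crawler` and the injectable `reverse`/`forward` parameters are my own invention for illustration, and the forward lookup as written only handles IPv4.

```python
import socket


def is_verified_crawler(ip, suffix=".search.msn.com",
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (hypothetical helper).

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to end with the expected suffix (assumed here).
    3. Forward-resolve that hostname and require the original IP to be
       among the returned addresses.
    """
    try:
        host = reverse(ip)[0]
    except OSError:
        # No PTR record, or the lookup failed.
        return False

    if not host.lower().rstrip(".").endswith(suffix):
        return False

    try:
        addresses = forward(host)[2]
    except OSError:
        return False

    return ip in addresses
```

The resolvers are passed in as parameters so the logic can be exercised without touching the network; in real use you would just call `is_verified_crawler("157.55.39.64")` and let it hit DNS. A UA string alone proves nothing, which is the whole reason for doing the round trip.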
They have a very inefficient IP setup for the bot.
How true indeed. Consider that Google continues to do all its crawling from a single /20. (They have periodically said they might crawl from non-US, or non-ARIN, ranges for some sites, but I don't think anyone has ever seen concrete evidence.) And then by contrast bing is spread over half the world's IP space. I think Yandex has even more crawl ranges, though none is very big.
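To put a number on the spread, a quick sketch that collapses logged IPs into their enclosing blocks, using Python's standard `ipaddress` module. The function name `covering_networks` is made up for this example, and the /20 prefix is just borrowed from the Google comparison; the three addresses are the ones from the bing requests quoted earlier in the thread.

```python
import ipaddress


def covering_networks(ips, prefix=20):
    """Return the distinct /prefix networks that contain the given IPs.

    strict=False lets ip_network() accept a host address and mask it
    down to the enclosing network instead of raising ValueError.
    """
    networks = {
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for ip in ips
    }
    return [str(net) for net in sorted(networks)]


logged_ips = ["157.55.39.64", "157.55.39.30", "40.77.167.26"]
print(covering_networks(logged_ips))
```

Even these three requests land in two separate /20s; run it over a month of logs and the list gets long quickly.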
I tried 410 on old htm and accidental php pages. Neither of the "big 2" takes any notice. Not sure about the others.
My personal experience is that 410 makes G stop crawling a lot faster, while nobody else particularly cares. I recently threw in the towel and started returning 410 for directories that had been 301'd (to a different site) for five and a half years (literally), and saw G requests plummet. Meanwhile, bing keeps faithfully requesting one particular URL that has been returning a 410 since around 2012. I can't imagine why; it definitely isn't linked from anywhere.