At Home with the Robots: 2017 edition


It’s been another two years. Time to see what the robots were up to in April 2017.
2015 edition [webmasterworld.com]
2013 edition [webmasterworld.com]
2012 edition [webmasterworld.com]

In the course of April 2017, robots accounted for something under half of all requests. That means, of course, more than half of all page requests, since most robots ask only for HTML.

I’ve got a generally permissive attitude to robots. Ask for robots.txt, do what it says--i.e. go away at once if you find yourself denied by name, don’t request material from excluded directories, and crawl at a reasonable speed--and I’ll poke a hole for you. My default robots.txt lists user-agents sequentially:

User-Agent: name1
User-Agent: name2
User-Agent: name3
Disallow: /

If a robot doesn’t respond to this, I try giving it a block to itself:

User-Agent: yourname
Disallow: /

and see if that works any better. On very, very rare occasions a robot doesn’t understand the first form, but honors the second. (“Never attribute to malice that which can be adequately explained by stupidity.”) Far more often, it never intended to obey in the first place.
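
For anyone who wants to see how a by-the-book parser reads the two forms, here is a minimal sketch using Python's standard urllib.robotparser. The agent names are the placeholders from above; "friendlybot" and example.com are made up for illustration. It only shows what one well-behaved parser does. A real robot may read the file quite differently, which is the whole point of the experiment.

```python
from urllib.robotparser import RobotFileParser

# First form: several user-agents sharing one Disallow rule.
GROUPED = """\
User-Agent: name1
User-Agent: name2
User-Agent: name3
Disallow: /
"""

# Second form: a block all to itself.
SINGLE = """\
User-Agent: yourname
Disallow: /
"""

def is_allowed(robots_txt, agent, url="http://example.com/page.html"):
    """Return True if this robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    for agent in ("name1", "name2", "name3", "friendlybot"):
        print(agent, is_allowed(GROUPED, agent))
    print("yourname", is_allowed(SINGLE, "yourname"))
```

With this parser, all three grouped names are denied and the unlisted "friendlybot" is allowed, so a robot that ignores the first form is not following the usual reading of the file.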

If I give a full /32 IP, it means the robot used that IP consistently throughout all its visits. It doesn’t necessarily mean they will use the identical IP on your site. If I don’t give an IP at all, it means it came from an array of different addresses, so it is probably distributed. This being the Good Robots page, I don’t need to consider fakers.
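
A note on notation: the IP ranges in this post are written in a shorthand such as 66.249.64-79 or 40.77.167, with any missing trailing octets meaning "everything". If you would rather have CIDR blocks for a firewall or a whitelist, something like the following sketch could do the conversion. The function name is my own invention, and it assumes each octet is either a single number or a lo-hi range.

```python
import ipaddress

def shorthand_to_networks(spec):
    """Turn a shorthand like '66.249.64-79' into a list of CIDR networks."""
    lo_parts, hi_parts = [], []
    for part in spec.split("."):
        lo, _, hi = part.partition("-")
        lo_parts.append(int(lo))
        hi_parts.append(int(hi) if hi else int(lo))
    # Missing trailing octets cover the whole 0-255 span.
    while len(lo_parts) < 4:
        lo_parts.append(0)
        hi_parts.append(255)
    first = ipaddress.IPv4Address(".".join(map(str, lo_parts)))
    last = ipaddress.IPv4Address(".".join(map(str, hi_parts)))
    return list(ipaddress.summarize_address_range(first, last))

if __name__ == "__main__":
    for spec in ("66.249.64-79", "40.77.167", "141.8.143.141"):
        print(spec, "->", shorthand_to_networks(spec))
```

A range that does not line up on a power-of-two boundary comes back as more than one network, which is why the function returns a list.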

Stop the Presses

For as long as I can remember, bingbot has been the Abou ben Adhem of robots.txt requests. This year it didn’t even make the Top Three. Almost one-quarter of all robots.txt requests came from ... Seznambot. Bet you didn’t see that one coming. Next came BLEXBot--about whom, more later--DotBot, and finally bingbot, with less than 6% of all robots.txt requests. (The Googlebot asked for robots.txt precisely 31 times--including redirects--in the course of the month, putting it in 7th place overall. I guess that’s one a day, plus one to grow on.) On the other hand, the winner in the subcategory of Redirected robots.txt Requests goes to Yandex: 12 of its requests were redirected. Really, yandex, haven’t you learned my canonical name by now?
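
If you want to run the same tally on your own logs, here is a rough sketch of how the robots.txt share per user-agent might be counted. It assumes the Apache combined log format; the sample lines, their IPs and their timestamps are invented purely so the code has something to chew on, and they are not taken from my logs.

```python
import re
from collections import Counter

# Combined Log Format:
# ip - - [date] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def robots_txt_shares(lines):
    """Return {user-agent: fraction of all robots.txt requests}."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            counts[match.group("ua")] += 1
    total = sum(counts.values())
    return {ua: n / total for ua, n in counts.items()} if total else {}

SEZNAM_UA = "Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)"
BING_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
GOOGLE_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def sample_line(ip, path, ua):
    """Build one invented log line (documentation-range IPs, fixed timestamp)."""
    return (f'{ip} - - [01/Apr/2017:12:00:00 +0000] '
            f'"GET {path} HTTP/1.1" 200 1234 "-" "{ua}"')

# Invented sample data, for illustration only.
SAMPLE = [
    sample_line("192.0.2.1", "/robots.txt", SEZNAM_UA),
    sample_line("192.0.2.1", "/robots.txt", SEZNAM_UA),
    sample_line("192.0.2.2", "/robots.txt", BING_UA),
    sample_line("192.0.2.3", "/index.html", GOOGLE_UA),
]

if __name__ == "__main__":
    for ua, share in robots_txt_shares(SAMPLE).items():
        print(f"{share:.0%}  {ua}")
```

Redirected requests (301 and the like) are counted along with everything else here; filter on the status group if you want them broken out separately.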

Search Engines

The site that I’m looking at is responsive, with no separate /m/ site or UA-based CSS. Search engines may have a different UA distribution if they know that your site serves variable content, whether at the same or different URLs.

Still Number One

Googlebot was, as usual, the single largest robotic visitor--but the margin wasn’t huge.

IP: 66.249.64-79 but primarily from the 66.249.64-67 subsector (idle query: year after year, Google does its crawling from this single /20 range. Why do other search engines require such a vast array of IPs by comparison?)
UA:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The vanilla Googlebot accounts for about half of its crawls. Most of the rest are:

Googlebot-Image/1.0

Then there’s the mobile version:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This UA, using Android, showed up on 14 April of 2016; the last appearance of the “iPhone” UA was on 18 April.

The older UAs with “Googlebot-Mobile” (DoCoMo and SAMSUNG in about equal measure) haven’t been around since the end of October 2016. So in April 2017, the Googlebot as such used only three UAs.
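
For log analysis, the three current UAs can be told apart with some very simple string tests. This is only a sketch based on the strings quoted above; the function name and the labels are mine. It looks at the UA alone, so anything that merely claims to be the Googlebot will pass unless you also check the IP.

```python
def googlebot_variant(ua):
    """Return 'image', 'mobile', 'desktop', or None if not a current Googlebot UA."""
    if ua.startswith("Googlebot-Image/"):
        return "image"
    if "Googlebot/" not in ua:
        # Covers non-Google UAs and the retired "Googlebot-Mobile" strings.
        return None
    return "mobile" if "Mobile" in ua else "desktop"

DESKTOP = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
IMAGE = "Googlebot-Image/1.0"
MOBILE = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
          "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
          "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
          "+http://www.google.com/bot.html)")

if __name__ == "__main__":
    for ua in (DESKTOP, IMAGE, MOBILE):
        print(googlebot_variant(ua), "<-", ua)
```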

In late-breaking news, all Googlebot requests for supporting files (css, js) this month included a referer (the page the file “belongs” to). It looks as if they started doing this in the middle of March 2017. A handful of Googlebot image requests also had a referer, but I believe these were all associated with a “Fetch as Googlebot” GSC action.

But wait, there’s more. Alongside the true Googlebot, there’s an ever-expanding list of other Googloid functions. (This list will not include the AdSense-related crawlers, though someone else might like to chime in with the relevant information.)

IP: 66.102.6-7 and 66.249.80-95
I don’t know what they do with the rest of 66.102.0-64. I have only once--ever--seen them outside 6-9, and rarely outside 6-7.

In alphabetical order:

Docs:

Mozilla/5.0 (compatible; GoogleDocs; apps-presentations; +http://docs.google.com)

Confession: I have no idea what this does. It only fetches images, and it’s very rare. Their web page leaves me none the wiser.

Favicon:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon

This UA has certainly matured over the years. Originally they sent no UA at all; later they called themselves Firefox 6, and since March of 2016 they’ve gone by Chrome/49. Unlike some search engines, Google doesn’t display a favicon next to each result; the favicon does show up whenever you list your sites in a Google property such as GSC or Profile and quite possibly others that I don’t know about.

Image Proxy:

Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)

SearchByImage:

Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.7; Google-SearchByImage) Gecko/2009021910 Firefox/3.0.7

Confession: I never knew this UA existed. Thanks to that Firefox/3, they have never seen anything but a 403. The UA, complete with “de” (barring a few ebooks, I have no German-language content), has existed since at least 2015. If they hadn’t come from a Google IP, I’d have assumed they were just another unwanted robot.

Translate:
This doesn’t have a UA of its own; it just appends “,gzip(gfe)” (with comma, without leading space) to the human UA string. The referer will be something involving “translate.googleusercontent.com”.
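
Since there is no separate UA to match, spotting these requests means checking both the suffix and the referer. A minimal sketch, with function names of my own choosing and a made-up browser UA and referer URL in the usage lines:

```python
TRANSLATE_SUFFIX = ",gzip(gfe)"
TRANSLATE_HOST = "translate.googleusercontent.com"

def via_google_translate(ua, referer):
    """True if the UA carries the suffix and the referer points at Translate."""
    return ua.endswith(TRANSLATE_SUFFIX) and TRANSLATE_HOST in referer

def underlying_ua(ua):
    """Strip the suffix to recover the human visitor's own UA string."""
    if ua.endswith(TRANSLATE_SUFFIX):
        return ua[:-len(TRANSLATE_SUFFIX)]
    return ua

if __name__ == "__main__":
    # Hypothetical browser UA and referer, for illustration only.
    human = "Mozilla/5.0 (Windows NT 10.0; rv:52.0) Gecko/20100101 Firefox/52.0"
    ua = human + TRANSLATE_SUFFIX
    referer = "https://translate.googleusercontent.com/translate_c"
    print(via_google_translate(ua, referer))
    print(underlying_ua(ua) == human)
```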

Web Preview:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36

Confession: Once again, color me puzzled. I remember a few years ago the SERPs always had an option for Preview, but I haven’t seen it in yoincks, so I have no idea what this UA currently does.

Formerly Known as Webmaster Tools

Site Verification:

Mozilla/5.0 (compatible; Google-Site-Verification/1.0)

Shows up periodically on any site that has a GSC (the former WMT) account.

Search Console:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Search Console) Chrome/27.0.1453 Safari/537.36

I first saw this UA in May 2016. I don’t know exactly how old it is, because it comes only in response to a specific action on your part: “Fetch and Render” in the Fetch As Googlebot section of GSC. Like most googloid functions it is not subject to robots.txt; casual experimentation shows that if you request a page in a roboted-out directory, it will do the fetch with this UA, but won’t show the “What a Human Sees” render.

If that UA seems familiar, it’s because Preview is identical. They must have a sentimental fondness for Chrome 27.

We Try Harder

Thanks to our friend the BLEXbot, Bing slips to #3 in the overall request count. But it’s pretty close.

IP: 40.77.167, 157.55.39, 207.46.13
Obviously these are not Bing’s full ranges. But all requests throughout the month were evenly divided between these three /24 sectors.

UAs:

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

:: pause for silent snicker at idea of Bing using an iPhone UA ::

Unlike Google, Bing uses the same UA for both pages and images. No request had a referer. The mobile UA accounted for about 15% of requests in all categories. The only exception is that the iPhone bingbot never asks for robots.txt under its own name.

About 10% of bing requests came from The Robot That Will Never Die:

msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

In spite of the “media” in the name, requests were exclusively for pages.

And then there’s Bing Preview. In addition to the three bing/msn crawl ranges, it also shows up from
65.55.210, 131.253.25-27, 199.30.24-25
UAs:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b

I’m not clear what this UA actually does. I don’t believe it is a true preview; the requests don’t come in packages (page, supporting files, images) like a human. It may be Bing’s version of a Mobile-Friendliness tester.

Unlike Google’s Site Verification, Bing’s wears plain clothes:

IP: 131.253
UA:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246

It never requests anything but /BingSiteAuth.xml

Meanwhile in the Czech Republic

(I read that they officially changed the name to Czechia, but everyone who lives there hates it, which would seem to be a drawback. The website says Czech Republic.) Seznam has always been fond of my site; not sure why, since human visitors sent by Seznam can be counted on your fingers.

IPv4: 77.75.76-79
IPv6: 2a02:598
UA:

Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)

This UA was rolled out in May 2016. Note the new About page, which is in English. Incidentally, Google Translate says that “seznam” means “list”.

I also found a scattered handful of:

Mozilla/5.0 (compatible; SeznamBot/3.2-test1-1; +http://napoveda.seznam.cz/en/seznambot-intro/)

which is probably exactly what it looks like, something experimental. It even asked for robots.txt :)

Yandex Carries On

This year, Yandex’s distinguishing trait was the sheer range of IPs they used.

IP: 141.8.143.141 (their hands-down favorite, down to the last /32); 5.255.250-253, 77.88.0-63, 100.43.64-95, 141.8.142-143; rarely 93.158.128-91, 199.21.96-99 (Yandex’s ARIN range)

UA:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

The imagebot was busy this month, accounting for about 2/3 of all requests.

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)

This UA was rare; it asked for pages and supporting files (css, js) but no images.

Yahoo! Slurp
IP: 68.180.228-230
UA:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

In March 2016, Yahoo! Slurp suddenly started requesting stylesheets, always with the appropriate page as referer (same as the Googlebot). On the other hand, they seem to have entirely stopped asking for images as of December 2016.

Mail.RU
IP: 217.69.133
UA:

Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)

Linking to a webpage in your UA string is generally considered A Good Thing--but, er, it only works if people can read Russian. Faute de mieux, I’ve always assumed they are a search engine. Rather a low-budget one: they do a biggish crawl every few months, at which point they show up on my Redirects lists requesting old pages that everyone else has already got sorted to their correct URL. Requests are almost exclusively pages. (Exceptions are interesting, but only if you know the site.)

Minor Players

Coccocbot
IP: 123.30.175
UA:

Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)

I believe this is a search engine. They’re definitely from Vietnam; their real name has a lot more diacritics. They’re a strong contender in the race for Highest Proportion of robots.txt requests: they generally request just one file at a time, and each is accompanied by robots.txt. Or possibly the category is Slowest Crawl, since they spent the whole month painstakingly collecting all the images that belong to one page. (It would have been two, but the images belonging to the other page that caught their fancy are in a roboted-out directory.)

Daumoa
IP: 203.133.168-171
UA:

Mozilla/5.0 (compatible; Daum/4.1; +http://cs.daum.net/faq/15/4118.html?faqId=28966)
Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1

I think they’re a Korean search engine. They first showed up in response to an RSS feed, but since then have started wandering further. They’ve got a few other user-agents, but the “faqId” one is their current favorite, accounting for about 90% of the month’s visits. That includes all robots.txt requests, even when the page request will use a different UA. On two occasions, the second UA has asked for piwik.js, which they’re really not supposed to. Humph.

Exabot
IP: 178.255.215
UA:

Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)

Robot of the French search engine Exalead. They must have noticed that my site has no French content, because they don’t come around much. Although they’re a search engine, and they periodically look at the xml sitemap, I don’t think they’ve ever done a full spidering; they come in and ask for specific pages.

DuckDuckGo
IP: 107.21.1.8
UA:

Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)

As I understand it, DuckDuckGo uses other robots’ crawl data and applies their own algorithm. So the only time I see them is when the Favicons-Bot comes by, indicating that I have come up in somebody’s search. I don’t know how often they re-crawl for this purpose; the closest together I’ve seen them is about 18 hours.

Special feature of this robot: All requests have an auto-referer, necessitating various hole-poking. In spite of the name, they request the root first; if the page is blocked they won’t request the favicon. (A bit funny in my case since everyone can see the favicon, barring one Italian referer-spam site because you gotta draw the line somewhere.)
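
By "auto-referer" I mean a request whose referer is the very URL being requested. If your access rules treat that as suspicious, a test along these lines could flag it so that a whitelisted robot can be let through. This is only a sketch; the function name and the example.com host are placeholders, and it compares host and path only, ignoring scheme and query string.

```python
from urllib.parse import urlsplit

def is_auto_referer(host, path, referer):
    """True if the referer names the same host and path as the request itself."""
    if not referer or referer == "-":
        return False
    ref = urlsplit(referer)
    # An empty referer path ("http://example.com") is treated as the root.
    return ref.hostname == host.lower() and (ref.path or "/") == path

if __name__ == "__main__":
    print(is_auto_referer("example.com", "/", "http://example.com/"))
    print(is_auto_referer("example.com", "/favicon.ico", "http://example.com/"))
```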

To be continued...

[edited by: lucy24 at 3:38 am (utc) on May 13, 2017]

[edited by: keyplyr at 10:28 pm (utc) on Jul 4, 2017]

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 (Applebot/0.1; +http://www.apple.com/go/applebot)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (FlipboardProxy/1.6; +http://flipboard.com/browserproxy)

NewsBlur Content Fetcher - 61 subscribers - http://www.newsblur.com/site/18645/new-online-books (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

