Brief background: On my site, new robots have to pass through an approval stage. It is not very exacting.
Step 1: Ask for robots.txt ... before you ask for any other file, including the root. If you don't ask, you had better have a very good reason for existing.
Step 2: Do what it says. Obviously a brand-new robot will not find its name in the Disallow list. But one roboted-out directory contains pages that are linked from the root. If you request any of those pages, you can be relatively confident you will never proceed to
Step 3: Once a robot has convinced me it intends to be compliant, it gets authorized, typically in the form of un-setting any violations it has committed (such as failing to send the Accept: header, or coming from an unsavory neighborhood).
Step 4: If an authorized robot gets in the habit of visiting regularly, week in and week out, it eventually goes on the Ignore list: when I process raw logs, its requests are disregarded. I know the Googlebot exists; I’m not especially interested in what, exactly, it does.
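The Ignore-list step amounts to a filter pass over the raw logs. A minimal sketch of the idea, assuming Apache "combined" log format; the ignore list itself is a made-up stand-in, not my actual list:

```python
import re

# Hypothetical ignore list: UA substrings of robots already vetted (Step 4).
IGNORED = ("Googlebot", "bingbot", "SeznamBot")

# Apache "combined" format: ip ident user [time] "request" status size "referer" "ua"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[.*?\] "(.*?)" (\d{3}) \S+ "(.*?)" "(.*?)"$')

def interesting(line):
    """Return the parsed fields, or None if the request should be disregarded."""
    m = LOG_RE.match(line)
    if m is None:
        return None  # unparseable line
    ip, request, status, referer, ua = m.groups()
    if any(bot in ua for bot in IGNORED):
        return None  # authorized regulars: out of sight, out of mind
    return ip, request, status, referer, ua
```

Anything that survives the filter is what actually gets looked at.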
This year’s article is about robots that have made it to Step 4, out of sight, out of mind.
When I say “abc” in an IP, it means that the final segment is always the same number, but I’ve obfuscated it.
1. US Search Engines
Google
IP: 66.249.64-79
Say what you will about G, they have done a phenomenal job of keeping all their crawling to a single /20. How on earth do they do it?
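That /20 claim is easy to check against a log line with Python's ipaddress module: 66.249.64.0/20 covers exactly 66.249.64.0 through 66.249.79.255. (A CIDR check is good enough for log analysis; for spoof-proof verification you would do a reverse-DNS lookup instead.)

```python
import ipaddress

# The range observed in this article: 66.249.64-79, i.e. 66.249.64.0/20.
GOOGLE_NET = ipaddress.ip_network("66.249.64.0/20")

def is_google_crawl_ip(ip):
    """True if the address falls inside the observed Googlebot /20."""
    return ipaddress.ip_address(ip) in GOOGLE_NET
```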
UA:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36
Googlebot-Image/1.0
Googlebot/2.1 (+http://www.google.com/bot.html)
Notes:
Googlebot-Image does what its name indicates. It is responsible for about 1/3 of all Google requests; it never sends a referer.
The mobile Googlebot does about 2/3 of what’s left, or about twice as many requests as the vanilla googlebot. Both Googlebots always send a referer when requesting scripts and stylesheets. By now, I think most robots do this, having figured out that a site may send different content depending on which page a stylesheet belongs to.
The Safari Googlebot first showed up in May 2018. It is comparatively rare--less than 2% of all Google requests--and is limited to supporting files, mostly images. It always sends a referer.
The final, shorter Googlebot UA--the one without “Mozilla”--made an isolated appearance last March (2019), but didn’t start showing up regularly until November. So far, it hasn’t picked up anything but PDFs.
There also exists a Googlebot-Video, but since I don’t have videos, I don’t know anything about its behavior.
bing
IP:
13.66.139
23.103
40.77
52.162.161, 52.240
65.55.210
131.253
157.54-60
191.232.136
199.30.16-31
207.46
This is by no means a complete list; they are merely the ones I have personally seen in the past year.
UA:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
The once-common msnbot-media seems finally to have retired. But, as discussed in other threads, msnbot suddenly reappeared in October 2019. The iPhone bingbot
:: insert appropriate ROFLMAO emoticon here ::
is quite rare compared to the mobile Googlebot, well under 10% of all requests.
UA: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
Bing Preview always comes from the 65.55 range. I don’t think it’s actually a preview; I think it’s more of an accessibility tester.
Yahoo! Slurp
Hanging on by a thread ...
IP:
68.180.228-31
72.30.14, 72.30.196-199
74.6.168, 74.6.254
216.252.126
217.146.176-191
UA: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
So rare, I frankly don’t know why it still bothers. I do still get the occasional image request giving Yahoo Search as referer.
2. National Search Engines
Listed here in order of overall frequency.
Czech Republic: Seznam
Year after year, active out of all proportion to its population. I don’t have a single word of Czech-language content.
IP:
77.75.76-79 and 77.75.73
2a02:598:a:: and 2a02:598:2::
Seznam is the only search engine I routinely see from IPv6. (This may not be wholly accurate, because only my personal site has an IPv6 address, and therefore it is the only one whose logs show IPv6 requests. Memo to self: See what interesting things happen if I give my primary site an IPv6 address. Heck, it’s free.)
UA: Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
It also has a Preview that I see fairly often, though who knows what it’s for:
UA: Mozilla/5.0 PhantomJS (compatible; Seznam screenshot-generator 2.1; +http://fulltext.sblog.cz/screenshot/)
Russia and Turkey: Yandex
For most people this is probably the best-known non-US-based search engine. It continues to have a staggering range of IPs:
5.45.192-255
5.255.192-255
37.140.128-191
77.88.0-63
84.201.128-191
87.250.224-255
93.158.128-191
95.108.128-255
100.43.64-95
130.193.32-71 (i.e. 32-63 and 64-71)
141.8.128-191
178.154.128-255
199.21.96-99
199.36.240-243
2a02:6b8:b000::/52
but, continuing a well-established habit, 2/3 of their requests come from (down to the last digit)
IP: 141.8.144.abc
UA:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)
As with bing, YandexMobileBot is pretty rare. It gets mostly pages and stylesheets, no images.
UA: Mozilla/5.0 (compatible; YandexAccessibilityBot/3.0; +http://yandex.com/bots)
Thanks to that Ignore list, I didn’t notice when this new UA showed up in March 2019.
Distinguishing feature: Yandex really likes HTTPS. Once they know that a given site is accessible securely, they will make almost all requests that way--even for pages that moved away long before the site went HTTPS.
Korea: Daum
IP: 203.133.168-171, rarely 203.133.174
(They own all of 203.133.160-191, but these are the only IPs I have ever seen for crawling.)
UA:
Mozilla/5.0 (compatible; Daum/4.1; +http://cs.daum.net/faq/15/4118.html?faqId=28966)
Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1
Technically Daumoa is not on my Ignore list, because they wear different hats. Originally I saw them responding to an RSS feed, but they have since expanded. They have a few other, similar user-agents.
When looking up and cross-checking between HTTP and HTTPS, I discovered Daumoa’s
Distinguishing feature: a great unwillingness to accept redirects. On the HTTP side there will be 4, 5 or 6 requests for a particular page within a few hours, all duly redirected; on the HTTPS there will be at most one request for the page.
Vietnam: CocCoc
IP: 103.131.71
They used to crawl from 123.30.175, but I haven’t seen it in over a year.
UA:
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)
Like Daumoa, I’m not actually ignoring CocCoc, because they have two categories of visit: following the RSS feed, and generalized crawling. Also like Daumoa, they have a
Distinguishing feature: a remarkable appetite for robots.txt, currently running at almost 2/3 (65%) of all their requests. On closer inspection, it turns out all those requests were coming in on the HTTP side of my personal site. (I don’t canonicalize robots.txt, as this seems to confuse some robots.) So Time Will Tell if the behavior becomes widespread, since my primary site just went HTTPS a few months ago.
Russia: Mail.RU
IP: 95.163.248-255, 217.69.143
UA: Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
They crawl by fits and starts, and are very slow about updating their shopping lists. Every few months there will be a blizzard of requests for pages that were redirected years ago.
And, yeah, I just assume they’re a search engine, although I don’t think I’ve ever met a human sent from there.
France: Exabot
IP: 178.255.215
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
I don’t see them very often, probably because I don’t have any French-language content worth mentioning, barring the odd phrase here and there. (Admittedly, this has never deterred Seznam.)
Korea: Yeti and Linespider
IP: 125.209.235, 203.104.154
UA:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.0 Safari/537.36 (compatible; Yeti/1.1; +http://naver.me/spd)
Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd)
Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
There have been one or more recent threads about Linespider, which you can consult for details. The shorter Yeti UA was active until October 2019, at which point it was replaced by the longer version. The two crawlers, Yeti and Linespider, currently work in tandem: Linespider gets pages; Yeti comes along immediately afterward and gets the stylesheets belonging to the pages.
Japan: ichiro
Another of those “Are you still around? I thought you were dead!” crawlers.
IP: 153.254.146.abc (down to the last digit)
UA: ichiro/3.0 (http://search.goo.ne.jp/option/use/sub4/sub4-1/)
In the past year-plus, I’ve only seen them once, making a longish visit in July 2019.
3. Unwanted Search Engines
If you’re from China, I don’t want you. (Reminder: YMMV. This is not what this thread is about.) Technically none of these are on my Ignore list--but only because they are universally blocked, so I don’t notice them anyway. Still, I should cover them for the sake of completeness.
Baidu
IP: 180.76.15; 123.125.71, 220.181.108
UA:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
A few years back, Baidu finally started honoring robots.txt ... at least on two visits out of three.
In the same way that “Googlebot” used to be popular with US-based spoofers, I get numbers of robots pretending to be “Baiduspider”--in fact there are more fakers than the real thing.
Sogou
IP: 36.110.147, 106.38.241, 106.120.173, 123.126.113, 220.181.124
UA: Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
I thought for a moment that Sogou had finally learned the meaning of “Disallow”, since all of its recent requests have been for robots.txt. But on closer inspection they have simply been swapping out UAs: “Oh, SogouSpider is disallowed? Then I’ll call myself an Android instead”.
Yisou
IP: 42.120.160-161, 42.156.136-139, 106.11.152-159
UA:
YisouSpider
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36
The shorter UA is used only for robots.txt requests.
4a. Miscellaneous
Applebot
IP: 17.58.101.abc, 17.58.96-103
Theoretically they could come from anywhere in 17, but this is where I’m currently seeing them.
UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
At one time there was also an iPhone version:
UA: Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 (Applebot/0.1; +http://www.apple.com/go/applebot)
but I haven’t seen it since 2017, so it must have retired.
Distinguishing feature: The unshakable belief that any URL ending in / is really an extensionless URL in disguise, leading to vast numbers of requests for /directory/subdir without the final slash. In fact, one directory now has an explicit directory-slash redirect to avoid chained redirects caused by canonicalization. Last time I counted, three-quarters of all slashless requests came from the Applebot.wq2a [Editorial comment presumably added by cat, but I’ll leave it for flavor.]
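For anyone facing the same chained-redirect problem, the explicit directory-slash rule can be a one-liner. A minimal .htaccess sketch, where /directory/subdir and the domain are stand-ins, not my actual paths:

```apache
# Hypothetical: send the slashless form straight to the canonical URL
# in one hop, so later canonicalization rules don't add a second redirect.
RedirectMatch 301 ^/directory/subdir$ https://example.com/directory/subdir/
```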
archive.org_bot
IP: 207.241.229-233; 2607:f298:5:105b:
At one time they also used 37.187.150, but I haven’t seen it since February 2019.
UA:
Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
Mozilla/5.0 (compatible; special_archiver/3.1.1 +http://www.archive.org/details/archive.org_bot)
The special_archiver version is only for supporting files--including less common things like fonts. The once-common Wayback Machine UA packed up at the end of 2018:
UA: Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot)
DotBot
IP: 216.244.66.abc and .def
UA: Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
Most robots are content with a single distinguishing feature or behavioral quirk. The DotBot has three. So far. #1: It just looooves robots.txt, even surpassing the long-time record holder, bing. #2: It also likes slashless URLs: filter out the Applebot, and 2/3 of what’s left will be the DotBot. And #3: It apparently wishes the world would stay HTTP. Almost alone among law-abiding robots, it will request pages at HTTP that only came into existence after the whole site went HTTPS, meaning that it cannot ever have seen an HTTP link to the page.
GarlikCrawler
IP: 185.26.92.abc
UA: GarlikCrawler/1.2 (http://garlik.com/, crawler@garlik.com)
At the time I counted up in logs, this robot’s visits totaled exactly 1000 over the time period in question. This pleases me.
ia_archiver
IP: 54.165.59.abc, 54.163.57.abc, 54.209.98.abc
Yes, I’m afraid those are AWS IPs. Go figure. Until I sat down to prepare this page, I did not notice that they had moved from 54.165 to 54.209 by way of 54.163. Bad luck for them, as this particular robot’s hole-poking was strictly IP-based.
UA: ia_archiver
See elsewhere about YMMV.
The Knowledge AI
IP: 66.160.140, 64.62.252
UA: The Knowledge AI
Distinguishing feature: Unable to do HTTPS. Over the years, as my sites move to HTTPS, it has picked up an increasing number of redirects, to the point where it can now retrieve no content except robots.txt. This does not, ahem, make them seem especially knowledgeable.
MojeekBot
IP: 5.102.173.abc
UA: Mozilla/5.0 (compatible; MojeekBot/0.6; +https://www.mojeek.com/bot.html)
This should possibly be listed among the search engines, but I’m sure I have never seen a human they sent, so I reserve judgement.
SemrushBot
IP: 46.229.168; 85.208.96; 213.174.146, .147, .152; 192.243.53
These are the addresses I’ve personally seen them from; most are part of a wider range. It’s one of the most active robots, with more requests than anyone but the biggest search engines.
UA:
Mozilla/5.0 (compatible; SemrushBot/1.0~bm; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot-BA; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot-SI/0.97; +http://www.semrush.com/bot.html)
Each IP is associated with some particular UA--if you ask, they’ll send you an exact list--but I’ve never bothered to keep track.
Distinguishing feature: The various UAs send, or fail to send, slightly different headers.
SEOkicks
IP: 95.216.96.170, 138.201.30.66, 136.243.89.157
UA: Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)
Distinguishing feature: Has never requested anything but robots.txt. If it is simply checking whether a site exists and can be reached, I guess it would do as well as anything. (When I’ve edited my htaccess and need to check for errors, I just open my test site’s robots.txt and confirm that I can reach it.) Otherwise I don’t know what’s up; there’s nothing in the Disallow list that sounds at all like their name.
TurnItInBot
IP: 38.111.147.abc, 199.47.87.abc
UA: TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
A plagiarism checker, although they do use robots.txt.
Distinguishing feature: They alphabetize their shopping list, at least on sites that are small enough to be visited in one fell swoop. From this I learn that their alphabetize function, like the one in SubEthaEdit, puts capital letters before lower-case letters.
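That capitals-first ordering is just plain byte-wise (ASCII/code-point) comparison: 'A'–'Z' are 65–90, 'a'–'z' are 97–122. Python's default string sort shows the same behavior (the page names here are invented):

```python
# A naive code-point sort puts every capitalized name before any
# lower-case one, exactly like TurnitinBot's shopping list.
titles = ["apple", "Banana", "cherry", "Apple"]
print(sorted(titles))  # → ['Apple', 'Banana', 'apple', 'cherry']
```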
4b. Miscellaneous, Distributed
These robots come in from what can be called The Usual Suspects: various spots in 18, 34, 35, 52, 54, et cetera, et cetera, you know the drill. AWS, Google Cloud, assorted other big server ranges. The robots in question may be distributed, or they may simply move around a lot.
Ahrefs
IP: distributed, but especially 54.36.148-150
UA: Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)
Requests mostly pages, with the occasional image.
Barkrowler
UA:
Barkrowler/0.7 (+http://www.exensa.com/crawl)
Barkrowler/0.9 (+http://www.exensa.com/crawl)
I don’t know if there was ever an 0.8; on my sites it moved from 0.7 to 0.9 early in 2019. IPs change from one visit to the next, but any one visit, which may be quite long, uses the same IP all the way through.
All of the above may need to go into the past tense; it was last seen in July 2019.
BLEXBot
IP: distributed, but especially 46.4 and 94.130
They used to come from 148.251.244.abc but currently you can meet them everywhere.
UA: Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
Distinguishing feature: A few years ago, they racked up vast numbers of 404s thanks to appending other sites’ URLs to my paths. Mercifully, they fixed the problem in 2017 and it hasn’t come back. Early in 2019 they did accumulate a few more 404s by requesting /pagename/ instead of /PageName/ (it’s an ancient URL) but they quickly got a grip.
CCBot
UA: CCBot/2.0 (http://commoncrawl.org/faq/)
One of a handful of authorized robots that still use HTTP/1.0. (Another is ia_archiver.)
Cliqzbot
UA: Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company/cliqzbot)
Their website begins by asking “Was genau ist Cliqzbot?”, taking the words right out of my mouth.
ExtLinksBot
UA: Mozilla/5.0 (compatible; ExtLinksBot/1.5; +https://extlinks.com/Bot.html)
Their small number of requests and robots.txt compliance put them in the “No skin off my nose” category. They may actually have retired; I last saw them in March 2019.
MauiBot
UA: MauiBot (crawler.feedback+beta@gmail.com)
In the entire year-plus for which I checked logs, they made a grand total of one visit, requesting robots.txt and a single page. They must have been more active in years past, or I would never have got as far even as authorizing, let alone ignoring them.
MJ12bot
UA: Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
Rare among robots, I have seen them not only from the usual array of IPv4 addresses but also at least three different IPv6 ranges.
SafeDNSBot
UA: SafeDNSBot (https://www.safedns.com/searchbot)
Last seen in September 2019, but were never very frequent anyway.
Uptimebot
UA: Mozilla/5.0 (compatible; Uptimebot/1.0; +http://www.uptime.com/uptimebot)
Referer: http://uptime.com/example.com
Last seen: June 2019
5. No Longer Welcome
The downside to putting a robot on the Ignore list is that you don’t notice if it starts misbehaving. In the past year there have been two of note:
Blackboard Safeassign
As far as I know, this robot has never in its life asked for robots.txt. At one time it got a free pass because it performs the somewhat useful service of plagiarism checking. But I recently had to put my foot down when it had an attack of requesting the same page up to 200 times in a few minutes--not just one page, but many different ones. These requests weren’t accompanied by an unusual number of human requests for the same pages--it’s generally easy to see when a school has assigned a particular title--which might have helped to explain it.
IP: 34.231.5.abc, 34.202.93.abc (very rarely others)
UA: Blackboard Safeassign
Distinguishing feature: All requests come in pairs, HEAD followed by GET.
Qwantify
This robot spent several years on the Ignore list. As a result, I did not immediately notice that it had been crawling directories that are plainly marked Disallow: in robots.txt. Not just once--everyone has the occasional accident--but repeatedly. Slam!
IP: 91.242.162, 194.187.170-171
UA:
Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/)/2.4w
Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)
Mozilla/5.0 (compatible; Qwantify/Bleriot/1.2.1; +https://help.qwant.com/bot)
Mozilla/5.0 (compatible; Qwantify/Mermoz/0.1; +https://www.qwant.com/; +https://www.github.com/QwantResearch/mermoz)
Qwantify/1.0
That last UA is only for the favicon--currently the only thing it is allowed to see. It was on extended vacation from mid-2018 until December 2019. I never consider a robot truly gone until at least two years have elapsed.
The four full-length UAs seem to be used in random alternation. I think Bleriot/1.2.1 is the newest UA on the list; it only showed up in October 2019.
And the winners are...
total requests:
Continuing a long-established pattern, bing wins hands-down, with more than twice as many requests as Google. In third place is Seznam, followed by--of all things--SemrushBot.
robots.txt, raw count:
Surprise! The bingbot is no longer in first place, though it will probably never drop out of the Top Five. In order:
Seznam
DotBot
bingbot
SemrushBot
Yandex
robots.txt as percentage of requests:
This one’s tricky, because a minor robot may request just one file per visit, days or weeks apart, resulting in a robots.txt proportion of around 50%. Noteworthy are:
SEOkicks 100%
coccocbot 65%
Among search engines with a high enough total to be worth counting:
Mail.RU 45%
Exabot 42%
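The percentage column comes straight out of the same log tally: robots.txt requests divided by total requests, per robot. A minimal sketch, assuming (user-agent, path) pairs already extracted from the logs; the sample data is invented:

```python
from collections import Counter

def robots_txt_share(requests):
    """Fraction of each robot's requests that were for /robots.txt."""
    total, robots = Counter(), Counter()
    for ua, path in requests:
        total[ua] += 1
        if path == "/robots.txt":
            robots[ua] += 1
    return {ua: robots[ua] / total[ua] for ua in total}

sample = [("SEOkicks", "/robots.txt"),
          ("coccocbot", "/robots.txt"),
          ("coccocbot", "/page.html"),
          ("coccocbot", "/robots.txt")]
print(robots_txt_share(sample))  # SEOkicks at 100%, coccocbot at 2/3
```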
That’s all folks. For now, anyway.