If a robot doesn’t respond to this, I try giving it a block to itself:
and see if that works any better. On very, very rare occasions a robot doesn’t understand the first form, but honors the second. (“Never attribute to malice that which can be adequately explained by stupidity.”) Far more often, it never intended to obey in the first place.
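To see how a rule-following parser treats the two forms, here is a minimal Python sketch using the standard library's urllib.robotparser. The bot names ExampleBot and OtherBot and the /private/ path are made up for illustration, and it assumes the "first form" is a shared group with several User-agent lines. A compliant parser gives the same answer either way, so a robot that honors only the second form is the one at fault.

```python
import urllib.robotparser

# Hypothetical robots.txt: two robots sharing one group.
SHARED_GROUP = """\
User-agent: OtherBot
User-agent: ExampleBot
Disallow: /private/
"""

# The same rule, but ExampleBot gets a block to itself.
OWN_BLOCK = """\
User-agent: OtherBot
Disallow: /private/

User-agent: ExampleBot
Disallow: /private/
"""

def is_allowed(robots_txt, agent, path):
    """Return True if a standards-following parser lets `agent` fetch `path`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

Calling is_allowed(SHARED_GROUP, "ExampleBot", "/private/page.html") and the same call with OWN_BLOCK should both come back False.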
The vanilla Googlebot accounts for about half of its crawls. Most of the rest are:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Then there’s the mobile version:
This UA, using Android, showed up on 14 April of 2016; the last appearance of the “iPhone” UA was on 18 April.
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Confession: I have no idea what this does. It only fetches images, and it’s very rare. Their web page leaves me none the wiser.
Mozilla/5.0 (compatible; GoogleDocs; apps-presentations; +http://docs.google.com)
This UA has certainly matured over the years. Originally they sent no UA at all; later they called themselves Firefox 6, and since March of 2016 they’ve gone by Chrome/49. Unlike some search engines, Google doesn’t display a favicon next to each result; the favicon does show up whenever you list your sites in a Google property such as GSC or Profile and quite possibly others that I don’t know about.
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)
Confession: I never knew this UA existed. Thanks to that Firefox/3, they have never seen anything but a 403. The UA, complete with “de” (barring a few ebooks, I have no German-language content), has existed since at least 2015. If they hadn’t come from a Google IP, I’d have assumed they were just another unwanted robot.
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:22.214.171.124; Google-SearchByImage) Gecko/2009021910 Firefox/3.0.7
Confession: Once again, color me puzzled. I remember a few years ago the SERPs always had an option for Preview, but I haven’t seen it in yoincks, so I have no idea what this UA currently does.
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36
Shows up periodically on any site that has a GSC (the former WMT) account.
Mozilla/5.0 (compatible; Google-Site-Verification/1.0)
I first saw this UA in May 2016. I don’t know exactly how old it is, because it comes only in response to a specific action on your part: “Fetch and Render” in the Fetch As Googlebot section of GSC. Like most googloid functions it is not subject to robots.txt; casual experimentation shows that if you request a page in a roboted-out directory, it will do the fetch with this UA, but won’t show the “What a Human Sees” render.
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Search Console) Chrome/27.0.1453 Safari/537.36
:: pause for silent snicker at idea of Bing using an iPhone UA ::
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
In spite of the “media” in the name, requests were exclusively for pages.
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b
It never requests anything but /BingSiteAuth.xml
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246
This UA was rolled out in May 2016. Note the new About page, which is in English. Incidentally, Google Translate says that “seznam” means “list”.
Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
which is probably exactly what it looks like, something experimental. It even asked for robots.txt :)
Mozilla/5.0 (compatible; SeznamBot/3.2-test1-1; +http://napoveda.seznam.cz/en/seznambot-intro/)
The imagebot was busy this month, accounting for about 2/3 of all requests.
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
This UA was rare; it asked for pages and supporting files (css, js) but no images.
Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)
In March 2016, Yahoo! Slurp suddenly started requesting stylesheets, always with the appropriate page as referer (same as the Googlebot). On the other hand, they seem to have entirely stopped asking for images as of December 2016.
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Linking to a webpage in your UA string is generally considered A Good Thing--but, er, it only works if people can read Russian. Faute de mieux, I’ve always assumed they are a search engine. Rather a low-budget one: they do a biggish crawl every few months, at which point they show up on my Redirects lists requesting old pages that everyone else has already got sorted to their correct URL. Requests are almost exclusively pages. (Exceptions are interesting, but only if you know the site.)
Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
I believe this is a search engine. They’re definitely from Vietnam; their real name has a lot more diacritics. They’re a strong contender in the race for Highest Proportion of robots.txt requests: they generally request just one file at a time, and each is accompanied by robots.txt. Or possibly the category is Slowest Crawl, since they spent the whole month painstakingly collecting all the images that belong to one page. (It would have been two, but the images belonging to the other page that caught their fancy are in a roboted-out directory.)
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)
I think they’re a Korean search engine. They first showed up in response to an RSS feed, but since then have started wandering further. They’ve got a few other user-agents, but the “faqID” one is their current favorite, accounting for about 90% of the month’s visits. That includes all robots.txt requests, even when the page request will use a different UA. On two occasions, the second UA has asked for piwik.js, which they’re really not supposed to. Humph.
Mozilla/5.0 (compatible; Daum/4.1; +http://cs.daum.net/faq/15/4118.html?faqId=28966)
Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1
Robot of the French search engine Exalead. They must have noticed that my site has no French content, because they don’t come around much. Although they’re a search engine, and they periodically look at the xml sitemap, I don’t think they’ve ever done a full spidering; they come in and ask for specific pages.
Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
As I understand it, DuckDuckGo uses other robots’ crawl data and applies their own algorithm. So the only time I see them is when the Favicons-Bot comes by, indicating that I have come up in somebody’s search. I don’t know how often they re-crawl for this purpose; the closest together I’ve seen them is about 18 hours.
Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)
[edited by: lucy24 at 3:38 am (utc) on May 13, 2017]
The third, minimalist UA is only for the favicon. The change from 2.3 to 2.4 happened around the 24th, with no overlap.
Mozilla/5.0 (compatible; Qwantify/2.3w; +https://www.qwant.com/)/2.3w
Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/)/2.4w
Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
“Was genau ist Cliqzbot?” Another of those targeted searches, I think. They are either distributed, or they sprawl so widely across 52 that there’s no telling where they really live.
Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company/cliqzbot)
DeuSu only understands Disallow if they’re given a section to themselves in robots.txt.
Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)
Although they only showed up once--robots.txt plus a page--Yeti deserves a look-in because this month marks the first sighting since ... drumroll ... July of 2014. That was at my old site, which they used to visit all the time; they’ve never before set foot on my main site. They used to change IP every year or so, but this one's been the same since mid-2013.
Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/bot)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)
(i.e. identical to the two Drake Holdings forms)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
I had no idea this Chrome UA existed; they’ve been getting a 403 on the grounds of “non-bingbot from Bing/MSN range”. They make the same set of three requests every few days: first /dir/subdir1/ and then, several hours later, /dir/subdir2 (without slash) immediately redirected to /dir/subdir2/ with slash. (Paradoxically, the malformed URL is what prevents it from being blocked at the outset; it’s a very narrowly constrained RewriteRule.) Turns out this has been going on--always the same set of three--since mid-September.
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
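The “non-bingbot from Bing/MSN range” idea can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual server rule (which lives in a RewriteRule). The range below is a documentation-only placeholder, and BING_RANGES and status_for are names invented for the example.

```python
import ipaddress

# Placeholder range (a documentation-only block); substitute the
# Bing/MSN ranges you have verified yourself.
BING_RANGES = [ipaddress.ip_network("203.0.113.0/24")]

def status_for(ip, user_agent):
    """Return 403 for a non-bingbot UA arriving from a listed range, else 200."""
    addr = ipaddress.ip_address(ip)
    in_range = any(addr in net for net in BING_RANGES)
    if in_range and "bingbot" not in user_agent.lower():
        return 403
    return 200
```

A Chrome-looking UA from inside the listed range gets the 403; the same UA from anywhere else, or a bingbot UA from inside the range, passes.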
Probably distributed; I counted eight different ranges this month. But all robots.txt requests come from just two IPs, 126.96.36.199 and 188.8.131.52, which may explain why their website says that robots.txt changes can take up to a week to be recognized. In spite of this, they’ve never asked for anything in a roboted-out directory. Requests are mostly pages, with a few seemingly random images mixed in.
Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)
The iPhone version is rare. Has anyone ever figured out what this robot does? Some people may remember “If robots instructions don’t mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.” The Applebot is not the only robot to adhere to this quaint misapprehension.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 (Applebot/0.1; +http://www.apple.com/go/applebot)
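The quoted fallback rule is easy to express in code. The Python sketch below is a guess at the logic as the quotation describes it, not Apple's implementation; named_agents and applebot_may_fetch are invented names. It shows why a directory closed only to Googlebot ends up closed to Applebot as well.

```python
import urllib.robotparser

def named_agents(robots_txt):
    """Collect the lowercased names on every User-agent line."""
    names = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if line.lower().startswith("user-agent:"):
            names.add(line.split(":", 1)[1].strip().lower())
    return names

def applebot_may_fetch(robots_txt, path):
    """Evaluate `path` the way the quoted Applebot rule describes."""
    names = named_agents(robots_txt)
    # Fall back to Googlebot's rules only when Applebot is not named.
    if "applebot" not in names and "googlebot" in names:
        agent = "Googlebot"
    else:
        agent = "Applebot"
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

If you want different rules for the two robots, the way out is to name Applebot explicitly, at which point the fallback no longer applies.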
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
“BLEXBot assists internet marketers to get information on the link structure”
I am not absolutely certain it is wise to put the word “BLEXBot” and “link structure” into the same sentence. Surprisingly, only about 1/6 of the month’s requests were 404s caused by appending other people’s URLs to my paths--a pervasive problem that others have noticed too. I would have guessed closer to 95%. It looks as if they’ve had the problem since December 2016, but it has been getting worse.
BUbiNG is a scalable, fully distributed crawler, currently under development and that supersedes UbiCrawler.
Although the two IP ranges belong to different hosts, there are no major differences in their behavior. (UbiCrawler must have been before my time; I find no record of it.) I put them in the “No skin off my nose” category.
Psst! DotBot! It’s tidier when the URL in your UA string doesn’t redirect--especially not to an entirely different domain. (They are not the only ones.)
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, firstname.lastname@example.org)
It would be very interesting to know what they’re looking for, since several requests were for obscure interior non-English-language pages that, to the best of my knowledge, are not linked from anywhere in the known universe.
Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)
SiteExplorer is one of the rare robots that only understand Disallow if they’re given a sector to themselves in robots.txt, so I initially thought they were non-compliant. Later evidence suggests that they’re just very, very slow on the uptake: in the course of the month they picked up robots.txt eleven times, but never screwed up the courage to ask for a page until almost the end of the month.
Mozilla/5.0 (compatible; SiteExplorer/1.1b; +http://siteexplorer.info/Backlink-Checker-Spider/)
I think their crawling happens on the fly: robots.txt, two forms of root--one of which gets a 301--and then all other pages, from top to bottom, with the same referer a human would send. In the rare case that a page is linked from widely separated directories on the same site, the referer is whichever one the robot saw first. Since they don’t come in with a shopping list, there are never any 301s or 410s. This makes it useful for record-keeping purposes: Count the number of requests, subtract two, and that’s how many visible URLs you’ve got ;)
Mozilla/5.0 (compatible; spbot/5.0.3; +http://OpenLinkProfiler.org/bot )
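The request-count arithmetic above fits in one small function. This Python sketch assumes each log line contains the UA string somewhere in it; visible_url_count and its ua_token parameter are names made up for the example. The two subtracted requests are robots.txt and the form of the root that gets the 301.

```python
def visible_url_count(log_lines, ua_token="spbot"):
    """Count requests mentioning `ua_token`, minus robots.txt and the redirected root."""
    hits = sum(1 for line in log_lines if ua_token in line)
    # Never report a negative count if the robot made fewer than two requests.
    return max(hits - 2, 0)
```

It only makes sense for a single complete crawl, so feed it the log lines for one visit rather than a whole month.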
I did say I didn’t have very high standards when it comes to authorizing robots. They were very active in the latter months of 2016; in April they only showed up once. They’re only interested in one directory: mostly pages, but the occasional stylesheet, and sometimes the first image on a page--regardless of whether it’s a full-color frontispiece or a little icon from the navigation banner.
Mozilla/5.0 (compatible; Uptimebot/1.0; +http://www.uptime.com/uptimebot)
May be following outside links, though they did once look at the sitemap. Their web page says they’re Nutch-based, which may explain their compliance with robots.txt--not something you see every day from the 54 neighborhood.
Flipboard 1.2 is for robots.txt and pages, 1.6 is for images. For each new file, they request the HTML a few times, and the associated images just once. They don’t seem to be interested in stylesheets.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (FlipboardProxy/1.6; +http://flipboard.com/browserproxy)
The name is short for OMG I Love It. Unfortunately, I am not making this up. Check for yourself. This is a recent arrival; I only started seeing it earlier this year. It picks up a page when it first learns about it, and then comes back every week or so for the same page.
I have not yet figured out the significance of the shorter UA. It made its first appearance in the latter part of 2015, but doesn’t seem to be a direct replacement of any other UA, and doesn’t have a clearly recognizable function. The formerly common visionutils has not been around since April 2016.
Uncharacteristically for a social-media-based robot, the Twitterbot asks for and obeys robots.txt. And, like the Googlebot, it never forgets. This month’s requests included an URL that I remember seeing on Redirect lists, meaning that they first learned about it no later than December 2013.
The IP varies, but it’s always the identical UA string from beginning to end, so it’s not just an extra bit tacked on to the end of a human browser.
UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0
The pattern of requests suggests that most of them are the same robot, in spite of coming from all over the map.
Sometimes humans wear this face too, but more often it’s a robot.
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
[sic] What’s better than faking your UA? Claiming to be something that would be banned in its own right.
Currently it doesn’t seem to be interested in much but /ebooks/, which strongly suggests it is picking up links from some outside source. Requests but ignores robots.txt, where it is denied by name.
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
suggesting that they all started with the same script. Unfortunately, the identical UA is still in use by humans. But many others had no UA, making for a convenient Shoot To Kill twofer.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Even then, it hedged its bets; the “gocrawl” version alternated with
Googlebot (gocrawl v0.4)
which counts as “could be better, could be worse” among humanoid UAs. All blocked, so no skin off my nose in any case.
Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2
Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.paper.li/entries/20023257-what-is-paper-li)
Yes, it really says “61 subscribers” right there in the UA string. At least this month.
NewsBlur Content Fetcher - 61 subscribers - http://www.newsblur.com/site/18645/new-online-books (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
In the past they have had other, similar UAs, but for now they seem to have settled on this form. Each time they are presented with a new title, they request it over and over for a week or so and then lose interest.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-nikon3)
Some robots request robots.txt only after getting the front page; this is one of them. But, since they proceeded to ask for the entire contents of a roboted-out directory, it becomes pretty academic.
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
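To check after the fact whether a robot wandered into a roboted-out directory, you can replay its requests against your robots.txt. This is a minimal Python sketch with urllib.robotparser; the function name roboted_out_requests is made up, and you would supply your own robots.txt text and the list of paths pulled from your logs.

```python
import urllib.robotparser

def roboted_out_requests(robots_txt, agent, requested_paths):
    """Return the requested paths that robots.txt disallows for `agent`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Keep the original order so the result lines up with the log.
    return [p for p in requested_paths if not parser.can_fetch(agent, p)]
```

An empty list means the robot stayed out of everything it was told to stay out of, whatever order it asked for robots.txt in.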
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:184.108.40.206) Gecko/20070725 Firefox/220.127.116.11 - James BOT - WebCrawler http://cognitiveseo.com/bot.html
Mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
I first became aware of this robot when it showed up requesting new ebooks, but it turns out it also likes the page linked from my profile here; I just never noticed because it was always blocked for one reason or another.
ltx71 - (http://ltx71.com/)
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0 (NetShelter ContentScan, contact email@example.com for information)
Mozilla/5.0 (compatible; WBSearchBot/1.1; +http://www.warebay.com/bot.html)
Crawls URLs it finds listed on other sites--including but not limited to the one given in my WebmasterWorld profile. As far as I can tell, the name is meant as a joke, not as referer spam. Ancient history: Someone from this exact IP, though using a different name, asked for robots.txt on 24 November 2014. Apparently they’re still assimilating its contents; they’ve never asked for a fresh copy.
Mozilla/5.0 (compatible; Gluten Free Crawler/1.0; +http://glutenfreepleasure.com/)
Request: HEAD /
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
(always with incorrect www) They’re persistent, I’ll give them that; every few days they’re rattling the doorknob again.
I don’t know and don’t especially care whether this is the actual Xenu; all I know is, I didn’t order it. (I don’t know about Xenu’s ordinary behavior. The w3 link checker requests robots.txt on each site that it visits, and goes away weeping if it doesn’t find authorization.)
Xenu Link Sleuth/1.3.8
And your point is...? Only that the element “User-Agent:” is part of the UA string. I also met a lone
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
which similarly failed to read the Your New Robot instructions carefully enough.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
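The telltale signs mentioned here (no UA at all, the header name pasted into the value, version placeholders never filled in) are simple to test for. The Python sketch below is illustrative only; ua_red_flags and its flag labels are invented, and the "x.x" check is a crude substring match that may need tightening for real traffic.

```python
def ua_red_flags(user_agent):
    """Return a list of reasons a UA string looks like a careless script."""
    flags = []
    ua = user_agent.strip()
    if not ua:
        flags.append("empty")
    if ua.lower().startswith("user-agent:"):
        # The header name was pasted into the header value.
        flags.append("header name inside value")
    if "x.x" in ua:
        # Template version numbers were never replaced.
        flags.append("unfilled version placeholder")
    return flags
```

An ordinary browser UA should come back with an empty list.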
[edited by: lucy24 at 4:16 am (utc) on May 13, 2017]
in the “Extra Stupid” category is:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
And your point is...? Only that the element “User-Agent:” is part of the UA string
this report is specific to the author's experience