At Home with the Robots: 2019 edition

         

lucy24

12:51 am on Mar 13, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Awright, let’s see if this one can fit into a single post ...

Instead of looking at all robots everywhere, this time I’m focusing on a specific behavior. These robots’ only unifying feature is that they monitor new listings from Highly Reputable Directory--there’s a web page and also an RSS feed--and swing by to check out the most recently listed page. Now, obviously no two directories or RSS feeds have exactly the same list of robotic followers, but it gives an idea of what’s out there.

Unless otherwise noted, the robot’s normal pattern is to request robots.txt (the compliant ones, that is) and then a single copy of the latest page. Rarely they will then engage in other behavior, but the directory listing is always the trigger.

IP: When I say The Usual Suspects, it means various spots in 18, 34, 35, 52, 54, et cetera, et cetera, you know the drill. AWS, Google Cloud, assorted other big server ranges. The robots in question may be distributed, or they may simply move around a lot. A trailing “.abc” means that the robot always uses some exact IP, which I’ve obfuscated.
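The IP shorthand used in the entries below (a lone leading octet, a partial address, a range in the last listed octet, or a trailing “.abc”) can be checked against a real address with something like this minimal Python sketch. The function name ip_matches is hypothetical, and treating “.abc” as a wildcard is an assumption of the sketch: in the listings it stands for one fixed, hidden value.

```python
def ip_matches(shorthand: str, ip: str) -> bool:
    """Return True if a dotted IPv4 address fits a shorthand such as
    "54", "185.25.32", "107.178.194-195" or "85.93.91.abc"."""
    octets = ip.split(".")
    parts = shorthand.split(".")
    if len(parts) > len(octets):
        return False
    for part, octet in zip(parts, octets):
        if part == "abc":
            # Obfuscated octet: the sketch accepts anything here (assumption).
            continue
        if "-" in part:
            # Inclusive range in one octet, e.g. "194-195".
            low, high = part.split("-")
            if not int(low) <= int(octet) <= int(high):
                return False
        elif part != octet:
            return False
    return True
```

Only the octets given in the shorthand are compared, so "54" matches anything beginning 54. and nothing beginning 5.4.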

Referer: Obviously a robot’s referer doesn’t have the same meaning as a human referer; the robot didn’t click a link. But all the same it’s thoughtful of them to tell me how they found out about the URL. Throughout this page, “RSS” means http://example.edu/newrss.xml. (There is also http://example.edu/new.html, but this is more often used by humans.)

Last seen: When I give a “last seen” date, it means that the robot hasn’t shown itself lately--the latest listing was just a few days ago--but I’ve not yet transferred it to the Inactive list.

1. OK by Me

Admittedly my standards are not exacting. Ask for robots.txt--before you meet your first 403, ahem, not several seconds after--and give some indication that you mean to follow it. (Editorial comment: I think it’s short-sighted to ask if the robot can benefit me, personally. Maybe it won’t, but maybe it will benefit someone else engaged in a legitimate activity, and then in turn I might derive benefit from robots that visit other people’s sites.)
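Read literally, the first half of that criterion is mechanical enough to test against a log. A minimal Python sketch, assuming one robot's requests have already been pulled out as (path, status) pairs in time order; the function name and the input shape are hypothetical, and the "means to follow it" half is not checked at all:

```python
from typing import Iterable, Tuple


def asked_before_first_403(requests: Iterable[Tuple[str, int]]) -> bool:
    """requests: (path, status) pairs for one robot, oldest first.

    True if /robots.txt is requested before the robot meets its
    first 403; False if a 403 comes first or robots.txt is never
    requested.
    """
    for path, status in requests:
        if path == "/robots.txt":
            return True
        if status == 403:
            return False
    return False
```

This says nothing about whether the robot then obeys what it read; that still takes a look at what it requests afterward.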

DeuSu

IP: 85.93.91.abc
UA: Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)
Last seen: March 2018

FlipboardProxy

IP: distributed (Their web page says “the Amazon EC2 cluster”, otherwise known as The Usual Suspects.)
UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
  Mozilla/5.0 (compatible; FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (FlipboardProxy/1.6; +http://flipboard.com/browserproxy)

The 1.2 UAs generally come in pairs: the long version without referer, the short version with RSS as referer. Generally it stops by twice in the course of a day or two. The 1.6 UA is used only for images, picking up a single copy of everything associated with the file that its siblings have most recently collected. (It never gets scripts or stylesheets.)

Laserlikebot

IP: variously 35 and 104
UA: Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Laserlikebot/0.1)
Last seen: February 2019

Sometimes it picks up one image associated with the current page, but generally only the page. If there are other new pages linked with the latest one (for example a book with chapters), it will pick those up a few minutes later.

Magpie Crawler

IP: changes periodically; currently 185.25.32, 185.25.35
UA: magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
Referer: RSS

robots.txt requests use the different UA “robots” (and-that’s-all).

MBCrawler

IP: 174.129.1.abc; 54.83.5.abc
UA: MBCrawler/1.0 (https://monitorbacklinks.com)

Any given visit will use either one IP or the other, at random. Behavioral quirk: Every request, including robots.txt, is preceded by a HEAD request for the same file. (This strikes me as more trouble than it’s worth.)
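The HEAD-before-everything quirk is easy to confirm from a log. A minimal Python sketch, assuming one visit's requests are available as (method, path) pairs in time order; the function name and the input shape are hypothetical:

```python
def head_precedes_every_get(requests):
    """requests: (method, path) pairs for one visit, oldest first.

    True if the visit contains at least one GET and every GET is
    immediately preceded by a HEAD for the same path.
    """
    previous = None
    saw_get = False
    for method, path in requests:
        if method == "GET":
            saw_get = True
            if previous != ("HEAD", path):
                return False
        previous = (method, path)
    return saw_get
```

A visit with no GET at all returns False, so a lone HEAD does not count as a match.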

omgili

IP: 62.90.131.abc (since December 2018 only; previously it was 82.166.195.abc)
UA: omgili/0.5 +http://omgili.com
Last seen: February 2019

Sad but true: The name is short for “oh my god I love it”. And there’s not a thing we can do about it.

Rogerbot

IP: 209.133.111; 207.126.118
UA: rogerbot/1.0 (http://moz.com/help/pro/what-is-rogerbot-, rogerbot-crawler+shiny@moz.com)

Toshiba Digital Solution

IP: various in 13, 18, 52
UA: Toshiba Digital Solution TDSL/Nutch-1.8
Last seen: July 2018

This robot was only visible for a few months. Shrug.

trendiction

IP: 144.76
UA: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.0; trendictionbot0.5.0; trendiction search; http://www.trendiction.de/bot; please let us know of any problems; web at trendiction.com) Gecko/20071127 Firefox/50.03.0.0.11
Referer: as if human

Until June 2018, the UA ended in Firefox/3.0.0.11 (really) but wiser counsels must have prevailed. Distinctive behavior: After getting robots.txt, new page, and the first 11 images (always, I counted) belonging to that page, it then gets the front page plus all authorized pages linked from it except the page that started its visit. The initial page request has a truthful referer; image requests name the relevant page as referer; the subsequent spider-type requests have no referer.

VenusCrawler

IP: 68.74; 76.14
UA: VenusCrawler/Nutch-1.12 (crawler@mycompany.com)
Last seen: June 2018

I did say that my standards are not very exacting.

yacybot

IP: distributed, almost never the same one on different visits
UA: yacybot (/global; amd64 {variable-part-here} ) http://yacy.net/bot.html
Referer: often but not always “new”; there are many others
Last seen: April 2018

What I’ve given as {variable-part-here} can be absolutely anything. The two most recent, for example, are
Linux 4.13.0-32-generic; java 1.8.0_151; Australia/en
Windows 2003 5.2; java 1.8.0_131; Europe/en
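Since the variable part rules out an exact string match, a pattern is the obvious way to recognize this UA. A minimal Python sketch; the names YACY_UA and yacy_variable_part are hypothetical, and the exact spacing around the closing parenthesis is an assumption, so the pattern tolerates optional whitespace there:

```python
import re

# Fixed prefix and suffix as given in the UA line above; the middle
# is captured as-is. Whitespace before ")" is treated as optional.
YACY_UA = re.compile(
    r"^yacybot \(/global; amd64 (?P<variable>.*?)\s*\) http://yacy\.net/bot\.html$"
)


def yacy_variable_part(ua: str):
    """Return the variable part of a yacybot UA, or None if no match."""
    match = YACY_UA.match(ua)
    return match.group("variable") if match else None
```

Fed the first example above, wrapped in the fixed prefix and suffix, it returns the "Linux 4.13.0-32-generic; java 1.8.0_151; Australia/en" part unchanged.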

2. No Thanks

And then there are the robots that don’t meet my exceedingly lax criteria, most of the time because They Didn’t Even Ask. The astute reader will notice that this list is longer than the preceding one.

In some cases I had to consult logs to see what certain robots have been up to, because after a time I get tired of tracking blocked requests and just ignore them. Like the man said, unwanted robots ye shall always have with you.

A6-Indexer

IP: The Usual Suspects
UA: A6-Indexer
Last seen: August 2018

Asks for robots.txt every few months, but ignores it.

AppEngine

IP: 107.178.194-195
UA: any and all of these, and probably others after I stopped paying attention
  AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-social)
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-social)
  Feedly/1.0 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-nikon3)
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-nikon3)
  Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US) AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)
Last seen: still active, but I’ve been ignoring it for over a year since it is always blocked

Requests each new page 10-12 times and then goes away.

djbot

IP: 54.201.3.abc
UA: djbot/1.1 (+http://www.demandjump.com/company/about)

Requests each new page exactly twice, about 5 minutes apart.

Embedly

IP: variable, but mostly 54.204
UA: Mozilla/5.0 (compatible; Embedly/0.2; +http://support.embed.ly/)
  Mozilla/5.0 (compatible; Embedly/0.2; snap; +http://support.embed.ly/)
Last seen: January 2019

Along with the page, also requests the favicon.

Feedspotbot

IP: 54.186.248.abc
UA: Mozilla/5.0 (compatible; Feedspot/1.0 (+https://www.feedspot.com/fs/fetcher; like FeedFetcher-Google)
Referer: Feedspotbot: http://www.feedspot.com

Yes, it puts its own name in the Referer slot. It adopted the current UA in late 2018; before that it was
Mozilla/5.0 (compatible; Feedspotbot/1.0; +http://www.feedspot.com/fs/bot)
Some years ago, it even claimed to be Firefox/2.

gocrawl

IP: The Usual Suspects
UA: Googlebot (gocrawl v0.4)
  Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2
Last seen: December 2018

The gocrawl UA is for robots.txt requests, the Firefox one for the page. Shamefaced admission: Almost all robots.txt requests from this agent are blocked, and I cannot for the life of me figure out why, since I poke holes six ways from Sunday.

Googlebot-Compatible

IP: various, but mostly 54; 23.21.191.abc
UA: Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Googlebot-Compatible Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8

Most easily recognized by its pattern: a single HEAD request from 23.21 with the “lucid” UA, followed by two GET requests: one from some other IP with the same UA, and then one from that second IP with some vaguely humanoid UA such as Firefox 15 or--most recently, and most implausibly--Firefox 4.

Grammarly

IP: The Usual Suspects
UA: Grammarly/1.0 (http://www.grammarly.com)
Last seen: Still active, but I’ve been ignoring it for over a year.

Requests new pages up to half a dozen times, alternating with older pages it has met in the past. Whether new or old, it always requests the same page twice, a second or so apart.

Leaf/Darwin

IP: 99.61.71.abc
UA: Leaf/28 CFNetwork/{1.2.3} Darwin/{1.2.3} (x86_64)
Last seen: September 2018

I generally associate Darwin with image requests, but this one asks for pages. The part I’ve given as {1.2.3} (twice) changes periodically; most recently it was 811.9/16.7.0

Metadataparser

IP: mostly 52 and 54
UA: metadataparser/1.1.0 (https://github.com/bloglovin/metadataparser)

Well, at least it only requests each page once.

PaperLiBot

IP: 37.59.19; 37.187.162-167
UA: Mozilla/5.0 (compatible; PaperLiBot/2.1; https://support.paper.li/entries/20023257-what-is-paper-li)
Last seen:

Like so many of us, it changed from http to https in its UA string a year or so back. Web page says, quote: “Paper.li is a content curation service that let's you turn socially shared content into beautiful online newspapers and newsletters.” It would, however, be an untruth to say that I block them purely because of the grocer’s apo’strophe.

Slackbot

IP: The Usual Suspects
UA: Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)
Last seen: December 2018

TweetmemeBot

IP: 46.236.24-26; 185.20.6 in random alternation
UA: Mozilla/5.0 (TweetmemeBot/4.0; +http://datasift.com/bot.html) Gecko/20100101 Firefox/31.0
Last seen: October 2018

I do not know how this robot pronounces its name. I always read it as “Tweet me! me!” but I suppose it’s really just “meme”.

uMBot

IP: 94.130.67.abc (since February 2018; has used others in the past)
UA: Mozilla/5.0 (compatible; um-FC/1.0; mailto: techinfo@ubermetrics-technologies.com)
Last seen: September 2018

Asks for robots.txt about as often as it asks for pages, but I can’t think why, since it happily ignores any Disallow it sees.

Web spyder

IP: 46.29.103.abc
UA: Web spyder
Last seen: July 2018

WordPress

This is an umbrella heading for assorted robots from assorted places, all with some form of “WordPress” in their names. The most active currently:

IP: 67.20.76.abc
UA: WordPress/{number}; http://www.johnjasonfallows.com
Referer: auto (that is, requests for /dir/page.html give http://example.com/dir/page.html as referer)

The {number} part changes periodically; in the time this UA has been in use, it has inched along from 4.9.3 to 5.1. Requests are always in triplicate, HEAD GET HEAD for the same page. A further quirk is that even though this robot’s own requests are blocked, the page in the UA often shows up as referer in requests from one of the law-abiding robots listed earlier. (Look, John Jason, all you have to do is ask.)
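The “auto” referer described in this entry, where a request names its own URL as referer, can be spotted with a simple comparison. A minimal Python sketch; the function name is hypothetical, the default host is the example.com placeholder from the Referer line, and the scheme is deliberately ignored:

```python
from urllib.parse import urlsplit


def is_auto_referer(request_path: str, referer: str, host: str = "example.com") -> bool:
    """True if the referer is this site's own URL for the requested path."""
    parts = urlsplit(referer)
    return parts.netloc == host and parts.path == request_path
```

A human can't normally arrive at a page from itself, which is why this pattern stands out in a log.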

Zauba Crawler

IP: The Usual Suspects
UA: Zauba Crawler/1.0 (Zauba Search for Research; http://www.zauba.io/; admin@zauba.io)
Last seen: April 2018

Dot io, incidentally, is British Indian Ocean Territory. You betcha. This robot was wildly active in March-April of last year in spite of never once meeting anything but a 403, and then disappeared as quickly as it had appeared.

ZuperlistBot

IP: 18.232.107.abc (current, has used others in the past)
UA: Mozilla/5.0 (compatible; ZuperlistBot/1.0)
Last seen: December 2018

Humanoids

Remember the old days when you could check for a leading “Mozilla” in the UA string, and be confident that the visitor was human? That was then. This is now. Almost everything here is blocked, though it’s no longer quite so easy. Some of these fall into the “just for ###s and giggles” category.

“google.com” referer

IP: 5.39.49
UA: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0
Referer: google.com (and-that’s-all)
Last seen: still active, but I’ve been ignoring it

Although it has a recurring IP and consistent UA, this robot is most easily recognized by the extreme bogusness of its referer.
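What makes this referer bogus is that it is a bare hostname with no protocol, which a real browser would not send. A minimal Python sketch of that one check; the function name is hypothetical, and accepting only http and https as real schemes is an assumption of the sketch:

```python
def is_bare_referer(referer: str) -> bool:
    """True for a non-empty referer that lacks an http(s):// prefix,
    such as "google.com" on its own."""
    if not referer:
        return False  # no referer at all is a different case
    return not referer.lower().startswith(("http://", "https://"))
```

An empty referer is deliberately not flagged, since plenty of legitimate requests arrive without one.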

Chrome/68

IP: 194.249.231.abc
UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Referer: RSS

This exact UA also shows up with other robots, but here it’s consistent.

Firefox/1.5

IP: 95.87.154.abc
UA: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Referer: RSS
Last seen: July 2018

Firefox/3.6b5

IP: 107.23.92.abc
UA: Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2b5) Gecko/20091204 Firefox/3.6b5
Last seen: December 2018

Firefox/49

IP: various, especially Google Cloud ranges
UA: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0
Last seen: November 2018

Iceweasel/3

IP: 78.46.38.abc
UA: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.3) Gecko/2008092814 Iceweasel/3.0.3 (Debian-3.0.3-3)
Last seen: June 2018

Tablet PC

IP: various, but mainly 182.118.20
UA: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)
  Mozilla/5.0 (Linux; U; Android 5.0.2; zh-CN; Redmi Note 3 Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 OPR/11.2.3.102637 Mobile Safari/537.36
Referer: variously auto (example.com/exact-requested-page) or root (example.com/ alone)
Last seen: August 2018

At one time this was a very vexatious robot, because it was tricky to block and almost impossible to distinguish from more-or-less-legitimate humans with the same IP or UA. I’ve been ignoring it for a while, and mercifully it looks like it finally got bored and went away.

That’s all folks. For now, anyway.

Today I Learned: That this site’s auto-linking is suppressed if the http(s) is immediately preceded by a + sign, as it generally is in UA strings. Wheee!

tangor

2:20 am on Mar 13, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As always, you provide some of the best analysis of log activity and the "threats" out there.

Thanks, lucy24!

not2easy

3:52 am on Mar 13, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Every now and then when I'm trying to check out some rare find in the logs I wind up on one of these "At Home With the Robots.." threads. Thank you lucy24 for the documentation of the lifestyles and habits of these crawlies! Many of these I have never met but if or when I do, I'll thank you again.