
Not at Home to the Robots: 2017 edition


lucy24

1:57 am on Apr 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And now for something--well--a little bit different.

Like most people, I’ve got a test site. It serves the usual test-site purposes: If I’ve learned a new word of php, I try it here first. If I’ve changed a comma in my shared htaccess, I request a page here to make sure nothing has melted. That kind of thing.

The site contains a few visible pages to amuse passing humans, because no name is so impossible that someone won’t sooner or later type it in by accident. But mainly it’s got directories with improbable names that robots won’t guess.

The whole thing is 99% roboted-out (the other 1% being things like the Twitterbot, on account of those human-accessible pages). Since it’s a dot com, its name is knowable. And since it’s got an IPv6 address, logs show both forms of visit. (IPv4-only sites log only IPv4 addresses, which means I myself show up with two different addresses, depending on which site I’m visiting.) The whole thing is http, but it does have the other usual redirects--stripping index.html and domain-name canonicalization.
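For concreteness, the two redirects mentioned above typically look something like this in htaccess. This is a hedged sketch only--example.com and the no-www preference are stand-ins, not the actual site's values:

```apache
# Hypothetical sketch of the usual pair of redirects.
# example.com is a placeholder, not the real test site.
RewriteEngine On

# 1. Strip index.html so the bare directory URL is canonical
RewriteRule ^(.*/)?index\.html$ http://example.com/$1 [R=301,L]

# 2. Canonicalize the hostname: anything that isn't
#    bare example.com gets redirected to it
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]
```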

So let’s have a look at what robots are doing at a site they should not be visiting at all. This month, there didn’t happen to be any human visits, so that makes it easier.

Raw numbers:

Out of 182 requests (excluding myself):

11 (6%) IPv6, the rest IPv4

80 (44%) blocked, 102 (56%) not

The non-blocked group breaks down as:
86 requests for robots.txt (counting redirects)
5 CSS or favicon
11 html, of which 4 were redirects and 4 more came from certain categories of humanoid that can’t readily be blocked--and even if I wanted to, it wouldn’t be worth the trouble

for a net total of 3 (1.6%) non-blocked requests from unwanted robots.

Now, where did those requests come from?

requests for robots.txt:

37 Googlebot
22 bingbot
10 Uptimebot (I find this number quite astounding. On sites where it is not blocked, it makes about four HEAD requests for / for every robots.txt fetch)
4 YandexBot

No sign of other search engines, whether welcome or otherwise. There were also:

6 spbot (OpenLinkProfiler)
2 SurveyBot from DomainTools
3 from a robot claiming to be Chrome 45

and, finally, one each from
ips-agent
panscient

Those last two are interesting because on other sites they are blocked (that is to say, not whitelisted) due to pervasive refusal to heed robots.txt, even when presented in the simple form

User-Agent: your-name-here
Disallow: /

Possibly they do not understand anything more complex than

User-Agent: *
Disallow: /

(In which case, ahem, you really should not be calling yourself “panscient”.) In the course of the month, nobody asked for robots.txt and then went ahead and requested a page anyway. So that’s something.

Do not pass Go, do not collect your page

Before I get on to user-agents, which is what I really wanted to talk about, let’s look at requests and referers:

Predictably, 4 requests were assorted familiar referer spams in the form "junk-name.com". (Is it any wonder I view hyphenated domain names with suspicion?)
More common--14 requests (7.5%) total--were auto-referers, where the exact name of the file they were asking for was repeated in the referer slot.
A couple more may have been legitimate referers from one or another of those data-mining entities. Probably bogus, though.
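Auto-referers of that sort can actually be caught mechanically. One hedged way to do it in htaccess uses mod_rewrite's trick of matching a back-reference across a concatenated test string (the @@ separator is an arbitrary illustrative choice):

```apache
# Deny requests whose Referer is exactly the URL being requested.
# The pattern captures the path from the Referer and requires the
# same path to follow the @@ separator, i.e. Referer == request.
RewriteEngine On
RewriteCond %{HTTP_REFERER}@@%{REQUEST_URI} ^https?://[^/]+(/[^@]*)@@\1$
RewriteRule ^ - [F]
```

The caveat is that a request path containing a literal @ would slip past this sketch; it's a starting point, not a finished rule.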

Equally predictable were the robots asking for nonexistent wp files--a total of 11 requests from 2 different visitors.

User-Agent:

A total of 8 requests (4%) came in with no user-agent. These are obviously the easiest to block. Show me some ID or you’re never getting in.
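A minimal sketch of that block, assuming mod_rewrite: when the User-Agent header is absent, %{HTTP_USER_AGENT} expands to the empty string, so one condition covers both missing and empty UAs.

```apache
# 403 any request that arrives with no User-Agent header at all
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^ - [F]
```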

If a UA-initial “Mozilla” was ever useful in identifying humans, it no longer is. 165 of the total (90%) claimed to be Mozilla Something, 139 of them Mozilla/5.

There’s a scattering of named or at least namable entities, with one or two each of (not their full UA strings):

scrapy-redis
CRAZYWEBCRAWLER
HEADMasterSEO
Virusdie
Dataprovider

and just one fake googlebot, arriving from a well-known server farm and mysteriously asking for
/language/en-GB/en-GB.xml
--a request that seems awfully specific for someone that’s never before set eyes on the site, especially since no file with this name has ever existed. Oddly, another unrelated robot on a different date also requested this file and nothing else.

That second fake-British-accented robot, incidentally, was the month’s only “Python-urllib”. There was also a lone Lynx--which I guess could have been human. I’ve no idea what a “real” Lynx UA string looks like these days, or what headers they send.

But all this, you’ll notice, does not add up to anywhere near 80 (the number of blocked requests). That’s because robots have largely figured out that you have to put on something like a human mask. In total:

28 putative Firefox, ranging from the absurd (3.0.1) to the plausible (50)
28 putative MSIE or “Trident”, mostly 6.0 (yeah, right) but extending up to 11; this group includes a robot from 38.100 that I’ve been seeing all over the place for years, which always requests .css to go with its 403
24 putative Chrome, from 27 on up, including a few 34--though not as many as I’ve seen elsewhere

74 claimed to be on Windows NT 5, 6 or 10
7 claimed to be Linux
1 claimed to be Mac

Stay tuned for the regular At Home With The Robots, coming one of these months when I get around to it.

keyplyr

5:05 am on Apr 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think bot traffic on a test site that has (presumably) no incoming links, no SE indexing, no promotion (directories, social media, etc.) and no thematic category must be limited to crawlers hitting hosting companies & Whois... unless one of the above *is* happening.

You said you allow the facebook bot. Why would it request files if you or someone hasn't posted a link there? It's vertical, not linear.

I don't have a test site per se. I run a Tomcat server on an old Linux box that I use for testing, all offline now.

TorontoBoy

1:50 pm on Apr 5, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks for the test and analysis. I have always wanted to know how other sites fare with bots in the wild. Many of the bots mentioned also visit me.

lucy24

4:43 pm on Apr 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You said you allow the facebook bot.

Twitterbot actually--but in theory facebook could get in too, since the site is on the same header-based access controls as everyone else, and facebook (unlike twitter) doesn't ask for robots.txt. I think Skype can also get in, should it so choose. The scenario--rare but it has occurred--is:

-- human lands on front page, either by random type-in or because their search string happened to closely match the site name, leading to human curiosity overcoming the “A result for this page is not available because” blahblah
-- once there, human clicks on “As long as you’re here...” link, which exists purely to amuse humans
-- eventually human lands on a page that amuses them sufficiently to Tweet it.

Technically everything in my userspace works by blacklisting, in the sense that it's

Allow from all
Deny from blahblah

But it's de facto whitelisting, because it's all built around

BrowserMatch NiceBot !blahblah

and so on.
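Spelled out a little further, that pattern might look like this--a sketch in Apache 2.2-style Allow/Deny to match the directives quoted above; the bot names and the env-var name are placeholders, not the actual rules:

```apache
# Blacklist in form, whitelist in effect: mark every user-agent
# as unwanted, then un-mark the bots that are allowed in.
BrowserMatch ^ unwanted
BrowserMatch Googlebot !unwanted
BrowserMatch bingbot !unwanted
BrowserMatch Twitterbot !unwanted

Order Allow,Deny
Allow from all
Deny from env=unwanted

# Caveat: a request with no User-Agent header at all never
# matches BrowserMatch, so blank UAs need their own rule.
```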

Edit:
all offline now
I've got MAMP, and now do most basics like testing php changes there. But I still do a test-request Every Single Time I've modified htaccess, because I know from painful experience what can happen in the 1 time out of 1000 that I've goofed.