And now for something--well--
a little bit different.
Like most people, I’ve got a test site. It serves the usual test-site purposes: If I’ve learned a new word of PHP, I try it here first. If I’ve changed a comma in my shared htaccess, I request a page here to make sure nothing has melted. That kind of thing.
The site contains a few visible pages to amuse passing humans, because no name is so impossible that someone won’t sooner or later type it in by accident. But mainly it’s got directories with improbable names that robots won’t guess.
The whole thing is 99% roboted-out (the other 1% being things like the Twitterbot, on account of those human-accessible pages). Since it’s a dot com, its name is knowable. And since it’s got an IPv6 address, logs show both forms of visit. (IPv4 sites only show IPv4 in logs. That means I myself show up with two different addresses, depending on which site I’m visiting.) The whole thing is http, but it does have the other usual redirects--index.html and domain-name-canonicalization.
So let’s have a look at what robots are doing at a site they should not be visiting at all. This month, there didn’t happen to be any human visits, so that makes it easier.
Raw numbers: Out of 182 requests (excluding myself):
11 (6%) IPv6, the rest IPv4
80 (44%) blocked, 102 (56%) not
The non-blocked group breaks down as:
86 requests for robots.txt (counting redirects)
5 CSS or favicon
11 html, of which 4 were redirects and 4 more were from certain categories of humanoid that can’t readily be blocked even if I wanted to--and it wouldn’t be worth the trouble
for a net total of 3 (1.6%) non-blocked requests from unwanted robots.
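For the curious, nothing fancier than pattern matching on the raw access log is involved in getting these tallies. A minimal sketch in Python--assuming Apache’s “combined” log format; the sample lines and the helper name are my own invention, not the real log:

```python
import re
from collections import Counter

# Apache "combined" log format: IP, identd, user, [time], "request",
# status, bytes, "referer", "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def tally(lines):
    """Count requests by status code and by requested path."""
    by_status, by_path = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # malformed line; skip it
        by_status[m.group("status")] += 1
        parts = m.group("request").split()
        path = parts[1] if len(parts) >= 2 else "-"
        by_path[path] += 1
    return by_status, by_path

# Invented sample lines, just to show the shape of the thing:
sample = [
    '66.249.64.1 - - [01/May/2017:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 31 "-" "Googlebot/2.1"',
    '203.0.113.9 - - [01/May/2017:00:00:02 +0000] "GET /secret/ HTTP/1.1" 403 199 "-" "Mozilla/5.0"',
]
statuses, paths = tally(sample)
```

From there it is just arithmetic: blocked requests are the 403s, and everything else gets sorted by path.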
Now, where did those requests come from?
Requests for robots.txt:
37 Googlebot
22 bingbot
10 Uptimebot (I find this number quite astounding. On sites where it is not blocked, it makes about four HEAD requests for / for every robots.txt fetch)
4 YandexBot
No sign of other search engines, whether welcome or otherwise. There were also:
6 spbot (OpenLinkProfiler)
2 SurveyBot from DomainTools
3 from a robot claiming to be Chrome 45
and, finally, one each from
ips-agent
panscient
Those last two are interesting because on other sites they are blocked (that is to say, not whitelisted) due to pervasive refusal to heed robots.txt, even when presented in the simple form
User-Agent: your-name-here
Disallow: /
Possibly they do not understand anything more complex than
User-Agent: *
Disallow: /
(In which case, ahem, you really should not be calling yourself “panscient”.) In the course of the month, nobody asked for robots.txt and then went ahead and requested a page anyway. So that’s something.
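Checking “asked for robots.txt and then requested a page anyway” is a single pass over the log in chronological order. A rough sketch--the (ip, path) tuples here stand in for parsed log entries, which is an assumption on my part:

```python
def robots_then_page(events):
    """events: (ip, path) tuples in chronological order.
    Return the IPs that fetched robots.txt and later asked for
    some other page anyway."""
    saw_robots = set()
    offenders = set()
    for ip, path in events:
        if path == "/robots.txt":
            saw_robots.add(ip)
        elif ip in saw_robots:
            offenders.add(ip)
    return offenders

# This month's pattern: every robots.txt fetcher stopped there.
events = [
    ("66.249.64.1", "/robots.txt"),
    ("203.0.113.9", "/index.html"),   # never asked for robots.txt at all
    ("66.249.64.1", "/robots.txt"),
]
```

An empty result is the good outcome; a nonempty one names the robots that read the rules and ignored them.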
Do not pass Go, do not collect your page

Before I get on to user-agents, which is what I really wanted to talk about, let’s look at requests and referers:
Predictably, 4 requests were assorted familiar referer spams in the form "junk-name.com". (Is it any wonder I view hyphenated domain names with suspicion?)
More common--14 requests (7.5%) total--were auto-referers, where the exact name of the file they were asking for was repeated in the referer slot.
A couple more may have been legitimate referers from one or another of those data-mining entities. Probably bogus, though.
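Auto-referers, at least, are trivial to detect mechanically: the referer slot is just this site’s own URL for the very page being requested. A sketch, with example.com standing in for the real host names:

```python
from urllib.parse import urlsplit

# Stand-ins for the real site's host names (an assumption for the example)
MY_HOSTS = {"example.com", "www.example.com"}

def is_auto_referer(path, referer):
    """True if the referer is this site's own URL for the page requested."""
    if not referer or referer == "-":
        return False  # no referer at all: common and unremarkable
    parts = urlsplit(referer)
    return parts.hostname in MY_HOSTS and parts.path == path
```

Query strings and trailing slashes would need more care on a real site, but the principle is that simple.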
Equally predictable were the robots asking for nonexistent wp files--a total of 11 requests from 2 different visitors.
User-Agent

A total of 8 requests (4%) came in with no user-agent. These are obviously the easiest to block. Show me some ID or you’re never getting in.
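In a combined-format log, a missing User-Agent header shows up as "-" (or an empty string) in the final quoted field, so picking these out takes one regular expression. A sketch, with invented sample lines:

```python
import re

# The last quoted field on a combined-format log line is the user-agent.
AGENT_RE = re.compile(r'"([^"]*)"\s*$')

def missing_agent(line):
    """True if a combined-format log line carries no User-Agent."""
    m = AGENT_RE.search(line)
    return bool(m) and m.group(1) in ("", "-")

lines = [
    '203.0.113.9 - - [01/May/2017:00:00:02 +0000] "GET / HTTP/1.1" 403 199 "-" "-"',
    '66.249.64.1 - - [01/May/2017:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 31 "-" "Googlebot/2.1"',
]
blockable = [line for line in lines if missing_agent(line)]
```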
If a UA-initial “Mozilla” was ever useful in identifying humans, it no longer is. 165 of the total (90%) claimed to be Mozilla Something, 139 of them Mozilla/5.
There’s a scattering of named or at least namable entities, with one or two each of (not their full UA strings):
scrapy-redis
CRAZYWEBCRAWLER
HEADMasterSEO
Virusdie
Dataprovider
and just one fake googlebot, arriving from a well-known server farm and mysteriously asking for
/language/en-GB/en-GB.xml
--a request that seems awfully specific for someone that’s never before set eyes on the site, especially since no file with this name has ever existed. Oddly, another unrelated robot on a different date also requested this file and nothing else.
That second fake-British-accented robot, incidentally, was the month’s only “Python-urllib”. There was also a lone Lynx--which I guess could have been human. I’ve no idea what a “real” Lynx UA string looks like these days, or what headers they send.
But all this, you’ll notice, does not add up to anywhere near 80 (the number of blocked requests). That’s because robots have largely figured out that you have to put on something like a human mask. In total:
28 putative Firefox, ranging from the absurd (3.0.1) to the plausible (50)
28 putative MSIE or “Trident”, mostly 6.0 (yeah, right) but extending up to 11; this group includes a robot from 38.100 that I’ve been seeing all over the place for years, which always requests .css to go with its 403
24 putative Chrome, from 27 on up, including a few 34--though not as many as I’ve seen elsewhere
74 claimed to be on Windows NT 5, 6 or 10
7 claimed to be Linux
1 claimed to be Mac
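These browser and OS tallies come down to nothing more than substring matching on the UA string. A rough sketch--the token lists are my own guesses at the obvious markers, not anything authoritative, and the sample UA strings are invented:

```python
from collections import Counter

# First matching token wins; "Trident" is checked before "MSIE" since
# newer IE strings carry only the Trident token.
BROWSER_TOKENS = [("Firefox/", "Firefox"), ("Trident", "MSIE/Trident"),
                  ("MSIE", "MSIE/Trident"), ("Chrome/", "Chrome")]
OS_TOKENS = [("Windows NT", "Windows"), ("Linux", "Linux"),
             ("Mac OS X", "Mac")]

def classify(agents):
    """Tally claimed browsers and OSes by naive substring matching."""
    browsers, oses = Counter(), Counter()
    for ua in agents:
        for token, label in BROWSER_TOKENS:
            if token in ua:
                browsers[label] += 1
                break
        for token, label in OS_TOKENS:
            if token in ua:
                oses[label] += 1
                break
    return browsers, oses

sample = [
    "Mozilla/5.0 (Windows NT 6.1; rv:50.0) Gecko/20100101 Firefox/50.0",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/34.0.1847.131 Safari/537.36",
]
browsers, oses = classify(sample)
```

None of which tells you anything about what the visitor actually is, of course--only what it claims to be.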
Stay tuned for the regular At Home With The Robots, coming one of these months when I get around to it.