At Home with the Law-Abiding Robots (2021)


lucy24

7:50 pm on Sep 6, 2021 (gmt 0)

And now for something completely different ... This year, I took a full month’s archived logs--August 2021--and looked only at robots.txt requests.
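
If anyone wants to try this at home, the first pass is nothing fancy: pull the robots.txt lines out of the raw access log and tally the User-Agents. A Python sketch of the idea--assuming the standard "combined" log format; the filename and the regex are illustrative and will need adjusting for your own setup:

import re
from collections import Counter

# Assumes Apache-style "combined" log lines; adjust for your own format.
REQUEST = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*"'
    r' \d{3} \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

counts = Counter()
with open("access_log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = REQUEST.search(line)
        if m and m.group("path") == "/robots.txt":
            counts[m.group("ua")] += 1

# Top robots.txt requesters, most frequent first.
for ua, n in counts.most_common(20):
    print(f"{n:6d}  {ua}")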

On my site, robots.txt is exempt from all canonicalization redirects. Some robots seem to get confused when a robots.txt request is redirected (I noticed this years ago with www redirects), and I don’t want to give them any excuse for noncompliance.
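
The logic amounts to nothing more than "test for robots.txt before any canonicalization tests." Abstracted into a Python sketch--the canonical hostname here is a stand-in, and the real rules live in my server config, not in Python:

def canonical_redirect(path: str, scheme: str, host: str) -> str | None:
    """Return a redirect target, or None to serve the request as-is.
    robots.txt is tested first, so it is never redirected."""
    if path == "/robots.txt":
        return None                                   # always exempt
    if scheme != "https" or host != "example.com":    # stand-in canonical host
        return "https://example.com" + path
    return None

print(canonical_redirect("/robots.txt", "http", "www.example.com"))  # None
print(canonical_redirect("/page.html", "http", "www.example.com"))   # https://example.com/page.html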

I currently serve two versions: the “real” robots.txt, which lists all the Disallowed User-Agents and directories and other stuff you’d expect to find; and a minimalist one that just says
User-Agent: *
Disallow: /
The latter is sent out to some robots that I can say for certain will always be disallowed: ones from bad neighborhoods, or known bad agents, or ones that claim to have a referer, or pretend to be human. (This means, of course, that actual humans snooping around will probably see a file that says the whole site is roboted-out. But it can’t be helped.)
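
Reduced to a sketch, the selection between the two files looks something like this--in Python for illustration only; the agent list and the tests are simplified stand-ins for what actually lives in my server config:

MINIMAL_ROBOTS = "User-Agent: *\nDisallow: /\n"
FULL_ROBOTS = "User-Agent: *\nDisallow: /private/\n"   # stand-in for the real file

BAD_AGENTS = ("ltx71", "centuryb.o.t9")   # real examples, both named later in this post

def robots_txt_for(user_agent: str, referer: str | None) -> str:
    """Requests that will always be disallowed--known bad agents, or
    robots.txt fetches that arrive with a referer (no robot has any
    business sending one)--get the minimal everything-disallowed file."""
    ua = user_agent.lower()
    if referer or any(bad in ua for bad in BAD_AGENTS):
        return MINIMAL_ROBOTS
    return FULL_ROBOTS

print(robots_txt_for("SeznamBot/3.2", None))                 # full file
print(robots_txt_for("ltx71 - (http://ltx71.com/)", None))   # Disallow: /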

The site has been HTTPS for coming up on two years. Currently, almost exactly 2/3 of robots.txt requests come in as HTTPS. Interestingly, most law-abiding robots use both protocols, though with a strong bias towards HTTPS. So far, only a handful of robots use HTTP/2 consistently: bingbot, AhrefsBot, Neevabot.

Some robots match the protocol of the page or other file they’re aiming for: HTTP robots.txt is followed by HTTP page, HTTPS by HTTPS page. The w3 link checker definitely seems to do this. (I can easily tell, because I’m the one that fed it the wrong protocol, or neglected to include one.)

Under the head of NSS, something like 85% of robots.txt requests are from authorized robots. Probably a little more, since the other 15% includes robots that I’m currently testing and may eventually authorize.

The winners are:

SeznamBot, 19% of all robots.txt requests. (Have I ever had a human visitor sent by the Seznam search engine? I kinda think not, but they’re welcome to stop by.)

SemrushBot, 17%. (To this day, I have no idea what this robot does. But it does no harm, and must benefit somebody somewhere.)

DotBot, about 15% of all requests, every last one HTTP. This number could actually have been much higher, since it ran a consistent 28-30 requests per day during the first part of the month. But then, as discussed in a different thread, I got exasperated by their endless HTTP requests--including for pages that were only created after the site went HTTPS--and blocked them. They’re welcome to make HTTPS requests, which they are perfectly capable of doing, but they have chosen not to. After a few days of 403s they stormed off in a huff and I haven’t set eyes on them since.

bingbot, 5.5%. Remember when they were king of the robots.txt hill? That was long ago.

YandexBot, 5%. Their requests are heavily weighted toward HTTPS; some spot-checking shows that every HTTP robots.txt request is followed by an HTTP request for the root, apparently to verify that it will get a redirect. (Frankly, I like this behavior in Yandex. Once they know a site is HTTPS, that’s all they use.)

Googlebot, in case anyone wondered, is nowhere: just 35 robots.txt requests for the whole month (15 HTTP, 20 HTTPS), or about one a day. Much like Yandex, it follows each HTTP robots.txt request immediately with an HTTP request for some random file that will get a 301. Do they store the HTTP and HTTPS robots.txt in different places?

HTTP only:
One that stands out is the Vietnamese search engine Coccoc. Its robots.txt requests are always HTTP, though it has no trouble using HTTPS for other files.

Not so fortunate is The Knowledge AI, which to this day doesn’t seem to be able to do HTTPS at all. As a result, aside from robots.txt it has seen nothing but 301s for the last few years.

HTTPS only:
A fair number of law-abiding robots make all their requests, including robots.txt, by HTTPS. They may be matching the protocol of the page they want. Quite a few of them are targeted robots, such as the ones following an RSS feed.

YMMV:

Nutch:
I’ve never looked into the background of this robot family, but must say it’s a well-written script. In my experience, everything with “Nutch” in the User-Agent is fully robots.txt compliant, and sends human headers, so the only ones that get blocked are the ones from a bad neighborhood. (Many sites of course block Nutch by name, but I generally don’t. Among other things, I have never seen a Nutch robot requesting a supporting file, so it’s really no skin off my nose.) This past month, Nutch-based robots accounted for something over 2% of robots.txt requests, almost all of them HTTPS.

Chinese search engines:
This gets a category to itself because they are globally Disallowed (and, where necessary, physically blocked). Compliance varies widely.

The most frequent requester is petalbot (formerly aspiegelbot). Anomalously, it seems to be fully compliant, so once it’s Disallowed, there are no further requests.

YisouSpider also seems to be compliant, since it never asked for anything but robots.txt (always HTTP).

Sogou occasionally asks for robots.txt. I can’t think why, since it goes ahead and requests pages anyway.

Above all: what on earth has happened to Baidu? Did they furtively change their name and move to a new address? In the whole month of August I see only one page request (blocked, probably a faker), and no robots.txt at all.

Bad Boys:

And then there are the ones that either ask for robots.txt only to ignore it, or ask only after one or more page requests have been blocked. (The latter are not immediately followed by requests for all Disallowed directories, so I don’t know what the point is.) Here I’ll only talk about robots that have a Disallow line to themselves, meaning that they’ve got absolutely no excuse for noncompliance. In some other cases, I either never got around to making a separate line, or somehow never Disallowed them at all. Oops.
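
(By “a Disallow line to themselves” I mean a named block of the form

User-Agent: ltx71
Disallow: /

so there’s no possible ambiguity about who is meant.)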

The most prominent ask-but-ignore robot has for many years been ltx71, accounting for a whopping 3.5% of robots.txt requests. I have never bothered to find out what they ostensibly do.

While fine-tooth-combing the month’s requests, I found that the Disallowed centurybot9 has renamed itself “centuryb.o.t9”, for all the good it does. Anomalously for malign robots, it used HTTP/2.

Others in the ask-and-ignore group: heritrix, linkdexbot, MegaIndex, Qwantify, Screaming Frog, SerendeputyBot, SMTBot, YaK. I think Qwantify used to be allowed, and even made it onto my Ignore list, until I caught it misbehaving.

Humanoids:
A good 5% of all robots.txt requests are from humanoid User-Agents: mainly Firefox, some Chrome, a handful of MSIE. Now, a few of those might be actual humans visiting the site and detouring to snoop--but since an unusually high number are HTTP, I tend to doubt there are many.

Yeah, Right:
In the course of the month, I found:
--a total of seven robots.txt requests from “Googlebot (gocrawl v0.4)” living at various addresses. None were accompanied by any other requests; were they simply testing out a new robot?
--one robots.txt request from “Go-http-client”, with no accompanying page request
--one from User-Agent “robot”, sandwiched between a HEAD / and a GET / (both blocked). All three were unexpectedly HTTP/2, suggesting a certain inconsistency on the bot-writer’s part: they spent so much time teaching the robot to use 2.0, they forgot to name it.
--one with no User-Agent

not2easy

8:32 pm on Sep 6, 2021 (gmt 0)

Good to see you found some time to squeeze the logs again. Really appreciate the effort and the organized results you have shared here and over the years. Thank you lucy24!