Forum Moderators: open


robots and https

things I’ve learned


lucy24

1:54 am on Dec 17, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



At the beginning of November I took the plunge and moved my last site to HTTPS. This involved a one-time investment of about 90 seconds to get a free certificate and tweak htaccess ... followed by an ongoing investment of many extra hours tracking redirects in logs.

First the Bad News: Robots are not going to have much trouble with HTTPS. By raw count, in the current month’s logs almost half of all blocked requests (47%) are HTTPS. This number can only go up. But the numbers are much more cheering when you look at the unambiguously bad requests. Robots who don’t send a User-Agent at all run almost exactly 10:1 HTTP:HTTPS. Robots who ask for things like /wp-admin or /xmlrpc.php (I don't have any top-level directories in x or w, so I just searched for /[wx]) are overwhelmingly HTTP; the HTTPS requests can be counted on your fingers. (Literally.)

What Happens in Logs

Thanks to keeping close track of HTTP-to-HTTPS redirects, I have also been obliged to follow categories of redirect that I normally ignore. In particular this involves the directory slash, about which more below. Since I don’t log headers on redirects, there is no telling what proportion of requests would have been redirected on canonicalization grounds alone:
http://example.com/
http://www.example.com/
will both be sent to
https://example.com/
showing the identical 301 in logs.
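
If anyone wants it spelled out, the rule behind that single 301 is something along these lines (a stripped-down sketch of my htaccess, with example.com standing in for the real hostname):
# anything that is not already https, or that arrives under the wrong hostname,
# gets one 301 to the canonical https://example.com/ form
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]
Both of the requests above fall through the same rule, which is why the log entries are indistinguishable.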

In the beginning, of course everything is redirected, humans and robots alike. But as the search engines catch up, the redirects drop. About three weeks in, Google--both .com and the national variants--had updated all but the most obscure pages. Humans using other search engines--bing, yahoo, DuckDuckGo, ecosia--were redirected for a bit longer.

By now, a month and a half into the move, almost the only redirects are the ones that show up without referer. Unfortunately it is well-nigh impossible to tell whether these are humans returning to a page they’ve got bookmarked, or very very very lifelike robots. Obvious-in-retrospect robots--the ones that get the HTML and nothing else, using fully humanoid UA and headers--at this point are running about half-and-half, HTTP-to-HTTPS or direct HTTPS.

That leaves the search engines and other authorized robots. At any given time, I’ve got a month of redirects in easy-to-read-and-analyze format; after that it’s back to raw logs.

Unanswerable Question: Why is, for example, Chapter 27 of {cheesy public-domain novel} so much more attractive than Chapters 26 and 28? In some cases, the individual chapters aren’t even indexed, so it can’t be that they randomly contain some text string that shows up in unrelated searches.

White Noise: Earlier this fall, someone at the University of Georgia seems to have assigned {minor work by major 18th-century satirist}. Whether because they put the link in writing or because my copy of this work comes out near the top in Google searches, there are a LOT of human requests for the file--and hence a lot of redirected human requests later on, when they either pulled it up from their browser history or retyped a now-outdated link. It must be a phenomenally popular class, because there are more of these, more often, than you would think. Enough that I have to physically scroll past them in order to look at the rest of the list.

Technical Issues

Redirect to 3xx or 4xx: It is generally understood that chained redirects, or redirects to a 404, are Not A Good Idea. But there’s only so much you can do. When the Googlebot asks for
http://example.com/string-of-garbage
I am not going to put my server to the work of checking whether /string-of-garbage exists; they get a global redirect to
https://example.com/string-of-garbage
and THEN they get their well-earned 404.
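
For comparison, making the server check first would take something like this (a sketch of what I am deliberately NOT doing; it assumes the htaccess sits at the document root):
# only redirect requests whose target actually exists as a file or directory
RewriteCond %{DOCUMENT_ROOT}/$1 -f [OR]
RewriteCond %{DOCUMENT_ROOT}/$1 -d
RewriteRule (.*) https://example.com/$1 [R=301,L]
That is two extra filesystem checks on every single redirected request, just so a handful of garbage URLs can collect their 404 one hop earlier. Not worth it.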

Index Redirects: We all have an /index.html redirect in place, don’t we? I have now learned by direct observation that NOBODY requests /directory/index.html
What, never?
No, never.
What, never?
Really. Never, ever. The ONLY exception is when /directory/index.html has at some time been in use as a visible URL. (This applies to a couple of directories on my personal site, going back, well, a whole lot of years. I think I instituted an index redirect in 2012.) In logs, the only /index.html redirect I ever, ever see is when I myself am checking on a new page, and do so by clicking the physical file--which of course is named “index.html”--in Fetch.
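
For completeness, the redirect itself is nothing exotic. A stripped-down sketch of the sort of rule I mean (the THE_REQUEST condition keeps it from firing on the server’s own internal index lookups; example.com is a placeholder):
# redirect only when the visitor literally asked for index.html
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/(.*/)?index\.html[\s?]
RewriteRule ^ https://example.com/%1 [R=301,L]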

Directory-Slash Redirects: Certain parties are inordinately fond of requesting /directory without slash when they have never been given any reason to expect anything but /directory/ with slash. In the normal course of events, this will be taken care of by mod_dir in Apache, or the equivalent in the server of your choice, though possibly only after the canonicalization redirect.
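
(Specifically, it’s mod_dir’s DirectorySlash behaviour, which is on by default; a rough sketch of what it does:)
# mod_dir, with DirectorySlash On (the default), answers a slashless request
# that maps to an existing directory with a 301 adding the trailing slash
DirectorySlash On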

The exceptions are intriguing; they include--but are not limited to--pages whose URL used to be in the form /directory/subdir/PageName.html (where this was the only URL in the directory), but are now /directory/subdir/ alone.

The top offender, currently accounting for 3/4 of all without-final-slash requests, is the Applebot. Earlier this year, before I went HTTPS, I looked at logs and established that about half their initial requests--which is to say 1/3 of all requests when you include redirects--were for /directory without trailing slash.

Second-most common is bingbot, with about 1/6 of the requests (i.e. about 2/3 of what’s left when you exclude Applebot). This is the same bingbot that invests such a huge part of its crawl budget, year in and year out, on pages that have returned a 410 since 2013 and cannot possibly be linked from anywhere. There is also a scattering of slashless requests from Googlebot. I think this is more about spot-checking on their part: every few days, let’s make sure those index redirects are happening the way they’re supposed to. And a few from DotBot, about which more elsewhere.

Now, that adds up to a fair number of chained redirects that arise purely because someone requested an URL they had no business requesting. I finally got fed up and made a supplementary rule, immediately before the canonicalization redirect:
RewriteRule ^ebooks/(\w+)$ https://example.com/ebooks/$1/ [R=301,L]
In other words, a RewriteRule that does exactly what mod_dir already does, except that it doesn’t bother to check whether the directory actually exists, and it canonicalizes at the same time. I decided that’s a fair compromise. Requests involving other directories are too infrequent to be worth the trouble.

In Detail

From mid-November to mid-December here’s what we see.

bingbot: 24% of all redirects. Past experience says that this percentage will creep higher and higher over the years, as bing continues asking for URLs that everyone else has long since back-burnered. In December, their requests overall run about 3:1 HTTPS:HTTP.

Googlebot: 20% of all redirects. Shortly after they discovered that HTTPS was available, they did a full spidering of the whole site, exactly as if it were a brand-new site. (I have noticed this before.) But this hasn’t stopped them from continuing to request HTTP pages. Currently requests run about 2:1 HTTPS:HTTP. Oddly, this is lower than November, when they showed about 3:1, like bing.

DotBot: 14% of all redirects. (What the heck is the DotBot, anyway? Something to do with Mozilla, I think.) Unlike other robots, their overall request pattern is still overwhelmingly HTTP over HTTPS; they request all kinds of things by HTTP, while their few HTTPS requests are strictly for pages. In fact, they are almost the only robot I know of that requests pages at HTTP that did not exist until after the site went HTTPS. (Another holdout is trendictionbot, which uses HTTP for everything on its established shopping list, even while new requests are HTTPS.)

Applebot: 12% of all redirects. It would be a lot lower if they didn’t persist in making those bogus /directory-without-slash requests. But their HTTPS requests do at least outnumber HTTP requests, though only by about 4:3.

BLEXBot: 6% of all redirects. They really seem to have got the HTTPS message; for December it runs about 7:2 HTTPS:HTTP.

MJ12bot: 5% of all redirects. In December, HTTPS requests are slightly ahead of HTTP.

Blackboard Safeassign: 4% of all redirects. This is another “white noise” heading, though. Almost all their requests were for the contents of one directory, plus a single page from a second directory, requested in vast numbers on a single date. And the whole thing looks like a misguided effort, because I don’t remember any significant human requests for those directories.
Editorial: I honestly don’t believe a computer can distinguish between plagiarism and legitimate text-matching, as when you’re quoting from a book that you are supposed to have read. I can only hope that any flags sent up by a plagiarism-checking utility are followed up by individual human investigation on the part of the teacher.

SeznamBot: 3% of all redirects.

AhrefsBot: 3% of all redirects.

CCBot: 2% of all redirects.

Others:
-- The Knowledge AI is noteworthy because it doesn’t do HTTPS at all. Ever. It dutifully picks up redirects, but as far as I know it has never made an HTTPS request.
-- Yandex, on the other hand, likes HTTPS. I noticed on an earlier site that once it went HTTPS, all subsequent Yandex requests were consistently HTTPS, even for URLs that no longer existed on the site and had therefore never been HTTPS. In other words, the exact opposite of DotBot, which requests pages at HTTP that have only ever existed as HTTPS.

iamlost

3:24 am on Dec 17, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



DotBot is courtesy of Moz, née SEOMoz (not Moz née Mozilla).

tangor

10:16 am on Dec 17, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@lucy24 ... I always look forward to your log analysis and efforts at bot management. There's always something new in your reporting!

Thanks!

SumGuy

3:15 am on Jan 8, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



I am still operating http and https independently (with identical content). I see google crawling and referring to both of them (I haven't tried to determine the ratio). I think google and bing crawl the https site less intensively than the http site (based on bringing the logs up in a text editor from time to time). Now that I think about it, applebot may not be hitting my https site (need to double check that). There are a couple of folder paths that only contain PDF files that I am redirecting from the http site to the https site.

I don't think I've ever seen dotbot. Or blexbot. What IPs do they operate from?

Why do you allow MJ12bot? I block any /16 I see them come from. I see them very, very rarely now. What good is it? My impression is that it's CIA or DOD (or, more likely, MI5 or 6?). Majestic.

SeznamBot: 5 or so years ago, when I started looking intensively at my web logs, I was seeing them. And allowing them. Then I got disillusioned with them and began to block them (IP block). What good are they?

AhrefsBot: I have no use for them. IP blocked.

CCBot: I don't see them, so must have blocked them.

Yandex and the big Chinese bot (name escapes me): I block them both. I have no need for them. I think I get the odd referral from them though.

dstiles

11:23 am on Jan 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yandex is a popular search engine outside of Russia. It's worth allowing. It has a presence in USA.

I think the MJ12 bot is a distributed one - it used to be, certainly. It runs from volunteer broadband users. As such it should not be blocked per /16 as you risk blocking genuine visitors. MJ12 is a good bot (though I ban it as useless to me) and obeys robots.txt directives, which is the best way to block it.

Lucy: re: bingbot, I've noticed recently that it makes requests on the IP with no or incorrect domain name, logged as "No hostname was provided via SNI for a name based virtual host".

Since I do not allow this, I regularly return 403 for those specific hits.

not2easy

1:36 pm on Jan 8, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Why do you allow MJ12bot ? I block any /16 I see them come from.

I do not block the MJ12 bot by IP or CIDR for the same reason dstiles mentions - it is a distributed bot. It can come from anywhere, even residential ISP IPs. Block the UA if you don't want it crawling. Before blocking any unknown bot you can search and find info regarding their purpose and compliance habits in these threads.

SumGuy

2:23 pm on Jan 8, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yes, I know MJ12bot operates many times from residential ISPs, but I still don't know who is behind it or why an average joe would (knowingly) operate it on his home computer. If some boob is running it at home and gets a different IP from time to time, it may just be another one in the same /16, so that's easy to block that way. Majestic has something to do with gov intel or DOD, so why should I let it in the front door? Who will NOT stumble across or find my website in a search because I block it? I can tell you its distribution is limited, the IP set it comes from is relatively small (in terms of IPv4), and I don't lose sleep over it. The stupid bot that is only after PDF files, with a FF user-agent with /abcd at the end, is becoming more problematic. Many times it comes from UK residential/consumer IPs, but I sometimes see it from Germany and US residential IPs.

[edited by: SumGuy at 2:28 pm (utc) on Jan 8, 2020]

SumGuy

2:25 pm on Jan 8, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



While on the topic of http vs https, what is the current state of browsers in terms of throwing up warnings to users when they try to bring up a non-redirecting http site? What clues would I look for in the logs that would tell me a user was blocked or diverted away from browsing my site (either by choice or by browser-configuration)?

lucy24

7:54 pm on Jan 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



re: bingbot, I've noticed recently that it makes requests on the IP with no or incorrect domain name, logged as "No hostname was provided via SNI for a name based virtual host".
You’ve got your own server, haven’t you? On shared hosting involving the VirtualHost envelope, requests without a hostname never reach my site, so they’re not visible to me in logs. This applies even if the site has a unique IP address, which not all sites do.

I’m generally pretty lenient with robots, so long as they appear to behave themselves. I flatly bar anyone who admits to being from China, but why shouldn’t I admit people from Russia/Turkey (Yandex) or Czech-thingy (Seznam)?

In particular, I tend to allow robots associated with some service, even if I don’t personally use the service, on a principle of “You scratch my back, I’ll scratch yours”. If a site doesn’t allow, say, the W3C link checker to do its thing, I end up having to check the link manually. That in turn means that instead of a quick HEAD request for the HTML alone, the site gets a full request for the page with all its supporting files, which benefits nobody. I'm not going to spend time on the page and look at its ads all over again; I just need to make sure the URL is still valid.

what is the current state of browsers in terms of throwing up warnings to users when they try to bring up a non-redirecting http site? What clues would I look for in the logs that would tell me a user was blocked or diverted away from browsing my site (either by choice or by browser-configuration)?
All I can say is that so far, Firefox doesn't yap about sites being http as such, though they will certainly kick up a fuss if you try to use https on a site that does not happen to have a certificate it recognizes, and will put up a warning if you enter login information on an http connection. (This, in fact, was why I first moved my personal site to https a couple of years ago. There’s no sensitive content--but it happens to be where I keep my piwik/matomo files, which of course involve a login.)

I seriously doubt that any current browser would refuse to send in an http request. Maybe in ten years’ time. Not long ago I finally looked up that “Upgrade-Insecure-Requests” header that has been visible in many human requests for several years now. I’d never known what it means, except--informally--that the visitor is probably human. Turns out its purpose is to tell the site that it’s OK to redirect the visitor to HTTPS. This is already a bit pointless, and will become increasingly pointless over the years, since most sites will redirect to https whether or not the visitor has given permission. I don’t know how old a human browser would have to be before it becomes unable to use HTTPS. More likely, it would be an inability to keep its certificate file up to date, rather than HTTPS connection as such. That was one of the reasons I finally had to stop using Camino.
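
Purely as illustration, a site that wanted to honor that header instead of redirecting everyone could do something roughly like this in Apache (a sketch, not my setup, since I redirect unconditionally; example.com is a placeholder):
# redirect to https only when the browser has said, via the header, that it
# wants to be upgraded; requires mod_rewrite, and mod_headers for the Vary
RewriteCond %{HTTPS} off
RewriteCond %{HTTP:Upgrade-Insecure-Requests} =1
RewriteRule (.*) https://example.com/$1 [R=301,L]
Header always append Vary Upgrade-Insecure-Requests
The Vary header is there so caches don’t serve the redirect to a browser that never asked to be upgraded.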

Kendo

12:56 am on Jan 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don’t know how old a human browser would have to be before it becomes unable to use HTTPS.


SSLv2 and SSLv3 are penalised in site security ratings that I have seen, which encourages site owners to disable them at the server because they have been proven exploitable, and older browsers may not support TLS. Nor do the older browsers support some of the JavaScript nonsense that is used today, which is sad, because I don't recall those earlier browsers being so memory- and resource-hungry as browsers are today.
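
Disabling them at the server is a one-liner; in Apache, for example, roughly:
# drop the exploitable protocol versions; very old builds may also want -SSLv2
SSLProtocol all -SSLv3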

If I have Firefox, Chrome and Thunderbird open after a session and walk away from my desk for a couple of days, when I return I find that the memory resources consumed by those apps have grown exponentially. For example, Thunderbird increasing from 50 Mb to 1,000,000 Mb, and similar bloated consumption by the others... the total memory being consumed can be more than what most people have available. Current usage...
Firefox - 1.159 GB
Chrome - 0.65 GB (plus 5 more instances of Chrome processes)
Thunderbird - 0.27 GB

Makes one wonder what they get up to while I am away?

tangor

1:42 am on Jan 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's all that constant "phone home" stuff. :)

And their "privacy stuff". :(

And telemetry of all kinds (just to improve their product, of course!)

My chuckle with FF is that, on a single machine that has ONLY ONE JOB AND ONE SITE to monitor, it will end up with as many as 4 instances of FF running as processes after 3 days ... and will crash that machine if it attempts to open ONE MORE ... (been there, done that; now I close FF daily, wait for all processes to "kill", and then start it back up).

Memory usage has gone WAY down since I started doing that!

YMMV


dstiles

11:37 am on Jan 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lucy:
> You’ve got your own server, haven’t you?

Yes. I turned on SSLStrictSNIVhostCheck in ssl.conf, but the bot shouldn't even be trying it. Bing is the only one that does, as far as I can tell.
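
For anyone wanting the same behaviour, the relevant bit of ssl.conf is roughly:
# refuse name-based SSL vhost access to clients that supply no SNI hostname;
# those connections then get the 403 mentioned above
SSLStrictSNIVhostCheck on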

From the MJ12 web site:

"About MJ12Bot Majestic is a UK based specialist search engine used by hundreds of thousands of businesses in 13 languages and over 60 countries to paint a map of the Internet independent of the consumer based search engines. Web site owners can see data about their own websites on majestic.com."

Which latter sentence explains why so many people distribute the bot. As I said, I block it in robots.txt and it never visits beyond that. Blocking broadband by /16 is dumb unless it's a very badly run block.

blend27

4:45 pm on Jan 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Index Redirects: We all have an /index.html redirect in place, don’t we? I have now learned by direct observation that NOBODY requests /directory/index.html
What, never?
No, never.
What, never?
Really. Never, ever.

Years ago, after I experienced a BAD URI SEO attack on one of my main sites, I renamed the index page to something else and set it to be the default document (in IIS).

Now anything that asks for it is either a bad, bad monkey bot or someone trying to 'shkru' around looking for holes in the site. Any new site that is launched nowadays takes this into consideration, and anything that requests 'index.whatever' gets its headers logged and a blank 404 returned, without even taking PROTOCOL into consideration.
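
For the Apache folks, the same idea is roughly this (a sketch; it assumes, as here, that the real index file has been renamed, so the server's own default-document lookup never asks for index.anything):
# anything that literally asks for index.<whatever> gets a 404; real visitors
# and the renamed default document never produce that name
RewriteRule (^|/)index\.\w+$ - [R=404,L]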

blend27

1:23 pm on Feb 16, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don’t know how old a human browser would have to be before it becomes unable to use HTTPS.

Take a stock Blackberry browser or something from stock Samsung GNT51+, no sugar.

lucy24

7:55 pm on Apr 30, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Update now that it’s been six months. Listed in the same order as in the original post (1½ months after the HTTPS move):

bingbot: 25% of all redirects. This will probably hold steady for years to come. If you have an URL that was redirected in 2011, any requests for it will come from bing.

Googlebot: 9% of all redirects. In December it was 20%; Google tends to be a quicker learner than Bing.

DotBot: 34% of all redirects. This figure seesaws wildly from month to month, as they tend to do massive crawls every few months. All requests start out at HTTP, including pages that were created after the HTTPS move and therefore never existed as HTTP.

Applebot: 8% of all redirects. They continue to show a weird obsession with slashless forms
/dir/subdir
for correct
/dir/subdir/
Almost all their HTTP requests are slashless, and conversely they represent about 96% of all slashless requests. (Tangent: I recently had occasion to update my linked URL in a certain site’s profile--neither here nor Apple-related. Turns out it would not accept /dir/subdir/ but insisted on saving it as /dir/subdir without slash. This irritates me.) This behavior is limited to HTTP requests, thanks in part to the explicit redirect I described in the first post.

BLEXBot: <1% of all redirects

MJ12bot: 11% of all redirects

Blackboard Safeassign: now blocked. I generally turn a blind eye on plagiarism-checkers, but when they got to the point of requesting the same URL up to 200 times in one day--unaccompanied by even a single human visit to justify it--I threw in the towel and blocked them in early January. They’re still making requests, but are now out of sight, out of mind.

SeznamBot: <1% of all redirects

AhrefsBot: 1% of all redirects

CCBot: not seen. I initially thought they were another of those occasional robots, like DotBot, but closer inspection shows that they’re simply quicker on the uptake. In the last few months their only redirected requests have been for / root or sitemap.

The Knowledge AI: 1% of all redirects. As noted at the outset, they don’t seem to be able to do HTTPS at all, but they keep patiently collecting redirects.

Yandex: Between YandexBot and YandexMobileBot, around 1% of all redirects--but they only request the root. I think this shows intelligence on their part: they assume that if the root is redirected to HTTPS, then everything else will be too.

GarlikCrawler: This is another occasional robot. Their latest activity includes the past month, so they’re at around 4% of all redirects.

tangor

10:35 pm on Apr 30, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Appreciate the update!

Very similar results, with GarlikCrawler actually running about 7% on my itty bitty site.