Forum Moderators: open


robots and https

things I’ve learned


lucy24

1:54 am on Dec 17, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



At the beginning of November I took the plunge and moved my last site to HTTPS. This involved a one-time investment of about 90 seconds to get a free certificate and tweak htaccess ... followed by an ongoing investment of many extra hours tracking redirects in logs.

First the Bad News: Robots are not going to have much trouble with HTTPS. By raw count, in the current month’s logs almost half of all blocked requests (47%) are HTTPS. This number can only go up. But the numbers are much more cheering when you look at the unambiguously bad requests. Robots who don’t send a User-Agent at all run almost exactly 10:1 HTTP:HTTPS. Robots who ask for things like /wp-admin or /xmlrpc.php (I don't have any top-level directories in x or w, so I just searched for /[wx]) are overwhelmingly HTTP; the HTTPS requests can be counted on your fingers. (Literally.)

What Happens in Logs

Thanks to keeping close track of HTTP-to-HTTPS redirects, I have also been obliged to follow categories of redirect that I normally ignore. In particular this involves the directory slash, about which more below. Since I don’t log headers on redirects, there is no telling what proportion of requests would have been redirected on canonicalization grounds alone:
http://example.com/
http://www.example.com/
will both be sent to
https://example.com/
showing the identical 301 in logs.
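
If anyone wants it spelled out, the rule behind that single 301 is something along these lines (a stripped-down sketch of my htaccess, with example.com standing in for the real hostname):
# anything that is not already https, or that arrives under the wrong hostname,
# gets one 301 to the canonical https://example.com/ form
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]
Both of the requests above fall through the same rule, which is why the log entries are indistinguishable.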

In the beginning, of course everything is redirected, humans and robots alike. But as the search engines catch up, the redirects drop. About three weeks in, Google--both .com and the national variants--had updated all but the most obscure pages. Humans using other search engines--bing, yahoo, DuckDuckGo, ecosia--were redirected for a bit longer.

By now, a month and a half into the move, almost the only redirects are the ones that show up without referer. Unfortunately it is well-nigh impossible to tell whether these are humans returning to a page they’ve got bookmarked, or very very very lifelike robots. Obvious-in-retrospect robots--the ones that get the HTML and nothing else, using fully humanoid UA and headers--at this point are running about half-and-half, HTTP-to-HTTPS or direct HTTPS.

That leaves the search engines and other authorized robots. At any given time, I’ve got a month of redirects in easy-to-read-and-analyze format; after that it’s back to raw logs.

Unanswerable Question: Why is, for example, Chapter 27 of {cheesy public-domain novel} so much more attractive than Chapters 26 and 28? In some cases, the individual chapters aren’t even indexed, so it can’t be that they randomly contain some text string that shows up in unrelated searches.

White Noise: Earlier this fall, someone at the University of Georgia seems to have assigned {minor work by major 18th-century satirist}. Whether because they put the link in writing or because my copy of this work comes out near the top in Google searches, there are a LOT of human requests for the file--and hence a lot of redirected human requests later on, when they either pulled it up from their browser history or retyped a now-outdated link. It must be a phenomenally popular class, because there are more of these, more often, than you would think. Enough that I have to physically scroll past them in order to look at the rest of the list.

Technical Issues

Redirect to 3xx or 4xx: It is generally understood that chained redirects, or redirects to a 404, are Not A Good Idea. But there’s only so much you can do. When the Googlebot asks for
http://example.com/string-of-garbage
I am not going to put my server to the work of checking whether /string-of-garbage exists; they get a global redirect to
https://example.com/string-of-garbage
and THEN they get their well-earned 404.
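
For comparison, making the server check first would take something like this (a sketch of what I am deliberately NOT doing; it assumes the htaccess sits at the document root):
# only redirect requests whose target actually exists as a file or directory
RewriteCond %{DOCUMENT_ROOT}/$1 -f [OR]
RewriteCond %{DOCUMENT_ROOT}/$1 -d
RewriteRule (.*) https://example.com/$1 [R=301,L]
That is two extra filesystem checks on every single redirected request, just so a handful of garbage URLs can collect their 404 one hop earlier. Not worth it.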

Index Redirects: We all have an /index.html redirect in place, don’t we? I have now learned by direct observation that NOBODY requests /directory/index.html
What, never?
No, never.
What, never?
Really. Never, ever. The ONLY exception is when /directory/index.html has at some time been in use as a visible URL. (This applies to a couple of directories on my personal site, going back, well, a whole lot of years. I think I instituted an index redirect in 2012.) In logs, the only /index.html redirect I ever, ever see is when I myself am checking on a new page, and do so by clicking the physical file--which of course is named “index.html”--in Fetch.
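
For completeness, the redirect itself is nothing exotic. A stripped-down sketch of the sort of rule I mean (the THE_REQUEST condition keeps it from firing on the server’s own internal index lookups; example.com is a placeholder):
# redirect only when the visitor literally asked for index.html
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/(.*/)?index\.html[\s?]
RewriteRule ^ https://example.com/%1 [R=301,L]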

Directory-Slash Redirects: Certain parties are inordinately fond of requesting /directory without slash when they have never been given any reason to expect anything but /directory/ with slash. In the normal course of events, this will be taken care of by mod_dir in Apache, or the equivalent in the server of your choice, though possibly only after the canonicalization redirect.
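
(Specifically, it’s mod_dir’s DirectorySlash behaviour, which is on by default; a rough sketch of what it does:)
# mod_dir, with DirectorySlash On (the default), answers a slashless request
# that maps to an existing directory with a 301 adding the trailing slash
DirectorySlash On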

The exceptions are intriguing; they include--but are not limited to--pages whose URL used to be in the form /directory/subdir/PageName.html (where this was the only URL in the directory), but are now /directory/subdir/ alone.

The top offender, currently accounting for 3/4 of all without-final-slash requests, is the Applebot. Earlier this year, before I went HTTPS, I looked at logs and established that about half their initial requests--which is to say 1/3 of all requests when you include redirects--were for /directory without trailing slash.

Second-most common is bingbot, with about 1/6 of the requests (i.e. about 2/3 of what’s left when you exclude Applebot). This is the same bingbot that invests such a huge part of its crawl budget, year in and year out, on pages that have returned a 410 since 2013 and cannot possibly be linked from anywhere. There is also a scattering of slashless requests from Googlebot. I think this is more about spot-checking on their part: every few days, let’s make sure those index redirects are happening the way they’re supposed to. And a few from DotBot, about which more elsewhere.

Now, that adds up to a fair number of chained redirects that arise purely because someone requested an URL they had no business requesting. I finally got fed up and made a supplementary rule, immediately before the canonicalization redirect:
RewriteRule ^ebooks/(\w+)$ https://example.com/ebooks/$1/ [R=301,L]
In other words, a RewriteRule that does exactly what mod_dir already does, except that it doesn’t bother to check whether the directory actually exists, and it canonicalizes at the same time. I decided that’s a fair compromise. Requests involving other directories are too infrequent to be worth the trouble.

In Detail

From mid-November to mid-December here’s what we see.

bingbot: 24% of all redirects. Past experience says that this percentage will creep higher and higher over the years, as bing continues asking for URLs that everyone else has long since back-burnered. In December, their requests overall run about 3:1 HTTPS:HTTP.

Googlebot: 20% of all redirects. Shortly after they discovered that HTTPS was available, they did a full spidering of the whole site, exactly as if it were a brand-new site. (I have noticed this before.) But this hasn’t stopped them from continuing to request HTTP pages. Currently requests run about 2:1 HTTPS:HTTP. Oddly, this is lower than November, when they showed about 3:1, like bing.

DotBot: 14% of all redirects. (What the heck is the DotBot, anyway? Something to do with Mozilla, I think.) Unlike other robots, their overall request pattern is still overwhelmingly HTTP over HTTPS; they request all kinds of things by HTTP, while their few HTTPS requests are strictly for pages. In fact, they are almost the only robot I know of that requests pages at HTTP that did not exist until after the site went HTTPS. (Another holdout is trendictionbot, which uses HTTP for everything on its established shopping list, even while new requests are HTTPS.)

Applebot: 12% of all redirects. It would be a lot lower if they didn’t persist in making those bogus /directory-without-slash requests. But their HTTPS requests do at least outnumber HTTP requests, though only by about 4:3.

BLEXBot: 6% of all redirects. They really seem to have got the HTTPS message; for December it runs about 7:2 HTTPS:HTTP.

MJ12bot: 5% of all redirects. In December, HTTPS requests are slightly ahead of HTTP.

Blackboard Safeassign: 4% of all redirects. This is another “white noise” heading, though. Almost all their requests were for the contents of one directory, plus a single page from a second directory, requested in vast numbers on a single date. And the whole thing looks like a misguided effort, because I don’t remember any significant human requests for those directories.
Editorial: I honestly don’t believe a computer can distinguish between plagiarism and legitimate text-matching, as when you’re quoting from a book that you are supposed to have read. I can only hope that any flags sent up by a plagiarism-checking utility are followed up by individual human investigation on the part of the teacher.

SeznamBot: 3% of all redirects.

AhrefsBot: 3% of all redirects.

CCBot: 2% of all redirects.

Others:
-- The Knowledge AI is noteworthy because it doesn’t do HTTPS at all. Ever. It dutifully picks up redirects, but as far as I know it has never made an HTTPS request.
-- Yandex, on the other hand, likes HTTPS. I noticed on an earlier site that once it went HTTPS, all subsequent Yandex requests were consistently HTTPS, even for URLs that no longer existed on the site and had therefore never been HTTPS. In other words, the exact opposite of DotBot, which requests pages at HTTP that have only ever existed as HTTPS.

iamlost

3:24 am on Dec 17, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



DotBot is courtesy of Moz, née SEOMoz (not Moz née Mozilla).

tangor

10:16 am on Dec 17, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@lucy24 ... I always look forward to your log analysis and efforts at bot management. There's always something new in your reporting!

Thanks!

SumGuy

3:15 am on Jan 8, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



I am still operating http and https independently (with identical content). I see google crawling and referring to both of them (I haven't tried to determine the ratio). I think google and bing crawl the https site less intensively than the http site (based on bringing the logs up in a text editor from time to time). Now that I think about it, applebot may not be hitting my https site (need to double check that). There are a couple of folder paths that only contain PDF files that I am redirecting from the http site to the https site.

I don't think I've ever seen dotbot. Or blexbot. What IPs do they operate from?

Why do you allow MJ12bot? I block any /16 I see them come from. I see them very, very rarely now. What good is it? My impression is that it's CIA or DOD (or, more likely, MI5 or 6?). Majestic.

SeznamBot: 5 or so years ago, when I started looking intensively at my web logs, I was seeing them. And allowing them. Then I got disillusioned with them and began to block them (IP block). What good are they?

AhrefsBot: I have no use for them. IP blocked.

CCBot: I don't see them, so must have blocked them.

Yandex and the big Chinese bot (name escapes me): I block them both. I have no need for them. I think I get the odd referral from them though.

dstiles

11:23 am on Jan 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yandex is a popular search engine outside of Russia. It's worth allowing. It has a presence in USA.

I think the MJ12 bot is a distributed one - it used to be, certainly. It runs from volunteer broadband users. As such it should not be blocked per /16 as you risk blocking genuine visitors. MJ12 is a good bot (though I ban it as useless to me) and obeys robots.txt directives, which is the best way to block it.

Lucy: re: bingbot, I've noticed recently that it makes requests on the IP with no or incorrect domain name, logged as "No hostname was provided via SNI for a name based virtual host".

Since I do not allow this, I regularly return 403 for those specific hits.

not2easy

1:36 pm on Jan 8, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Why do you allow MJ12bot ? I block any /16 I see them come from.

I do not block the MJ12 bot by IP or CIDR for the same reason dstiles mentions - it is a distributed bot. It can come from anywhere, even residential ISP IPs. Block the UA if you don't want it crawling. Before blocking any unknown bot you can search and find info regarding their purpose and compliance habits in these threads.

SumGuy

2:23 pm on Jan 8, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yes, I know MJ12bot operates many times from residential ISPs, but I still don't know who is behind it or why an average joe would (knowingly) operate it on his home computer. If some boob is running it at home and gets a different IP from time to time, it may just be another one in the same /16, so that's easy to block that way. Majestic has something to do with gov intel or DOD, so why should I let it in the front door? Who will NOT stumble across or find my website in a search because I block it? I can tell you its distribution is limited, the IP set it comes from is relatively small (in terms of IPv4), and I don't lose sleep over it. The stupid bot that is only after PDF files, with a FF user-agent with /abcd at the end, is becoming more problematic. Many times it comes from UK residential/consumer IPs, but I sometimes see it from Germany and US residential IPs.

[edited by: SumGuy at 2:28 pm (utc) on Jan 8, 2020]

SumGuy

2:25 pm on Jan 8, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



While on the topic of http vs https, what is the current state of browsers in terms of throwing up warnings to users when they try to bring up a non-redirecting http site? What clues would I look for in the logs that would tell me a user was blocked or diverted away from browsing my site (either by choice or by browser-configuration)?

lucy24

7:54 pm on Jan 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



re: bingbot, I've noticed recently that it makes requests on the IP with no or incorrect domain name, logged as "No hostname was provided via SNI for a name based virtual host".
You’ve got your own server, haven’t you? On shared hosting involving the VirtualHost envelope, requests without a hostname never reach my site, so they’re not visible to me in logs. This applies even if the site has a unique IP address, which not all sites do.

I’m generally pretty lenient with robots, so long as they appear to behave themselves. I flatly bar anyone who admits to being from China, but why shouldn’t I admit people from Russia/Turkey (Yandex) or Czech-thingy (Seznam)?

In particular, I tend to allow robots associated with some service, even if I don’t personally use the service, on a principle of “You scratch my back, I’ll scratch yours”. If a site doesn’t allow, say, the W3C link checker to do its thing, I end up having to check the link manually. That in turn means that instead of a quick HEAD request for the HTML alone, the site gets a full request for the page with all its supporting files, which benefits nobody. I'm not going to spend time on the page and look at its ads all over again; I just need to make sure the URL is still valid.

what is the current state of browsers in terms of throwing up warnings to users when they try to bring up a non-redirecting http site? What clues would I look for in the logs that would tell me a user was blocked or diverted away from browsing my site (either by choice or by browser-configuration)?
All I can say is that so far, Firefox doesn't yap about sites being http as such, though they will certainly kick up a fuss if you try to use https on a site that does not happen to have a certificate it recognizes, and will put up a warning if you enter login information on an http connection. (This, in fact, was why I first moved my personal site to https a couple of years ago. There’s no sensitive content--but it happens to be where I keep my piwik/matomo files, which of course involve a login.)

I seriously doubt that any current browser would refuse to send in an http request. Maybe in ten years’ time. Not long ago I finally looked up that “Upgrade-Insecure-Requests” header that has been visible in many human requests for several years now. I’d never known what it means, except--informally--that the visitor is probably human. Turns out its purpose is to tell the site that it’s OK to redirect the visitor to HTTPS. This is already a bit pointless, and will become increasingly pointless over the years, since most sites will redirect to https whether or not the visitor has given permission. I don’t know how old a human browser would have to be before it becomes unable to use HTTPS. More likely, it would be an inability to keep its certificate file up to date, rather than HTTPS connection as such. That was one of the reasons I finally had to stop using Camino.
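
Purely as illustration, a site that wanted to honor that header instead of redirecting everyone could do something roughly like this in Apache (a sketch, not my setup, since I redirect unconditionally; example.com is a placeholder):
# redirect to https only when the browser has said, via the header, that it
# wants to be upgraded; requires mod_rewrite, and mod_headers for the Vary
RewriteCond %{HTTPS} off
RewriteCond %{HTTP:Upgrade-Insecure-Requests} =1
RewriteRule (.*) https://example.com/$1 [R=301,L]
Header always append Vary Upgrade-Insecure-Requests
The Vary header is there so caches don’t serve the redirect to a browser that never asked to be upgraded.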

Kendo

12:56 am on Jan 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don’t know how old a human browser would have to be before it becomes unable to use HTTPS.


SSLv2 and SSLv3 are penalised in site security ratings that I have seen, which encourages site owners to disable them at the server because they have been proven exploitable, and older browsers may not support TLS. Nor do the older browsers support some of the JavaScript nonsense that is used today, which is sad, because I don't recall those earlier browsers being so memory- and resource-hungry as browsers are today.
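
Disabling them at the server is a one-liner; in Apache, for example, roughly:
# drop the exploitable protocol versions; very old builds may also want -SSLv2
SSLProtocol all -SSLv3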

If I have Firefox, Chrome and Thunderbird open after a session and walk away from my desk for a couple of days, when I return I find that the memory resources consumed by those apps have grown exponentially. For example, Thunderbird increasing from 50 Mb to 1,000,000 Mb, and similar bloated consumption by the others... the total memory being consumed can be more than what most people have available. Current usage...
Firefox - 1.159 GB
Chrome - 0.65 GB (plus 5 more instances of Chrome processes)
Thunderbird - 0.27 GB

Makes one wonder what they get up to while I am away?

tangor

1:42 am on Jan 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's all that constant "phone home" stuff. :)

And their "privacy stuff". :(

And telemetry of all kinds (just to improve their product, of course!)

My chuckle with FF is that, on a single machine that has ONLY ONE JOB AND ONE SITE to monitor, it will end up with as many as 4 instances of FF running as processes after 3 days ... and will crash that machine if it attempts to open ONE MORE ... (been there, done that; now I close FF daily, wait for all processes to "kill", and then start it back up).

Memory usage has gone WAY down since I started doing that!

YMMV


dstiles

11:37 am on Jan 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lucy:
> You’ve got your own server, haven’t you?

Yes. I turned on SSLStrictSNIVhostCheck in ssl.conf, but the bot shouldn't even be trying it. Bing is the only one that does, as far as I can tell.
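
For anyone wanting the same behaviour, the relevant bit of ssl.conf is roughly:
# refuse name-based SSL vhost access to clients that supply no SNI hostname;
# those connections then get the 403 mentioned above
SSLStrictSNIVhostCheck on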

From the MJ12 web site:

"About MJ12Bot Majestic is a UK based specialist search engine used by hundreds of thousands of businesses in 13 languages and over 60 countries to paint a map of the Internet independent of the consumer based search engines. Web site owners can see data about their own websites on majestic.com."

Which latter sentence explains why so many people distribute the bot. As I said, I block it in robots.txt and it never visits beyond that. Blocking broadband by /16 is dumb unless it's a very badly run block.

blend27

4:45 pm on Jan 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Index Redirects: We all have an /index.html redirect in place, don’t we? I have now learned by direct observation that NOBODY requests /directory/index.html
What, never?
No, never.
What, never?
Really. Never, ever.

Years ago, after I experienced a BAD URI SEO attack on one of my main sites, I renamed the index page to something else and set it to be the default document (in IIS).

Now anything that asks for it is either a bad, bad monkey bot or someone trying to 'shkru' around looking for holes in the site. Any new site that is launched nowadays takes this into consideration, and anything that requests 'index.whatever' gets its headers logged and a blank 404 returned, without even taking PROTOCOL into consideration.
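
For the Apache folks, the same idea is roughly this (a sketch; it assumes, as here, that the real index file has been renamed, so the server's own default-document lookup never asks for index.anything):
# anything that literally asks for index.<whatever> gets a 404; real visitors
# and the renamed default document never produce that name
RewriteRule (^|/)index\.\w+$ - [R=404,L]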

blend27

1:23 pm on Feb 16, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don’t know how old a human browser would have to be before it becomes unable to use HTTPS.

Take a stock Blackberry browser or something from stock Samsung GNT51+, no sugar.

lucy24

7:55 pm on Apr 30, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Update now that it’s been six months. Listed in the same order as in the original post (1½ months after the HTTPS move):

bingbot: 25% of all redirects. This will probably hold steady for years to come. If you have an URL that was redirected in 2011, any requests for it will come from bing.

Googlebot: 9% of all redirects. In December it was 20%; Google tends to be a quicker learner than Bing.

DotBot: 34% of all redirects. This figure seesaws wildly from month to month, as they tend to do massive crawls every few months. All requests start out at HTTP, including pages that were created after the HTTPS move and therefore never existed as HTTP.

Applebot: 8% of all redirects. They continue to show a weird obsession with slashless forms
/dir/subdir
for correct
/dir/subdir/
Almost all their HTTP requests are slashless, and conversely they represent about 96% of all slashless requests. (Tangent: I recently had occasion to update my linked URL in a certain site’s profile--neither here nor Apple-related. Turns out it would not accept /dir/subdir/ but insisted on saving it as /dir/subdir without slash. This irritates me.) This behavior is limited to HTTP requests, thanks in part to the explicit redirect I described in the first post.

BLEXBot: <1% of all redirects

MJ12bot: 11% of all redirects

Blackboard Safeassign: now blocked. I generally turn a blind eye on plagiarism-checkers, but when they got to the point of requesting the same URL up to 200 times in one day--unaccompanied by even a single human visit to justify it--I threw in the towel and blocked them in early January. They’re still making requests, but are now out of sight, out of mind.

SeznamBot: <1% of all redirects

AhrefsBot: 1% of all redirects

CCBot: not seen. I initially thought they were another of those occasional robots, like DotBot, but closer inspection shows that they’re simply quicker on the uptake. In the last few months their only redirected requests have been for / root or sitemap.

The Knowledge AI: 1% of all redirects. As noted at the outset, they don’t seem to be able to do HTTPS at all, but they keep patiently collecting redirects.

Yandex: Between YandexBot and YandexMobileBot, around 1% of all redirects--but they only request the root. I think this shows intelligence on their part: they assume that if the root is redirected to HTTPS, then everything else will be too.

GarlikCrawler: This is another occasional robot. Their latest activity includes the past month, so they’re at around 4% of all redirects.

tangor

10:35 pm on Apr 30, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Appreciate the update!

Very similar results, with GarlikCrawler actually running about 7% on my itty bitty site.