
Forum Moderators: Ocean10000 & phranque

Previously functioning .htaccess now works unexpectedly

     
6:13 pm on Oct 20, 2019 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 7, 2007
posts:44
votes: 1


RewriteCond %{HTTP_HOST} ^((www\.)?(exampleA|exampleB|exampleC)|example)\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^(www\.)?example\.net [NC]
RewriteRule (.*) https://www.example.com/$1 [R=301,L]

Requests for exampleA.com etc. or example.net should all result in the page https://www.example.com/ being delivered, and that's exactly how it works in my browser.

But now the Google Search Console has started to include, in the Top linking pages report, links to https://www.example.com from
http://example.net
That will obviously result in a duplicate content problem. (Also note: no https and no www.)

Any explanation why the rewrite rule doesn't work for Googlebot anymore? How could those spurious "backlinks" from example.net be eliminated? (example.net can of course not be added as a property in GSC to enable research.)


ANOTHER QUESTION:
A request for mail.exampleA.com (etc.) will result in
https://mail.exampleA.com/ (etc.)
being delivered. I.e. no rewriting to example.com.
And a request for mail.example.net will result in
https://mail.example.net/
being delivered.

If other made-up subdomains are added to the URL, no site page is rendered. Should I add some rewrite condition for 'mail' too, or can I ignore those probes?
9:40 pm on Oct 20, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


Any explanation why the rewrite rule doesn't work for Googlebot anymore?


No. If it works for you then it should work for Googlebot. This is of course assuming you have no spurious conditions/rules before this that might treat Googlebot differently?

Is it possible that your redirection somehow "stopped working" for a period of time?

Do you have a separate HTTP to HTTPS redirect?

Are you using absolute URLs to "https://www.example.com" for your internal linking?

(example.net can of course not be added as a property in GSC to enable research.)


Why not?

That will obviously result in a duplicate content problem. (Also note: no https and no www.)


Although unlikely to be a "problem" since all links appear to point to "www.example.com". And if the redirect is in place, then there is no duplicate content (at least not anymore).

Rather curious how Googlebot found example.net in the first place? Was this previously an active domain?

I assume all variations (HTTP vs HTTPS, www vs non-www) of example.net all point here?

If you are pointing ALL other hostnames to "www.example.com" then your redirect could be greatly simplified (less chance for error). Instead of matching every other hostname, you can simply check to see if it is NOT "www.example.com" (exactly). For example:


RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule (.*) https://www.example.com/$1 [R=301,L]


This will naturally catch any other spurious subdomains as well (do you have wildcard/multiple subdomains configured or something? Or do these not even resolve?). However, depending on your server config, the "mail" subdomain might be "special" and out of your control.

Clear your browser cache - to make sure you aren't seeing a cached redirect and the redirects really have stopped working?!
10:26 pm on Oct 20, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


You don't need the ^(www\.)? business, unless you have subdomains that need to be exempted from the rule. In fact, you don't need any of the list (as I understand it: exampleA.com, exampleB.com, exampleC.com, example.com if without www, example.net); all you need is phranque's one-line version. It means: “any and all hostnames other than my single preferred option, www.example.com”.

But now the Google Search Console has started to include, in the Top linking pages report
Are they saying anything about “via this intermediate link”? If so, you can safely ignore the whole thing.

Have you checked the logs for example.net and your other unwanted domains to verify that any requests from Googlebot are getting the desired 301 response?

It sounds counterintuitive, but you do need to put everything into GSC just to tell them which ones not to use. On a typical https site, that means you will have GSC records for
http://example.com
http://www.example.com
https://example.com
https://www.example.com
even though three of them redirect to the fourth.
6:58 am on Oct 21, 2019 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 7, 2007
posts:44
votes: 1


In order to assist other readers of this thread I will try to reply to all the questions.
PENDERS:
This is of course assuming you have no spurious conditions/rules before this that might treat Googlebot differently?

No. The previous rule, which should not be of relevance here is:
RewriteCond %{REQUEST_URI} ^([^.]+\.html)
RewriteRule \.html. https://www.example.com/$1 [R=301,L]
Is it possible that your redirection somehow "stopped working" for a period of time?

Really don't think so.
Do you have a separate HTTP to HTTPS redirect?

Yes, my last rule is:
RewriteCond %{HTTPS} off
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
Hmm... I think I don't need that redirect anymore after penders' simplified code.
Are you using absolute URLs to "https://www.example.com" for your internal linking?

Very rarely. I use relative internal links. The domain example.net is nowhere mentioned in the pages' text on the site.
(example.net can of course not be added as a property in GSC to enable research.) Why not?

A site example.net doesn't exist. It's just a parked domain. But, admittedly, I haven't tried to add it as a property of mine.
Was this previously an active domain?

Never has been. example.net has always been just parked.
I assume all variations (HTTP vs HTTPS, www vs non-www) of example.net all point here?

Yes, to https://www.example.com/
your redirect could be greatly simplified

Changed my redirect (as in the OP) to this simplified one, and it works great. Thank you very much. I use no subdomains, and the only one that resolved was 'mail', but that is no longer the case. 'mail' may well be out of my control, because I'm on shared hosting. Cache cleared.

LUCY24:
Are they saying anything about “via this intermediate link”?

Unfortunately not, so I'm worried and confused. I hope the simplified redirect took care of this problem. Those unexplained backlinks from example.net started to appear in GSC about a month ago, and their number has been slowly increasing, which makes me really anxious. Even if I were able to include it as a property in GSC, I would rather not disavow my own (parked) domain... I guess the only thing I can do now is hope that the backlinks from example.net will slowly decrease.
Have you checked the logs for example.net

Regrettably my raw server logs don't include the domain name requested, only the rest of the URL.
you do need to put everything into GSC

Yes, all those four versions of example.com are included as my properties.
10:57 am on Oct 21, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


The previous rule, which should not be of relevance here is:

RewriteCond %{REQUEST_URI} ^([^.]+\.html)
RewriteRule \.html. https://www.example.com/$1 [R=301,L]



Aside: What do you think that rule is doing? Unless something has been missed from the formatting/exemplification, it doesn't look like it's doing what it should be doing?

hmm... I think I don't need that [HTTP to HTTPS] redirect anymore after penders' simplified code.


Yes, you do. The "simplified code above" does not redirect http://www.example.com/ (HTTP version of your canonical hostname).

Very rarely. I use relative internal links.


Ok, this makes it more puzzling as to how Google found these backlinks.... as the backlinks don't appear to exist (even if example.net was crawled)!

For Googlebot to see a backlink from http://example.net to https://www.example.com, it must have been able to crawl http://example.net (if the redirect failed for whatever reason) and seen links of the form https://www.example.com/foo. If you are using relative links then internal links point to example.net - there is no backlink. (?)

Unless... you are using a BASE tag that references www.example.com?


> (example.net can of course not be added as a property in GSC to enable research.) Why not?

A site example.net doesn't exist. It's just a parked domain. But, admittedly, I haven't tried to add it as a property of mine.


No reason why it can't be added as a property in GSC. In the same way you've added "example.com" (no www).

> Was this previously an active domain?

Never has been. example.net has always been just parked.


If you do a site:example.net search in Google, do you get any results?

In the GSC linked pages report you should be able to drill down to see exactly where the page is being linked from.

Do you have a rel="canonical" tag set on your pages?

Presumably you have no other directives in your .htaccess file (other than the 3 rules mentioned)? Do you have any other .htaccess files in subdirectories?

I would try accessing your site using a Googlebot user-agent, just to make sure you are getting the expected redirect response.
2:09 pm on Oct 21, 2019 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 7, 2007
posts:44
votes: 1


it doesn't look like it's doing what it should be doing?

So I'll remove it. It was added years ago at a time when it was useful.
For Googlebot to see a backlink from http://example.net to https://www.example.com, it must have been able to crawl http://example.net (if the redirect failed for whatever reason) and seen links

Interesting. If the redirect had failed, it wouldn't have taken Googlebot long to crawl 100 pages. But where would Google have seen example.net mentioned, so that they knew of its existence? Never on my site. In WHOIS, of course. On some pages I do have a footer in the style "The address of this page is example.com/foo", and it is hyperlinked, but I didn't consider that internal absolute linking. There are no BASE tags.
If you do a site:example.net search in Google, do you get any results? In the GSC linked pages report you should be able to drill down to see exactly where the page is being linked from. Do you have a rel="canonical" tag set on your pages?

That site: search gives about the same number of pages as GSC mentions in Top linking sites. What I can see in GSC's report is the so-called linking page on the site http://example.net. (It's always http, not https, and no www.) But apart from that footer URL in some cases, there is no absolute link to example.com on those pages. As an experiment I once made an m folder. There are just four files in it, and they have the rel="canonical" tag.
Presumably you have no other directives in your .htaccess file

There are ordinary blocks by IP, referrer and user agent, as well as hotlinking prevention with an exception for Google and Bing. There is no other .htaccess file than the one in the root.
8:59 pm on Oct 21, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


Aside: What do you think that rule is doing?
As written, the effect of the rule is to strip away any and all garbage that might happen to occur after “html”. It's one of a surprising number of rules that most sites will never need, but might become necessary if you detect extraneous stuff entering your .html URLs. (This does not apply to spurious query strings, only to the URL-path itself.) If you never actually see requests in this unwanted form, the rule can safely be deleted, since it's just that extra nano-erg of work for the server.

In order for requests for example.net to be redirected, they have to reach the same htaccess as example.com. Is this a hosting thing, where they charge less for a “parked” domain? Otherwise it should make no difference. In any case it can't hurt to do some random testing: request a few of your normal example.com URLs, only attached to example.net instead. Make sure they get redirected in a single step to https://www.example.com/rest-of-URL.
12:19 am on Oct 22, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


As written, the effect of the rule is to strip away any and all garbage that might happen to occur after “html”.


That is probably the intention; however, "as written" it will simply redirect to the root, since the wrong backreference has been used in the substitution (it should be %1, not $1). (Aside: unless path-info has been explicitly enabled, the default handler for text/html files should already reject path-info.)
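A corrected sketch (note that %1, captured from REQUEST_URI, already includes the leading slash, so the extra slash is dropped from the substitution):

RewriteCond %{REQUEST_URI} ^([^.]+\.html)
RewriteRule \.html. https://www.example.com%1 [R=301,L]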
1:40 am on Oct 22, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


the wrong backreference has been used in the substitution
Whoops! So it has.

Unless path-info has been explicitly enabled, then the default handler for text/html files should already reject path-info.
In that case I guess the idea is to avoid Duplicate Content--same as stripping the query string from an html request--since the correct page will otherwise be served at multiple URLs. But, unlike standard variations such as
/directory/
vs.
/directory
“.html-plus-garbage” isn't likely to occur. In particular, it won't be requested by search engines as a part of their routine entrapment operations. I would have also said /directory/index.html, but this no longer seems to be a standard request*--if in fact it ever was. They only check for with/without final directory slash.


* I searched logs for recent requests ending in /index.html, and found absolutely nothing, zero, except from my own IP when I use Fetch's Web View option (which starts by clicking on a physical file that may happen to be named index.html).
5:29 am on Oct 22, 2019 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 7, 2007
posts:44
votes: 1


Seems like it's not possible to find out why and how Google has indexed pages from example.net. But is there something I can do to diminish the presumed damage (duplicate content risk)? I understand it's hard to get G to deindex pages it has once found, but would it be a good idea to return 403, 404 or 410 for all requests for example.net pages? As no human user knows about example.net, requests don't have to result in any HTML page being served (unless there are some registry requirements). What about putting the following rule above penders'? Are the [NC] and [L] flags redundant?
RewriteCond %{HTTP_HOST} ^(www\.)?example\.net [NC] 
RewriteRule (.*) - [R=403 404 or 410,L]
7:19 am on Oct 22, 2019 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 7, 2007
posts:44
votes: 1


@lucy24
Make sure they get redirected in a single step to https://www.example.com/rest-of-URL.

Yes, they do.
4:34 pm on Oct 22, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


Are the [NC] and [L] flags redundant?
The [NC] flag creates a tiny bit of extra work for the server, since it has to start by making the test string and the pattern both lower-case before comparing them against each other. So it should only be used when there is a genuine possibility that a variably cased request might come in. (One can argue about whether this applies to hostnames. Human browsers tend to flatten the casing, regardless of what the user typed or clicked; robots that request EXAMPLE.com are going to be blocked anyway.)

The [L] flag is needed with any rule creating a redirect (301 or the default 302). It is not needed with 400- and 500-class responses (including the shortcuts [G] and [F]) because those carry a built-in implied [L].

What about putting the following rule above
If your hostname canonicalization redirect uses a negative condition (the one that goes !^www\.example\.com$) then you don't need an extra rule for example.net because it has already been covered. And, once again, you do not need the ^(www\.)? part for a domain that doesn't exist. Just example\.net without anchors. Except that, again, you don't need this rule at all.
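For illustration only, such a rule would then reduce to something like the following, using the [G] (410 Gone) shortcut mentioned above - though, again, it isn't needed if the negative-condition redirect is in place:

RewriteCond %{HTTP_HOST} example\.net
RewriteRule ^ - [G]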
5:39 pm on Oct 22, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


Seems like it's not possible to find out why and how Google has indexed pages from example.net.


Google will know about example.net, since Google is a domain registrar and they make it their business to know about newly registered domains. However, for Google to have "indexed" seemingly every page then they must have crawled the site. And/Or the site is being linked to from somewhere.

In the Google SERPs do you see a "description" for each search result?

On some pages I do have a footer in the style "The address of this page is example.com/foo" and it is hyperlinked, but I didn't consider that as internal absolute linking.


Whatever is the value of the HREF attribute is what determines whether it is an "absolute URL" or not.

But is there something I can do to diminish the presumed damage


Make sure the redirect is working properly.

Verify the property in GSC... Check backlinks... Use the Fetch as Google / URL Inspection tool to check what Google is seeing.

Please post your entire .htaccess file - including your "hotlink protection" directives (with the exceptions for Googlebot etc.)
7:18 pm on Oct 22, 2019 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 7, 2007
posts:44
votes: 1


Thank you, lucy24, for explaining it so clearly!
I would like to combine the two aforementioned rules into one but feel I need affirmation.
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule (.*) https://www.example.com/$1 [R=301,L]
RewriteCond %{HTTPS} off
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule (.*) https://www.example.com%{REQUEST_URI} [R=301,L]

However, for Google to have "indexed" seemingly every page ... In the Google SERPs do you see a "description" for each search result?

Google has only indexed (i.e. under example.net) about 5 per cent of all the pages on example.com, but the number has been slowly increasing. It all started only about one month ago. If I do a site:example.net search I get perfectly normal SERPs with the usual descriptions - but http and no www, as said. But if I then search for a unique string from a description on the example.net SERP, I will only get one result: the correct page on example.com.
Use the Fetch as Google / URL Inspection tool

Can't use the Inspect any URL tool in GSC for example.net addresses. The expected return is "URL not in property", but the corresponding page from example.com does return an OK. I'm not willing to register example.net as my property in GSC. I would rather G not know anything about example.net. I'm so fed up with this unnecessary mess that I would let the domain registration lapse, if it wouldn't be picked up by someone.
A working hotlinking protection has been in my .htaccess for many many years without any trouble. The http and Bing versions are on separate rows.
RewriteCond %{HTTP_REFERER} !^https://(www\.)?google\.
RewriteRule \.(jpeg?|jpg|gif)$ https://www.example.com/foo.png [NC,R,L]
8:56 pm on Oct 22, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


I would like to combine two aforenamed rules into one but feel I need affirmation.


The net result of those two rule blocks is the same. If you are looking at optimising then change the "(.*)" regex (in the 2nd / combined) rule to something like "^" instead. (No need for a capturing group and no need to traverse the URL-path since you are using the REQUEST_URI server variable in the substitution.)
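That is, the combined rule would become something like:

RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]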

...but the number has been slowly increasing.


hmm... that would seem to suggest that something is still not right?

Note that a "site:" search can return results that are otherwise redirected (and would not ordinarily be returned in organic search results) - so the results of a "site:" search alone are not necessarily conclusive.

...I will only get one result: the correct page on example.com.


"Duplicate content" is not (yet) an issue then. (It may never be a real issue.)

I'm not willing to register example.net as my property in GSC. I would rather G not know anything about example.net.


Google already knows about "example.net". GSC isn't specifically a tool for getting a site indexed - it is a debugging tool to help resolve issues such as this. As lucy24 stated above, "you do need to put everything into GSC just to tell them which ones not to use". It can help you spot issues before they become a problem.

A working hotlinking protection has been in my .htaccess for many many years without any trouble. The http and Bing versions are on separate rows.

RewriteCond %{HTTP_REFERER} !^https://(www\.)?google\.
RewriteRule \.(jpeg?|jpg|gif)$ https://www.example.com/foo.png [NC,R,L]



I can only assume this has been taken out of context (missing conditions)? Since, as written, this will effectively block all your site visitors from viewing your (locally served) images. It will also block Googlebot.

The reason for seeing the hotlink protection and the rest of your .htaccess file was to check if there could be something that is preventing Googlebot from seeing the redirect. mod_rewrite directives don't just work in isolation - they can chain together / conflict - the order is important.
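For illustration only (a hypothetical ordering, not your actual file): a rule with [L] that matches a request stops processing at that point, so anything placed before the canonical redirect can prevent a matching request from ever reaching it.

# hypothetical: hotlink rule placed first, with its referer conditions missing
RewriteRule \.(jpeg?|jpg|gif)$ https://www.example.com/foo.png [NC,R,L]
# a matching image request is redirected to foo.png above and, because of [L],
# never reaches the canonical-host redirect below
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule (.*) https://www.example.com/$1 [R=301,L]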

[edited by: penders at 9:10 pm (utc) on Oct 22, 2019]

8:57 pm on Oct 22, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


Edit: Timestamp will tell you that penders and I overlapped. Here's hoping we do not wildly contradict each other.

I would like to combine two aforenamed rules into one but feel I need affirmation.
Yes, that looks right ... maybe. YMMV, but I've never been able to get my server to recognize conditions in = followed by a literal string (no anchors, no escaping and so on). I'd stick with the RegEx version, in which the pattern is
!www\.example\.com$
with closing anchor, no equals sign. The purpose of the closing anchor is to handle requests that come in with a port number like :8080. In htaccess in shared hosting this may be a non-issue, but it definitely does no harm.

Since your preferred form is www, an opening anchor isn't strictly necessary, though it may make the rule run a little more efficiently. (This is a general principle when the stuff-to-be-matched happens to come at the very beginning of the test string.)

In the RegEx form (no = sign), literal . periods should be \. escaped. But in this specific situation it would be a non-lethal error, since nothing but a . could occur in those positions.
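Put together, the regex form of the redirect would then be something like:

RewriteCond %{HTTP_HOST} !www\.example\.com$
RewriteRule (.*) https://www.example.com/$1 [R=301,L]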
1:20 pm on Oct 23, 2019 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 7, 2007
posts:44
votes: 1


I can only assume this has been taken out of context (missing conditions)?

Yes, sorry:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https://(www\.)?example\.com(/)?.*$
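Together with the Google line quoted earlier (and with the separate http and Bing condition lines still omitted, as above), the block therefore reads roughly:

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https://(www\.)?example\.com(/)?.*$
RewriteCond %{HTTP_REFERER} !^https://(www\.)?google\.
RewriteRule \.(jpeg?|jpg|gif)$ https://www.example.com/foo.png [NC,R,L]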
hmm... that would seem to suggest that something is still not right?

Perhaps not. For me, the backlink numbers in the GSC Top linking sites report get updated slowly, at most about once a week. So even if the problem has now been rectified, I expect those undesired backlinks from example.net to decrease only slowly.
There is a positive change. But let me summarise what has happened.

1. For a long time I had been noticing an irritating referrer in the referring-site report of my analytics programme: mail.example.net. It occurred irregularly, every two months or so. The raw server logs did not reveal anything alarming. It looked like it came from unsuccessful script kiddies, and they abound. My server logs do not show the requested domain, only the rest of the URL.

2. While investigating, I noticed that requesting mail.example.net resulted in https://mail.example.net being displayed in the browser's address bar and my ordinary index page's contents at www.example.com being rendered, but without images (i.e. no hotlinking). I use no subdomains. However, requests for www.example.net were correctly redirected to https://www.example.com.

3. In September I noted, in GSC, example.net as a referring site with a slowly increasing number of backlinks to example.com. Now this required action.

4. After implementing penders' so-called simplified rewrite condition/rule, a request for mail.example.net now results in https://www.example.com, exactly as it should. So let us see if my main problem has thus been solved.

It remains a mystery how Google had found out about the parked example.net. Via the registry and then somehow succeeded in crawling "pages"? And why did my previous .htaccess not take care of the subdomain mail? Maybe something is wrong with the first two condition lines in the code at the very top of this thread.
8:53 pm on Oct 23, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11870
votes: 244


It remains a mystery how Google had found out about the parked example.net. Via the registry and then somehow succeeded in crawling "pages"?

i would check your dns configuration for example.net.
most likely it is pointing to the same web server as example.com.
then if you look at the web server configuration you will probably find a wildcard subdomain configuration in place for example.net.

And why did my previous .htaccess not take care of the subdomain mail? Maybe something is wrong with the first two condition lines in the code at the very top of this thread.

i.e.:
RewriteCond %{HTTP_HOST} ^((www\.)?(exampleA|exampleB|exampleC)|example)\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^(www\.)?example\.net [NC]
RewriteRule (.*) https://www.example.com/$1 [R=301,L]

it didn't take care of the mail subdomain because you are trying to list all the bad examples that should get redirected.
instead you should whitelist the one canonical hostname and redirect all others:
RewriteCond %{HTTPS} !=on [OR]
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule (.*) https://www.example.com/$1 [R=301,L]
10:19 pm on Oct 23, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


most likely it is pointing to the same web server as example.com.
Well, if it isn't, there would be no point in having that set of canonicalization redirects in place :)

Edit: Would mail. subdomain even need to be mentioned? htaccess is about http requests, and I wouldn't expect any for mail.example.com unless someone is deliberately typing them in.