Soft 404 302 in GSC after HTTPS transition

Iskandrian

3:53 pm on Mar 17, 2016 (gmt 0)

10+ Year Member



I recently transitioned my self-hosted WordPress blog from http to https and everything seems to have gone well with one exception. GSC is now producing dozens of Soft 404 302 Found redirects for pages which are otherwise good, visible 200 pages to browser and header checker alike. I'll be happy to reproduce that header and any other useful data here as requested.

Basically:

- all internal WordPress settings are correct, WordPress address and Site address set to https
- all internal database references are https
- permalinks are pretty day and name
- .htaccess WordPress canonical block is correct (as far as I can tell)
- all external redirects from http to https work perfectly
- no redirects by me of missing or other pages to home page
- all accesses other than Fetch as Google (including Googlebot itself, in logs, for its own Soft 404-flagged posts) return good URLs with 200 headers

Nevertheless, GSC keeps producing selective Soft 404s with a 302 Found pointing back to Location: / using an ugly permalink and referencing wp-json and the WordPress API

I realize there are far too many variables here to expect any direct solution, but I'm interested if anyone has experienced anything similar and/or can suggest a logical place to look for the cause of this problem. Frankly, I've tried everything I can find out about, to no avail. Again, I'll post any header or other data which might be useful.

TIA


[edited by: Robert_Charlton at 7:45 pm (utc) on Mar 17, 2016]

Andy Langton

9:30 pm on Mar 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can you give an exemplified version (e.g. example.com/wp/slug) of the requests resulting in a 404? I'm unclear on this comment:

GSC keeps producing selective Soft 404s with a 302 Found pointing back to Location: / using an ugly permalink and referencing wp-json and the WordPress API

Iskandrian

9:57 pm on Mar 17, 2016 (gmt 0)

10+ Year Member



Sure. Here's the URL Googlebot tried to crawl:

https://www.example.com/2015/03/04/pollo-estofado/

and here's the 302 redirect header it returned, resulting in a Soft 404 error:

HTTP/1.1 302 Found
Date: Thu, 17 Mar 2016 04:29:55 GMT
Server: Apache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
Link: <https://www.example.com/wp-json/>; rel="https://api.w.org/", <https://www.example.com/?p=82856>; rel=shortlink
X-Frame-Options: SAMEORIGIN
Location: /
Cache-Control: max-age=1, private, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
Content-Length: 761
Keep-Alive: timeout=2, max=99
Connection: Keep-Alive

Here's the same URL run through SEOBook's server header checker just now, using its Googlebot UA:

HTTP/1.1 200 OK
Date: Thu, 17 Mar 2016 21:51:38 GMT
Server: Apache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: max-age=1, private, must-revalidate
Pragma: no-cache
Link: <https://www.example.com/wp-json/>; rel="https://api.w.org/", <https://www.example.com/?p=82856>; rel=shortlink
X-Frame-Options: SAMEORIGIN
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
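
For anyone who wants to compare header dumps like the two above mechanically rather than by eye, here is a minimal Python sketch. It assumes the dump is pasted in as plain text; the function names `parse_response`, `is_root` and `redirects_to_root` are made up for illustration and are not part of any tool mentioned in this thread.

```python
from urllib.parse import urlsplit


def parse_response(raw):
    """Parse a raw HTTP response head into (status_code, headers).

    headers maps lower-cased names to a list of values, because a name
    can repeat (Cache-Control appears twice in the 302 dump).
    """
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    status_code = int(lines[0].split()[1])  # "HTTP/1.1 302 Found" -> 302
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split at the first colon only
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return status_code, headers


def is_root(location):
    """True if a Location value points at the site root ("/" or a bare host)."""
    parts = urlsplit(location)
    return parts.path in ("", "/") and not parts.query


def redirects_to_root(raw):
    """True if the response is a 3xx whose Location is the site root."""
    status_code, headers = parse_response(raw)
    return 300 <= status_code < 400 and any(
        is_root(loc) for loc in headers.get("location", [])
    )
```

Fed the first dump, `redirects_to_root` reports a redirect to the root; fed the second, it does not. That pattern (a temporary redirect of a content URL to the home page) is the kind of response that tends to get reported as a Soft 404.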


[edited by: Robert_Charlton at 11:12 pm (utc) on Mar 17, 2016]
[edit reason] exemplified domain per forum Charter [/edit]

Iskandrian

12:45 am on Mar 18, 2016 (gmt 0)

10+ Year Member



Thanks, Robert. I wasn't quite sure what exemplified was meant to refer to when I posted this and, having read the charter now, I was about to delete the whole thing. Going forward, I believe I can describe phenomena as conceptually as needed to avoid unwanted detail.

aakk9999

3:41 pm on Mar 21, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




HTTP/1.1 302 Found
(etc...)
Location: /
(etc...)
So you get the above response if you perform "fetch as googlebot" from GSC?

If this is what Googlebot is getting, then it is no wonder that it treats the page as a Soft 404, because the two lines I have highlighted above (the status line and the Location header), which are the response Googlebot is getting, mean:
"This page has temporarily moved to the domain root"

Regarding the response to googlebot - just to clarify - it is not Googlebot that returns the response, it is YOUR SERVER that returns the response. So I would look at your .htaccess file to see if there is something that is specifically done based on the user agent being Googlebot or based on Googlebot IP range.

There is also a possibility that there has been code injection in your WordPress install that does the same (responds with a 302 to Googlebot based on either User Agent or IP range).

You may also want to retry the request after changing the User Agent to Googlebot and see what result you get (search for "change user agent", which will give you a number of ways to do this). If you do not get this redirect, then it is still possible that the change is based on Googlebot's IP.
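
As one way of doing the User Agent test described above without a browser extension, here is a rough Python sketch using only the standard library. The helper names `request_parts` and `probe` are hypothetical, and the browser UA string is just an example value. `http.client` is used because it returns the first response as-is and does not follow redirects, so a 302 stays visible.

```python
import http.client
from urllib.parse import urlsplit

# Googlebot's published desktop UA string; the browser one is only an example.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"


def request_parts(url):
    """Split a URL into (scheme, host, path-with-query) for http.client."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.scheme, parts.netloc, path


def probe(url, user_agent, timeout=10):
    """Send one GET with the given User-Agent; return (status, Location header)."""
    scheme, host, path = request_parts(url)
    conn_cls = (
        http.client.HTTPSConnection if scheme == "https" else http.client.HTTPConnection
    )
    conn = conn_cls(host, timeout=timeout)
    try:
        conn.request("GET", path, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `probe(url, GOOGLEBOT_UA)` and `probe(url, BROWSER_UA)` for a flagged URL and comparing the two results covers the User Agent case only. Identical results say nothing about behaviour keyed on Googlebot's IP range, since the request still comes from your own address.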

Iskandrian

7:15 pm on Mar 21, 2016 (gmt 0)

10+ Year Member



Thank you very much for your reply. Unfortunately, I have already tried what you recommend (e.g., removed fake Googlebot-trapping and other code from .htaccess, run header checks with various UAs, etc.) and the failures of those efforts were what brought me here seeking further insight. For example, several of the commonly recommended Apache rewrite rules to redirect from http to https including ones offered by my own host - and different from the one NetMeg is offering on a neighboring thread, btw - either include at the end only [R, L] or nothing at all rather than the specific [R=301, L], thus triggering a 302 redirect instead. But I have checked and have no such code in .htaccess either. Nor do I have any of what is still being commonly recommended for both true 404s and WordPress comment-page-1 failures (supposedly under repair in the WP garage) alike, i.e., explicit redirects to the home page in .htaccess.

My host at my panel level offers the choice (which I arbitrarily elected) to direct to www.example.com rather than just example.com or both. Is it conceivable that this choice made there in the host panel rather than in .htaccess could be triggering this random redirect? I don't understand how myself.

Most curious to me is that this dog only barks relatively infrequently in the night. That is, although I am repeatedly getting dozens of these Googlebot-only Soft 404 302 Found redirects to my home page in GSC, those are out of hundreds and hundreds of successful 200 crawl requests in my logs, that is, whatever is triggering Googlebot's response is not at all consistent and is also relatively infrequent. Thus, if the culprit is some script or code element, it is one which operates inconstantly. Nor can I match it to anything obvious among my posts (this is a WordPress blog).
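
Since the question above is how often Googlebot sees something other than a 200 for a given post, the access log can be tallied per URL. The following Python sketch assumes Apache's "combined" log format, which may not match every host's configuration; the function names are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_by_path(lines):
    """Tally response statuses per path for requests whose UA mentions Googlebot."""
    tally = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            tally[match.group("path")][int(match.group("status"))] += 1
    return tally


def mixed_paths(tally):
    """Paths that returned 200 at least once and some other status at least once."""
    return sorted(p for p, counts in tally.items() if 200 in counts and len(counts) > 1)
```

With a log file opened as `fh`, `mixed_paths(googlebot_status_by_path(fh))` lists the posts that answered Googlebot inconsistently. If every flagged path shows nothing but 200s, then the server logs and the GSC report disagree, which is itself a useful data point.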

For example, there are numerous suggestions out there that Google signals what it regards as "thin content" by returning a Soft 404 in GSC. Now, granted, a Google Soft 404 can be anything that Google decides a Google Soft 404 is, because it is they who conceived that condition of sin; thus a Soft 404 can conceivably just as easily be a page Google arbitrarily frowns upon as it can be truly missing page references explicitly redirected to the home page in an .htaccess rule. But these Googlebot-only 302 Found redirects to the home page I'm seeing span everything from noindexed, very thin posts to good, long, complex posts and everything in between. That is, there is no common element obvious to me triggering the response in question.

Which is what brought me here. I know very generally what I should be looking for in the way of response triggers, but what more specifically, that is, what - apart from arbitrary Googleness - more specifically would trigger that specific header response, in Googlebot but not in browsers or several different header checkers running a variety of UAs including a spoofed Googlebot; inconstantly; in WordPress; following a transition from http to https using my host's LetsEncrypt certificate?

Given the lag times in GSC reporting, trying to check such things as active/inactive plugin or other cause/effect ranges from difficult to nigh impossible. I had previously used NetMeg's redirect rewrite rules from the recent http to https thread here but the phenomenon persisted. I have presently replaced them with this offering from my own host:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule (.*) [%{HTTP_HOST}%{REQUEST_URI}...] [R=301,L]

although, as mentioned, I added the missing [R=301,L]. Frankly, I'm not expecting this to make much of a difference either.
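
To make the flag point concrete: in mod_rewrite a bare R flag produces a 302, while R=301 (or R=permanent) produces a 301. The small Python sketch below only models that mapping for a flag string; `redirect_status` is a hypothetical helper, not an Apache API, and it is more tolerant of stray whitespace than Apache itself.

```python
# Symbolic names mod_rewrite accepts for the R flag, with their status codes.
SYMBOLIC = {"temp": 302, "permanent": 301, "seeother": 303}


def redirect_status(flags):
    """Return the HTTP status a RewriteRule flag string implies, or None.

    A bare R (or "redirect") defaults to 302; R=301 or R=permanent gives 301.
    None means the rule carries no redirect flag at all.
    """
    for flag in flags.strip("[] ").split(","):
        name, _, value = flag.strip().partition("=")
        if name.lower() in ("r", "redirect"):
            if not value:
                return 302  # mod_rewrite's default when no code is given
            value = value.strip().lower()
            return SYMBOLIC[value] if value in SYMBOLIC else int(value)
    return None
```

So `redirect_status("[R,L]")` gives 302 and `redirect_status("[R=301,L]")` gives 301. A rule with no R flag at all only redirects externally when its substitution is an absolute URL, and in that case the default is again a 302.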

I'm afraid what is needed is some sort of Sherlockian understanding and insight into this universe greater than I possess, of the sort "look for the hidden letters near a boiler or other hot water heater", something that will narrow down this specific, inconstant, header injection.

Again, thanks for everyone's help so far, and any further clues will be greatly appreciated. I doubt I am the only one who is or will be having this problem.

lucy24

8:17 pm on Mar 21, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



either include at the end only [R, L] or nothing at all rather than the specific [R=301, L], thus triggering a 302 redirect instead

That's just a flag. Change it. The rest of the rule is unaffected-- although frankly if you're seeing rules with the wrong flag, you have to look carefully and make sure there's nothing else wrong with the rule.

My host at my panel level offers the choice

I wouldn't. A server-level redirect happens before the request reaches your own htaccess, so you'll get chained redirects for no reason. It's trivial to add domain-name canonicalization to your own list of redirects. As a matter of fact, you can handily combine it with the new https redirect, like this:
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) https://www.example.com/$1 [R=301,L]
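
To see which requests the combined rule above would catch, its logic can be mirrored in a few lines of Python. This is only a sketch of the two conditions and the substitution, assuming the rule lives in a per-directory .htaccess; `needs_redirect` and `redirect_target` are illustrative names.

```python
import re

# Same pattern as the second RewriteCond: the canonical host, or an empty Host.
CANONICAL_HOST = re.compile(r"^(www\.example\.com)?$")


def needs_redirect(https_on, host):
    """Mirror the two OR'ed conditions: HTTPS off, or a non-canonical Host."""
    return (not https_on) or CANONICAL_HOST.search(host) is None


def redirect_target(path):
    """Build the 301 target the way the RewriteRule's substitution does."""
    # In .htaccess the leading slash is stripped before matching, so $1 has none.
    return "https://www.example.com/" + path.lstrip("/")
```

Because the group in the host pattern is optional, a request with no Host header at all is left alone rather than redirected. Everything else that is either plain http or on a different hostname goes to the canonical https URL in a single hop.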

Iskandrian

9:35 pm on Mar 21, 2016 (gmt 0)

10+ Year Member



That's just a flag. Change it.

Yes, I did.

I wouldn't.


My overall inclination is usually to have everything neatly in one place, particularly since Google frowns on excessive redirects and since now it turns out there is an additional http --> https redirect baked in at my original WordPress walled garden home, but until now I was hesitant to add too many variables to the stew.

Very well, I've just implemented these changes, and only time will tell (by the absence of future phenomena) whether this is a contributing factor, notwithstanding that it provides worthwhile elegance in its own right.

One additional consideration: WordPress, in Settings | General, has two options, WordPress Address (URL) and Site Address (URL). Per all extant conventional wisdom I changed these from http to https upon my https transition. With respect to these two most recent changes, though, (now, no host panel redirect; new combined rewrite rule), are those existing settings innocuous or will they need to be changed as well to avoid conflicts?

Thanks, Lucy. Let us see what Googlebot thinks.

Iskandrian

2:22 am on Mar 23, 2016 (gmt 0)

10+ Year Member



No, it seems that wasn't the problem, although, as mentioned, it did achieve a desirable new elegance.

I'm beginning to think this is either something odd and idiosyncratic to GSC itself, along the lines of other errors Gary Illyes has previously admitted to, or perhaps a phenomenon associated with the new WordPress API. For example, Google reported another URL as an ostensibly non-existent Soft 404 which, when then manually fetched as Google, resolved directly and successfully as a 200. In addition, 302 Not Followed errors (a redirect Googlebot claims is empty), hitherto never seen (at least by me), are being produced and recorded in GSC for URLs which are nevertheless perfectly good and reachable.

Thank you to everyone who tried to help nevertheless.