Forum Moderators: Robert Charlton & goodroi

HTTP to HTTPS migration and tons of "Duplicate without canonical"

guarriman3

11:37 am on Feb 23, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi,

I hope this experience could be useful to other users of this forum. I've got a website with 2,000,000+ URLs, and I migrated it from HTTP to HTTPS in early 2019.

I suffered a 50% drop in visits with the November 2019 update, and browsing the 'Coverage > Excluded' section of Google Search Console for the HTTP property I found:
  • 100,000 URLs with "Duplicate without user-selected canonical"
  • 80,000 URLs with "Duplicate, Google chose different canonical than user"

    These 'duplicate URLs' do not appear in the Search Console of the HTTPS property.

    It just so happens that the "last crawled" date for both groups of 'duplicate URLs' is Nov 7, 2019, just the day before the penalization. It seems that Googlebot decided to stop crawling the 'duplicate URLs' on my website.

    As a clue (not sure whether this is linked or not), I've just found that I made a mistake in the HTTP-to-HTTPS migration in early 2019. Instead of setting a 301 redirect from "http://whatever" to "https://whatever", I set a 302. I've just fixed it.

    Now I have some questions for the forum:
  • Do you think the 302 mistake is linked to the existence of the 'duplicate URLs' and the penalization? Or might it be linked to a real 'thin content' issue?
  • How can I get Google to recrawl these 'duplicate URLs'?

    Thank you!
    not2easy

    1:29 pm on Feb 23, 2020 (gmt 0)

    WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



    A 302 (temporary) redirect is the Apache default, so unless you add the 301 flag to the rewrite rule, Apache sends a 302, which leaves it up to Google to decide which version to index. Yes, that is related to the problem; it is pretty much the cause of the problem.

    If you have added the [R=301,L] flag to the rewrite rule, that should correct the problem automatically. If you used a control panel to rewrite your URLs to https, its default is also 302 unless you specify 301.
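
    As a sketch, an .htaccess rewrite that forces the permanent redirect might look like this (a generic example, not the poster's actual config; adapt the conditions to your own setup):

```apache
# Force HTTP to HTTPS with a permanent (301) redirect.
# Without R=301, RewriteRule's R flag defaults to a temporary 302.
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]
```

    Note this assumes an .htaccess context (where the matched path has no leading slash); in a server-config context the pattern would need adjusting.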

    If your sitemap is using https: URLs then you don't need to worry about telling Google - unless you haven't added the "new" https domain to GSC.

    If done correctly, Google should not be able to access the old http: URLs. You can test that yourself: paste an old http: URL into your browser's address bar and see whether it automatically takes you to the https: version.
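
    To make the check mechanical, a minimal sketch (function and names are my own, not from the thread) of what "correct" means for a migration redirect: a 301, and only a 301, pointing at the exact https: twin of the old URL:

```python
def is_correct_migration_redirect(status: int, location: str, original_url: str) -> bool:
    """True only for a *permanent* redirect to the HTTPS twin of the URL."""
    if status != 301:
        # A 302 (or 307) is temporary and leaves the canonical choice to Google.
        return False
    expected = "https://" + original_url.removeprefix("http://")
    return location == expected

# A 301 to the https: twin is correct...
print(is_correct_migration_redirect(301, "https://example.com/page", "http://example.com/page"))  # True
# ...while a 302, even to the right target, is not.
print(is_correct_migration_redirect(302, "https://example.com/page", "http://example.com/page"))  # False
```

    You could feed this the status code and Location header from a `curl -I` on any old http: URL.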

    guarriman3

    5:26 pm on Feb 23, 2020 (gmt 0)

    10+ Year Member Top Contributors Of The Month



    Hi @not2easy,

    Thank you very much for your kind answer. In fact, the problem with the 302 redirect was not in the Apache server (where I had properly configured the 301 redirect), but in CloudFlare (an HTTPS misconfiguration). I did fix it :-)

    Do you think Googlebot will recrawl the URLs in the HTTP property, or that Google will at least remove the 'duplicate URLs' from the HTTP property?

    In your opinion, could the misconfigured 302 redirect have led to the SERP penalization?

    not2easy

    6:05 pm on Feb 23, 2020 (gmt 0)

    WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



    If they have indexed any http: URLs, they will re-crawl them and notice immediately that they now redirect (permanently) to the https: URLs. If you submit a sitemap, it helps to include only https: URLs.
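
    As a sketch, a sitemap restricted to the https: URLs would look like this (example.com and the paths are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>
```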

    Yes, a 302 (temporary) redirect will confuse the crawl evaluation. If you can no longer visit the old http: URLs, then Google can't see them either, but as mentioned, if they are indexed as http: they will attempt to re-crawl them.

    In GSC, if it shows any http: URLs, you can click to 'submit for validation' once you are sure those URLs can no longer be reached directly. It will be corrected automatically if only the new version can be reached and it is now being reached via a 301.