Forum Moderators: Robert Charlton & goodroi


HTTPS migration: lots of "Duplicate, URL not selected as canonical"

         

guarriman3

2:08 pm on Jun 3, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi,

In January, I started migrating one of my websites, with 300,000+ URLs, from HTTP to HTTPS. I think everything went fine (I changed the internal links from HTTP to HTTPS, the redirects return "301 Moved Permanently", I managed to get the site HSTS preloaded, ...).

However, in Google Search Console, in the HTTP property (not the HTTPS one), there are still about 95,000 URLs today with the following message: "Duplicate, submitted URL not selected as canonical. Status: Excluded", with a "Last crawled" date of Jun 1, 2020. I expected this number to drop over the months, but it has not. It has been fairly constant for months.

In January I created a specific sitemap file for the non-HTTPS property, with 300,000+ URLs ('http://foo.com/whatever', not 'https://foo.com/whatever'), so that Google would crawl the HTTP URLs and, after seeing that they were 301-redirected, remove them from its crawl list. But Googlebot is still crawling the HTTP URLs and even reporting them as "duplicate".

Any similar experience? Thank you very much.

guarriman3

2:08 pm on Aug 2, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi,

Sorry to bring this thread up again. The number of URLs under "Duplicate, submitted URL not selected as canonical. Status: Excluded" in the HTTP property (not the HTTPS one) is increasing.

As I mentioned in June, there were 95k; now there are 160k. The "Last crawled" date is Jul 27, 2020.

I do not know whether this is expected and I should wait for the figure to reach 300k (the total number of URLs on my website), but it's a bit confusing that Google considers my HTTP URLs "duplicate".

Thank you.

not2easy

3:00 pm on Aug 2, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Those old URLs are duplicates of the new ones. Be grateful that Google appears to be selecting the HTTPS URLs to index.

You should not submit a sitemap for URLs you do not want indexed, because submitting a sitemap tells Google that you want them indexed. The fact that they redirect will not stop crawl attempts while a sitemap is requesting their crawl.

I was going to reference Google's page where they stated this: [support.google.com...] but they now link that section of the Guidelines directly to sitemaps.org: [sitemaps.org...], where it says:
What are Sitemaps?

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

So if I were trying to get them looking at the https: URLs I would stop asking them to look at the http: URLs. I have migrated many sites to https and always got rid of old sitemaps.
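If the old http sitemap is worth keeping as a starting point rather than deleting outright, its entries can be rewritten to https and submitted to the HTTPS property only. This is a rough sketch, not a recommended tool: the function name sitemap_to_https is made up for illustration, and it assumes the file is a plain urlset using the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Standard namespace from the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_to_https(xml_text: str) -> str:
    """Return the sitemap XML with every http:// <loc> switched to https://.

    Hypothetical helper for illustration; assumes a simple <urlset> document.
    """
    # Serialize with the sitemap namespace as the default (no ns0: prefixes).
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(xml_text)
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        url = (loc.text or "").strip()
        if url.startswith("http://"):
            loc.text = "https://" + url[len("http://"):]
    return ET.tostring(root, encoding="unicode")


# Example entry taken from the URL pattern mentioned earlier in the thread.
old_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://foo.com/whatever</loc></url>"
    "</urlset>"
)
new_sitemap = sitemap_to_https(old_sitemap)
```

The result would go to the HTTPS property, and the http sitemap would be removed from the HTTP property.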

If you have not set up the "new" (https) site in GSC, you should do that (from your post, it appears that you have). You can leave the old http site in GSC to monitor as Google stops crawling and possibly indexing the old URLs; those can compete with your new URLs. I would also verify manually that each old URL redirects to its new equivalent page, rather than all of them redirecting to the new https homepage. It sounds as if you only need to get rid of the old http sitemap to speed up re-indexing.
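With 300,000+ URLs, a manual check can only cover a sample, so a small script may help. The sketch below is only one way to do it: the function names and the result labels ("ok", "homepage", and so on) are invented for illustration, and it assumes the HTTPS URL should be identical to the HTTP one apart from the scheme. fetch_first_hop looks at the first response only and does not follow the redirect.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit


def expected_https_url(http_url: str) -> str:
    """Same URL with the scheme switched to https (assumed target)."""
    return urlsplit(http_url)._replace(scheme="https").geturl()


def classify_redirect(http_url: str, status: int, location: Optional[str]) -> str:
    """Label one redirect; the labels are hypothetical, for reporting only."""
    if status != 301:
        return "not-301"
    if not location:
        return "no-location"
    if location == expected_https_url(http_url):
        return "ok"
    # A deep URL that lands on "/" suggests a blanket redirect to the homepage.
    if urlsplit(location).path in ("", "/") and urlsplit(http_url).path not in ("", "/"):
        return "homepage"
    return "other"


def fetch_first_hop(http_url: str) -> Tuple[int, Optional[str]]:
    """Send a HEAD request over plain HTTP; return (status, Location header)."""
    parts = urlsplit(http_url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

For a sampled URL, something like classify_redirect(url, *fetch_first_hop(url)) gives one label per URL; anything other than "ok" is worth a closer look.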

lucy24

5:57 pm on Aug 2, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



in the HTTP property (not in the HTTPS), there are still today about 95,000 URLs with the following message: "Duplicate, submitted URL not selected as canonical. Status: Excluded"
Isn’t that exactly what you want to see? “We know this URL exists, and we also know it isn’t the preferred one. Just sayin’.”

Search engines, including Google, will never stop crawling HTTP URLs, no matter how long the site has been HTTPS. I stress: NEVER. The frequency will drop off over the years, but that’s all you can hope for.

When a human visitor comes in from a search engine, they should be pointed directly to HTTPS, not redirected from HTTP. This change should be largely completed within a few weeks, depending on the site; it doesn't happen all at once. You can't use GSC to check this. Only your access logs will tell you. Spot-check the HTTP logs (assuming they are separate from the HTTPS logs) and make sure there aren't a lot of requests answered with a 301 that have Google (or another search engine) as the referer.
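A spot-check like that can be scripted. The sketch below assumes the Apache/nginx "combined" log format and uses a short, incomplete list of search engine hostname fragments; the sample lines are made up for illustration and are not from any real log.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: "combined" log format; adjust the pattern if your format differs.
LOG_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)

# Assumption: illustrative hostname fragments, not a complete list.
SEARCH_HOSTS = ("google.", "bing.", "duckduckgo.")


def search_referred_redirects(lines, hosts=SEARCH_HOSTS):
    """Count, per path, the 301 responses whose referer is a search engine."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if not m or m.group("status") != "301":
            continue
        referer_host = urlsplit(m.group("referer")).hostname or ""
        if any(h in referer_host for h in hosts):
            hits[m.group("path")] += 1
    return hits


# Hypothetical sample lines: a human arriving from Google, a Googlebot crawl
# with no referer, and a visitor arriving from an unrelated site.
SAMPLE_LINES = [
    '203.0.113.5 - - [02/Aug/2020:10:00:00 +0000] "GET /whatever HTTP/1.1" '
    '301 178 "https://www.google.com/" "Mozilla/5.0"',
    '203.0.113.6 - - [02/Aug/2020:10:00:05 +0000] "GET /other HTTP/1.1" '
    '301 178 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [02/Aug/2020:10:00:09 +0000] "GET /page HTTP/1.1" '
    '301 178 "https://example.org/links" "Mozilla/5.0"',
]
hits = search_referred_redirects(SAMPLE_LINES)
```

Only the first sample line counts: the Googlebot request is a crawl with no referer, which is expected to continue indefinitely, while a search-referred 301 means a human was still sent to the HTTP URL.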

System

1:49 pm on Aug 5, 2020 (gmt 0)

redhat



Mods Note:
The discussion about redirecting URLs was moved to the Apache forum and can be found in a new thread HTTPS migration questions [webmasterworld.com]

[edited by: not2easy at 1:42 pm (utc) on Aug 6, 2020]
[edit reason] fixed the link [/edit]