Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Google selects unrelated URLs as canonical


MayankParmar

2:59 pm on Sep 25, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Starting September 22, some of my new posts stopped appearing on Google. On September 23, some old posts also disappeared. Since the 24th, I haven't been able to get any article indexed by Google.

When I used the URL inspection tool in Search Console, I discovered that Google had started selecting completely unrelated URLs as canonicals. I have declared the canonical URLs properly in the source code, but Google is still picking incorrect ones, completely unrelated pages that aren't even linked to each other.
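As a sanity check, the canonical declaration lives in a `<link rel="canonical">` element in the page head, and it's worth confirming what each page actually emits, not just what the CMS is configured to emit. A minimal sketch using only the Python standard library (the page markup and URL below are hypothetical):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            a = dict(attrs)
            if (a.get("rel") or "").lower() == "canonical" and a.get("href"):
                self.canonicals.append(a["href"])

def find_canonicals(html):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonicals

# A healthy article page should yield exactly one self-referencing URL.
page = ('<html><head>'
        '<link rel="canonical" href="https://example.com/post-a/">'
        '</head><body></body></html>')
print(find_canonicals(page))  # → ['https://example.com/post-a/']
```

Zero results, or more than one, on any template would be worth fixing before blaming Google's selection.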

For example, an article on Microsoft <unrelated product> has been canonicalized to a months-old post about Windows 10's Start Menu. Two other articles were canonicalized to the same Windows 10 Start Menu article. Unrelated topics and pages that aren't even linked in the source code have been canonicalized (per my own tests and the inspection tool's live test).

I have a complete list of URLs and examples posted here: <snip>
See mod's note below.

Here's what I have done so far:

- Scanned URLs using Google's AMP, mobile-friendliness, rich results, and other testing tools. No issues.
- Used the Request Indexing option, but it didn't work.
- Cleared Kinsta and Sucuri cache and resubmitted the sitemap.
- Checked theme, plugins and other files for any recent modifications. No files were modified.
- Scanned the site using a malware scanner. No issues.
- Cleared and then disabled the Sucuri cache. Turned off AMP and Disqus comments for the last two articles. That also didn't work.

Any suggestions/ideas on what could be fixed? The site has a great backlink profile and is often linked for original/exclusive reporting.

Thank you


[edited by: Robert_Charlton at 12:24 pm (utc) on Sep 26, 2020]
[edit reason] Removed violations of Forum Charter and TOS. Will explain in post below. [/edit]

Robert Charlton

2:10 pm on Sep 26, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



MayankParmar, sorry to hear you're having problems, as I know you've put a lot of work into your site.

I'm also sorry I needed to remove some specifics from your post... in this case a specific product name, and also a link to a Google support thread where your site is linked and undergoing public review. For a great many reasons, we simply don't allow linking to specific or public site reviews anywhere in the public areas of WebmasterWorld... for everybody's protection, and to avoid spam and promotion... so we also avoid linking to parallel site reviews with specifics on other forums.

In deciding how to handle the link, I did look at the Google support discussion... which doesn't appear to be very helpful, but you did mention something there which you didn't mention here, and it's very important... and that's the specific error message you received.

The error is "Duplicate, submitted URL not selected as canonical".

When I'm working on a site that receives an error message from Google, virtually the first thing I do is a quoted search of the message.

That search strikes gold almost immediately. My quoted search turned up this post by Barry on seroundtable, from almost two years ago, with a response from John Mueller. In this case, a search for the title you posted on this thread would also turn up Barry's SER article...

Google Getting Your Canonical URLs Wrong?
Dec 26, 2018
by Barry Schwartz
https://www.seroundtable.com/google-canonical-urls-wrong-26872.html

First, Barry's intro...
The specific error [the webmaster] is seeing from Google Search Console is "Duplicate, submitted URL not selected as canonical." He said "The problem is that the two pages are not duplicates and google selected canonical does not match the user selected canonical."

Google's John Mueller came in on Christmas eve and responded:

"Usually this happens when we run across a number of URL patterns on a site that all lead to substantially the same content. If this all happened during a short time, it might be that there was something misconfigured that caused this, and in that case, it'll settle back down over time as our algorithms confirm that these URLs are actually separate. That said, I agree that this looks really confusing, so I also forwarded these to the team to check out, in case there's something on our side which we can do to speed things up for you :). "

This immediately brought to mind your post(s), which talk about a lot of loosely related or unrelated articles (in or around Microsoft products) getting canonicalized to the Windows 10's Start Menu post.

Your problem also is not isolated... it's been around for years. Not exactly a "bug", but it is a recurring indexing problem that Google ought to sort out. None of the posts I've seen about the issue gets to the heart of the problem, which is that the CMSs used probably introduce a lot of dupe content, essentially the same articles or article snippets sorted in different ways.

Again, as John put it...
"Usually this happens when we run across a number of URL patterns on a site that all lead to substantially the same content."

WordPress architecture immediately comes to mind, and you suggest that you've got a WordPress site. There are many possibilities for blocks of text appearing over and over throughout the various topical and chronological categories, essentially cross-linking too often, rather than forming a more focused, more granular taxonomy. I can see wording on different types and levels of pages being repeated often enough that it's initially a blur to Google. If you've got global cross-linking, rather than prioritized related-content linking, with shared Windows 10 vocabulary throughout, that's probably a good part of your difficulty right there.

There's also a chance that some of your articles have been scraped or syndicated enough that a kind of sameness pervades the site and blurs everything. As John says, "it'll settle back down over time as our algorithms confirm that these URLs are actually separate."

It could be that Google is going through some other things as well... Somewhere, I remember seeing a greater desire on Google's part for uniqueness, a different approach to content... maybe hard to do when you and your competitors are all looking at the same product line.

Anyway, those are my quick thoughts, too late at night. I hope this helps.

MayankParmar

3:54 pm on Sep 26, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Thanks for the detailed reply!

Canonicals are changing every day. Google deindexed the "Windows 10's Start Menu" post too, and it has been canonicalized to a different topic on 'Surface Book'. It's a very odd canonicalization chain, and articles are disappearing. Based on the mobile usability report, more than 1,300 articles are now deindexed. The coverage report is delayed, so I don't know how many URLs are affected.

I don't use Tag pages. I don't have subpages of categories indexed and I also have noindex applied to all subpages of the homepage since 2018. I have made every possible effort to avoid duplication.

This all started suddenly on the 22nd, when the coverage report was last updated.

Really not sure what to fix... it's not just a section of the site that is affected. The whole site is in a canonicalization chain.

I don't have any unnatural/automated internal linking pattern. All are manually linked.

Deindexed articles have natural backlinks too. In fact, the whole site has a great link profile and is reputed in the niche.

JorgeV

12:14 pm on Sep 27, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello,

Keep things simple, one page = one URL.
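JorgeV's "one page = one URL" rule can also be enforced mechanically, before a request ever reaches the CMS, by collapsing common URL variants onto one form. A rough stdlib-only sketch of the idea; the tracking parameters and the no-trailing-slash policy are illustrative assumptions, not a prescription:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical tracking parameters to drop; adjust to your own site.
STRIP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}

def normalize(url):
    """Collapse common URL variants onto one canonical form."""
    parts = urlsplit(url)
    scheme = "https"                      # force a single scheme
    host = parts.netloc.lower()           # hostnames are case-insensitive
    path = parts.path.rstrip("/") or "/"  # pick one trailing-slash policy
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in STRIP_PARAMS])
    return urlunsplit((scheme, host, path, query, ""))  # drop fragment

print(normalize("http://Example.com/post-a/?utm_source=x#top"))
# → https://example.com/post-a
```

In practice the same rules would back 301 redirects, so that every variant a visitor or crawler hits resolves to the one indexable URL.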

Robert Charlton

10:23 pm on Oct 2, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Note: The following info was posted by WebmasterWorld admin phranque in our SERPs and Updates thread on Oct 1....
------
from a @searchliaison twitter thread [twitter.com]:
We are currently working to resolve two separate indexing issues that have impacted some URLs. One is with mobile-indexing. The other is with canonicalization, how we detect and handle duplicate content. In either case, pages might not be indexed....

Robert Charlton

3:54 am on Oct 3, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



MayankParmar, I was glad to learn also, from the Updates thread, that your pages are coming back. Great news!

While I don't know precisely what Google is doing or what their problem was, what they said about what they are fixing, IMO, does suggest that you are probably on the edge of dupe content issues. Here's Google's statement, with my emphasis added...

We are currently working to resolve two separate indexing issues that have impacted some URLs. One is with mobile-indexing. The other is with canonicalization, how we detect and handle duplicate content.


As I'd noted in my post earlier in this thread...
...None of the posts I've seen about the issue gets to the heart of the problem, which is that the CMS's used probably introduce a lot of dupe content, essentially the same articles or article snippets sorted in different ways....


For years on WebmasterWorld, we've had discussions about WP in particular, but also about the dupe content issues added by all of the various CMSs. Mostly, the fixes involved...
- dropping some superfluous pages in WP
- and paraphrasing snippets.
As Google became accustomed to WP, etc, my guess is that it also adapted its dupe criteria to fit.

But now... and this is conjecture... with machine learning evaluating duplication and sameness... not just by the exact text on a page, but also by its meaning... I'm thinking that Google might have gone too far in the direction of meaning, and that the sense of a page as well as the literal text got involved in the canonicalization process.

The issues are complex enough that we can only guess... but I'm guessing that, for whatever reason, the sites affected most were sites on the edge of too much internal duplication... and very probably many of them were WordPress.

not2easy

4:27 am on Oct 3, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



WP does a great job of offering several ways to see the same content, useful maybe for people visiting complex sites. Less useful for sorting what belongs in the index. If Google has trouble with retrieving the canonical URL for an article then there will be trouble with indexing the selected pages and posts, no doubt.

Once you have determined which version of your content you want indexed and made sure that content has one canonical URL and that only that URL is submitted for indexing, the rest is up to Google.

It sounds like there's been a glitch with Google because it does not make sense that sites which were doing well suddenly started seeing the same canonical errors and duplicate content alerts. In this situation, it is best to wait it out. If you set it up right, wait for them to sort it out. If you aren't sure it is set up right, examine the URL syntax variations and look at canonical information in the page source header. If you are seeing correct information, leave it alone.

JorgeV

12:29 pm on Oct 3, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello,

Indirectly related, but serving the same content from different URLs, even with a canonical, unnecessarily exhausts your crawl budget.

MayankParmar

12:45 pm on Oct 3, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Interesting points, thank you all :)

I'll definitely take a look and see what can be fixed. Currently, my subpages are deindexed to improve the site's overall quality, but that will not stop the Googlebots from crawling the pages.

I also have mega menus, and I believe Google is good at detecting navigation, but it looks like the theme is not using a <nav> element to declare the menus. I'll ask the developers if that can be improved.

Meanwhile, I'll do another audit to surface problems and get them fixed.

I do have other plans to improve the quality of the site by removing bloat code, but I want to make changes only after the core update is released. Waiting patiently :)

JesterMagic

1:42 pm on Oct 3, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robert brought up some good points.

does suggest that you are probably on the edge of dupe content issues


So what do you feel the dup problem is MayankParmar?

Is it category pages that display a snippet of each article that may be causing the duplicate content issue? Are these snippets too large?

Or is it another issue? For example, to reduce duplicate content issues, I not only have the canonical link in my source (which I know you do as well), but my website is set up so each article has only one URL (not counting some query variables for misc things). My articles belong to multiple Topics/Categories, but I don't rewrite the URL to include the category like some sites do. The only case where our articles display under a completely different URL is our print pages, and those use NOINDEX.

MayankParmar

2:01 pm on Oct 3, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



I am not sure where the problem is... but the pages are back in the index after Google fixed it, so either it was entirely their fault, or they noticed something common (duplication) across these sites and wrongly interpreted it? Not sure.

Category pages display a 2.5-line snippet, while the homepage and its subpages display a 3-4 line snippet on desktop. On mobile devices, only the title is displayed (no snippets), and there are no mega menus. Subpages are deindexed.

Also, the author and tag pages are completely deindexed.

I'll also try to remove the unnecessary category pages and posts.

Robert Charlton

2:56 pm on Oct 3, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Mayank... IMO, megamenus can be a major problem. I feel that they essentially destroy search focus on a site. To be discussed later, as for now I think that noindexing or "deindexing" of pages is probably a much larger problem. For now, just some brief initial comments regarding what you wrote.

First, is it "noindex" you're doing or "deindexing" (ie, using the page removal tool)? I'm going to guess you meant "noindex", when you wrote the following...
I'll definitely take a look and see what can be fixed. Currently, my subpages are deindexed to improve the site's overall quality, but that will not stop the Googlebots from crawling the pages
I don't think that crawl budget is an issue, if that's what you're thinking when you mention Googlebots.

There is another issue with noindex, though, which may not be doing what you think. For a while, it had been a common practice to use robots "noindex,follow" on thin or dupe pages to keep them out of the index.

Google, though, likes its indexing practices to reflect user experience (a very important principle to keep in mind), and decided to essentially disable this feature as a long-term fix. Noindex behavior was changed so that Google honors the noindex declaration for a short time, but after an unspecified period stops following the links on those pages... and also treats the pages as 404s.

Therefore, it's not a good idea to have noindexed pages on your site. Here's a thread in which this was introduced... Note that "Will" in the title is British English and doesn't mean the disabling is something eventually coming from Google... Rather, it means that's what you can expect 'will' happen now....

Google Will Eventually Stop Following Links on Noindex Pages
Dec 2017
https://www.webmasterworld.com/google/4881752.htm [webmasterworld.com]

My guess is that if you have much dupe content (and/or frequently repeated chunks of text on pages that are not otherwise dupes, just different WordPress sort pages)... in combination with megamenus, and you are also using "noindex" on some pages to 'sculpt' your site, you're most likely in problem territory.

Additionally, in the noindex thread I cite above, read the John Mueller statements I quote in my Dec 17, 2018 post near the end of that thread, which goes into possible problems mixing rel canonicals and noindex, and how Google looks at these differently.

So, all of the above. I don't know for sure how much of this combination, if any, has to do with the disappearance of your canonical pages (which Google is now fixing), but I hope you're following my thoughts about why I think what you're doing could be really confusing to Google, and not helping your site long-term.

not2easy

3:30 pm on Oct 3, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The whole point of using canonical meta tags on WP sites is that you cannot stop WP from creating various URLs for your content. Google is aware of this URL syntax in WP, so there's no reason to pretend it is a static site by using 'noindex'. You create a page and assign it to a category, and (depending on your permalink syntax) it will be found under its original URL, the category URL, the archives URL, and, if any tags are used to assist in search, a tag page URL, or even multiple tag pages' URLs. That is the way WP works. With a canonical meta tag pointing all those various URLs to a single page, we hope that Google will index that one version.
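One way to audit the setup described above is to read back the canonical each WP-generated variant declares and check that they all agree. A small sketch of that consistency check; every URL below is a hypothetical stand-in:

```python
# Hypothetical data: the canonical URL declared by each WP-generated
# variant of the same post (as read from its <link rel="canonical">).
declared = {
    "https://example.com/post-a/":                  "https://example.com/post-a/",
    "https://example.com/category/windows/post-a/": "https://example.com/post-a/",
    "https://example.com/2020/09/post-a/":          "https://example.com/post-a/",
    "https://example.com/tag/start-menu/post-a/":   "https://example.com/post-a/",
}

def consistent_canonical(declared):
    """Return the single canonical if every variant agrees, else None."""
    targets = set(declared.values())
    return targets.pop() if len(targets) == 1 else None

print(consistent_canonical(declared))  # → https://example.com/post-a/
```

If this returns None for any post, two variants are declaring different canonicals, which is exactly the kind of mixed signal that invites Google to pick its own.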

If your sitemaps include all those other variations, it may confuse your intentions for indexing. The recent 'canonical confusion' issue has been admitted as an error at Google, so if you keep making changes, it can only take longer for this to be sorted out. IF it was OK before, let it settle back to that.
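Keeping those variant URLs out of the sitemap is easy to automate: generate the sitemap only from the list of canonical URLs, never from a crawl of the site. A minimal sketch with hypothetical URLs:

```python
from xml.sax.saxutils import escape

def sitemap(urls):
    """Emit a minimal XML sitemap listing only the URLs meant for indexing."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>"
                        for u in sorted(set(urls)))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>")

# Hypothetical: only canonical post URLs, no category/tag/archive variants.
canonical_urls = ["https://example.com/post-a/", "https://example.com/post-b/"]
print(sitemap(canonical_urls))
```

The sitemap then states your indexing intent once, in one place, and the canonical tags on the variant pages reinforce the same signal.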