Forum Moderators: Robert Charlton & goodroi

Misuse of rel=canonical results in mega FAIL; help to fix?

         

domino66

5:19 pm on Jan 2, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Oh man. Researching a pagination issue, I realized that I think 80% of my content-rich Q&A website has been made invisible to Google, and I need help to fix it.

Here's what happened: Each of my site's Q&A's is with an expert in a particular field, and I used to publish each one in a single-page view, with all Q/A content on one page. Because Google was indexing a few variants of the same URL, I implemented the rel=canonical attribute so that my Opera Singer Q&A code included:
<link rel="canonical" href="http://mysite.com/opera-singer"/>

So far, so good. But last Summer I decided to paginate the Q&A's with 7 Q/A's on each page b/c some of them had 100+ Qs and took too long to load. So the paginated URL structure became:
Page 1: http://mysite.com/opera-singer
Page 2: http://mysite.com/opera-singer/2
Page 3: http://mysite.com/opera-singer/3
etc, etc.

Here's the megaFAIL that I just discovered: the same canonical link above (pointing to http://mysite.com/opera-singer -- page 1 of the now-paginated Q&A) was replicated to each /2, /3, /4 paginated page(!) So correct me if I'm wrong, but I believe the net effect was that I was literally telling Google's spider to IGNORE the Q&A content that wasn't on page 1 of the now-paginated Q&A...so 80% of my content-rich site is essentially invisible to Google(!) The "proof" is that if I search Google for any text string on page 1 of a Q&A it finds it, but it does NOT find any text string on pages 2+.

So now for my questions:

1) Is my interpretation of what's happening and why correct? IOW, did my (mis)use of the rel=canonical attribute tell Google to ignore all content not on paginated page 1?

2) I changed the code for each paginated page to reflect the new paginated URL structure...so URL http://mysite.com/opera-singer/2 now uses <link rel="canonical" href="http://mysite.com/opera-singer/2"/>, and so on...and then I went to WMT and forced a recrawl of the entire site. Is there anything else I can do? I'm a little concerned because it's been 5 days, and Google still can't find any of my content that's NOT on the first page of a Q&A. Is it possible that simply changing the canonical links as described and recrawling will NOT work because Google has become permanently 'blind' to them, since for the last 6 months I had been telling it to ignore them? Is it 'smart' enough to realize 'Hey, I'm no longer being told to ignore this content, and wow it's entirely new and different than what this page's canonical link had previously been pointing me to'?
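For anyone wanting to audit a site for the same mistake, here is a minimal Python sketch. It is not from the thread, and the names CanonicalFinder and canonical_mismatch are made up for illustration. It pulls the rel=canonical href out of a page's HTML and reports it when it points somewhere other than the page's own URL. It assumes you already have the HTML as a string, and it only does a naive trailing-slash comparison.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_mismatch(page_url, html):
    """Return the canonical href if it differs from page_url, else None."""
    finder = CanonicalFinder()
    finder.feed(html)
    for href in finder.canonicals:
        # Naive comparison: only a trailing slash is normalised away.
        if href.rstrip("/") != page_url.rstrip("/"):
            return href
    return None
```

Fed page 2's URL together with HTML that still carries the page 1 canonical, canonical_mismatch returns the page 1 URL; once the canonical is self-referencing it returns None.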

Planet13

6:44 pm on Jan 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I will let others weigh in regarding the technical issues.

I would instead focus on the user interface issues.

I don't think having an article with 100 Q & As with ANYBODY is particularly user friendly, and I don't think that splitting them up into 7 Q & As per page makes it particularly more user friendly.

If it were me, I would take those 100 Q & As and find all of the ones that relate to a specific topic of interest, no matter how many Q & As that leaves on a page.

Using your example above, you might have one page that covers tips by the singer on breath control. Another page might be business advice with tips by them on how they got their start in opera and how they kept their career growing.

Another thought is to not base the Q & A's solely by person, but by subject. So you could have a page on breath control and have Q and A's from ALL of the opera singers that you have interviewed.

Hope this helps and apologies in advance if it doesn't.

FranticFish

6:50 pm on Jan 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1) Yes. From what I've read rel=canonical is more of a guideline than a directive, but if you can't find the paginated pages in the index then I think it's safe to assume that's what happened.

2) You can 'fetch as Googlebot' to make sure you get a response on those pages and if you do then it's a question of waiting.

Has Google become permanently blind to those urls? That would not match behaviour I've seen from my own crawling and indexing mishaps.

domino66

6:51 pm on Jan 2, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



[reply to Planet13]

Thanks for the reply -- I didn't want to get too much into the nature of my Q&A site, since I know I can't link to it, but while I understand your points about long Q&A's, my site is designed like one of Reddit's 'AMAs'...so users can continue to ask questions of expert hosts as long as they wish as long as the host continues to answer (hence several very long Q&A's). So shortening or curating the Q&A's is not really something I'm looking to do.

So for now, I'd really like to focus on the SEO / URL issue, since reversing the damage I've done by 'hiding' 80% of the site from search engines is my #1 priority right now. Thanks again, though, your suggestions aren't bad ones, just not a priority for now.

[reply to FranticFish]

Thanks for the reply.

If by 'Fetch as Googlebot', you're referring to the Fetch tool in WMT, yes, I did that first for my home page mysite.com (selecting crawl all internally linked pages too), and then just to be sure I did it again on one of the 'invisible' pages mysite.com/opera-singer/2...and for both it returned status "Complete" and "URL and linked pages submitted to Index"...sounds like that's what you were recommending and that it completed successfully.

So the rel=canonical being more of a SUGGESTION than a directive is something I thought too, because Google often assures everyone that their algorithm is very smart and will crawl/index things even if they're not optimally structured. So if it were just a 'suggestion', I'm shocked that Google's crawler wouldn't have immediately recognized that the content on paginated page 2 of 7 was completely different than the canonical link to page 1/7, and indexed it anyway(!)

aakk9999

11:21 pm on Jan 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Misuse of rel=canonical can tank the site very quickly. Do not rely on Google to get it right - from my experience in most cases Google pretty much blindly follows it rather than analysing whether it makes sense, whether the pages are the same/similar, etc.

Google published this a couple of years ago and it is worth reading:

5 common mistakes with rel=canonical
Official Webmaster Central Blog, April 2013
http://googlewebmastercentral.blogspot.co.uk/2013/04/5-common-mistakes-with-relcanonical.html [googlewebmastercentral.blogspot.co.uk]

Domino66, yours is listed on there as Mistake #1 :(

But what you did (replaced the canonical, then Fetched and re-submitted) should rectify it; you just need some patience for Google to digest these changes.

lucy24

1:28 am on Jan 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I believe the net effect was that I was literally telling Google's spider to IGNORE the Q&A content that wasn't on page 1 of the now-paginated Q&A

Obligatory observation: You weren't telling the Googlebot not to crawl. You were telling the Google computer not to index.

domino66

2:15 am on Jan 3, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks, aakk9999, good link. So yes, it appears that Mistake #1 describes verbatim what happened to my site. Really hoping that fixing/re-indexing fixes this, given that we're an organic search beast. Really hard to believe that Google simply ignores pages 2, 3, etc when there's absolutely zero overlap in content...but that's what that Google blog post implies happens (even though in other instances they tout their algorithm's ability to do a good job even with slight misconfiguration.)

domino66

8:22 pm on Jan 8, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



OK, I've spent a week reading & posting here and in other forums, and here's my proposed markup to help Google crawl and index my paginated Q&A site optimally. Can people please comment on whether it looks good?

Problem:
  • My site features Q&A with experts in various fields who answer anything users want to ask (like Reddit AMA), and I paginate content to 7 Qs/As per page.
  • There is also a View All link at the bottom of each paginated page, but that single-page view can be very slow to load b/c it's sometimes hundreds of Q/A's long, so I provide it as an option, but do NOT want Google to index it or direct users there. I want organic search users sent to the specific paginated page that contains the content relevant to their search.
  • I had originally committed Pagination Mistake #1 -- as described in this Google Blog post > [goo.gl...] -- by including a rel=canonical link on EVERY paginated page that pointed to page 1 of the series. The net effect was that Google literally did not index any content that did not appear on the first page of a Q&A (over 80% of my content-rich UGC site!) I'm now trying to undo that damage and get all of my site content re-indexed ASAP.

    Here's what I've settled on:

    Page 1 of a paginated Firefighter Q&A.
      URL: http://example.com/firefighter/1
      Canonical: <link rel="canonical" href="http://example.com/firefighter/1" />
      Prev/Next: <link rel="next" href="http://example.com/firefighter/2"/>

    Page 2 of the paginated Firefighter Q&A.
      URL: http://example.com/firefighter/2
      Canonical: <link rel="canonical" href="http://example.com/firefighter/2" />
      Prev/Next:
      <link rel="prev" href="http://example.com/firefighter/1"/>
      <link rel="next" href="http://example.com/firefighter/3"/>

    Last page (e.g. page 7/7) of the Firefighter Q&A series.
      URL: http://example.com/firefighter/7
      Canonical: <link rel="canonical" href="http://example.com/firefighter/7" />
      Prev/Next: <link rel="prev" href="http://example.com/firefighter/6"/>

    View All (aka single-page view) of the entire Q&A
    (As explained above, I provide this as a convenience for users but b/c these pages can be very long / slow to load, it's not an optimal user experience so I don't want Google indexing these pages or directing users here, which a Google Blog post, [goo.gl...] , specifically acknowledges is a good reason to provide a View All page but not want it indexed.)
      URL: http://example.com/firefighter/All
      Canonical: none
      Prev/Next: none
      Noindex in <head>: <meta name="robots" content="noindex, follow">
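The scheme above could be generated rather than hand-written, which avoids stray-attribute typos. The following is only a rough Python sketch under assumptions: the function names are hypothetical, and it assumes every page of a series lives at base_url/N as in the Firefighter example.

```python
def pagination_head_tags(base_url, page, last_page):
    """Build the <head> link tags for one page of a paginated series.

    Each page is its own canonical; prev/next are only emitted where a
    neighbouring page exists.
    """
    if not 1 <= page <= last_page:
        raise ValueError("page must be between 1 and last_page")

    def url(n):
        return f"{base_url}/{n}"

    tags = [f'<link rel="canonical" href="{url(page)}"/>']
    if page > 1:
        tags.append(f'<link rel="prev" href="{url(page - 1)}"/>')
    if page < last_page:
        tags.append(f'<link rel="next" href="{url(page + 1)}"/>')
    return tags


def view_all_head_tags():
    """The View All page gets no canonical or prev/next, only a noindex."""
    return ['<meta name="robots" content="noindex, follow">']


# Example: print the tags for page 2 of a 7-page series.
for tag in pagination_head_tags("http://example.com/firefighter", 2, 7):
    print(tag)
```

Page 1 gets a canonical plus rel="next" only, the last page gets a canonical plus rel="prev" only, and every page in between gets all three.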

    Final comment: I am still concerned about the potential implication of incorporating the prev/next markup, because some sources indicate that Google will assume that the series is one that is most logically read sequentially from page 1, which might be the case for an article, but is NOT true for my Q&A site, since each Q/A couplet is a stand-alone 'nugget'...and so I want organic search visitors to be directed to the paginated page that has the specific content s/he searched for, and NOT simply be automatically dumped on Page 1 of the series, which would be a crummy search experience, since the content searched for will often not be on page 1. There seems to be a lot of mixed messaging regarding just how often Google will redirect searchers to page 1 when rel=prev/next is used to indicate a related series of pages, but from my research the benefits (link juice consolidation, clean / 'best-practice' markup) seem to be important enough to risk it.

    Thanks everyone for helping...comments on the above?
    aakk9999

    2:42 am on Jan 14, 2015 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    What you intend to do with regard to canonical, rel next/prev and noindexing the ViewAll page seems fine to me (BTW, no need to say "follow", this is a default, it is enough just to say content="noindex").

    Regarding your concern, I have just done a test on the website that uses rel prev/next for listing of products, with 10 products per page. Searching for a short description in quotes of one of the products that appears on page 3 shows page 3 in SERPs (and not page 1) - which is what you want. The link of page 3 shown in SERPs also goes to page 3 and not to page 1.

    Therefore you should not be concerned. In any case, after you make your changes, give it some time for Google to index it, then search for part of the text on your Q&A page > 1 and you can confirm for yourself that it does what it is supposed to do.

    domino66

    5:02 pm on Jan 22, 2015 (gmt 0)

    10+ Year Member Top Contributors Of The Month



    Thanks. BTW, silly question but does it matter if my HTML uses STRAIGHT quotes (") vs. CURLY quotes (”)?

    I added this line to the head section of the page I'm trying to noindex:
    <META NAME=”ROBOTS” CONTENT=”NOINDEX, FOLLOW”>

    Now when I run the page through the w3 markup validation service, it's producing one warning and two errors that I don't know how to interpret; can someone help?

      Warning Line 159, Column 50: Attribute follow” is not serializable as XML 1.0.
      <META NAME=”ROBOTS” CONTENT=”NOINDEX, FOLLOW”>

      Error Line 159, Column 50: Bad value ”ROBOTS” for attribute name on element meta: Keyword ”robots” is not registered.
      <META NAME=”ROBOTS” CONTENT=”NOINDEX, FOLLOW”>
      Syntax of metadata name:
      A metadata name listed in the HTML specification or listed in the WHATWG wiki. You can register metadata names on the WHATWG wiki yourself.

      Error Line 159, Column 50: Attribute follow” not allowed on element meta at this point.
      <META NAME=”ROBOTS” CONTENT=”NOINDEX, FOLLOW”>
      Attributes for element meta:
      Global attributes
      name
      http-equiv
      content
      charset

    lucy24

    8:44 pm on Jan 22, 2015 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    does it matter if my HTML uses STRAIGHT quotes (") vs. CURLY quotes (”)?

    YES.
    The straight ("typewriter") quote is what is recognized by HTML syntax. A curly quote is just another character, like text.
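To see what "just another character" means in practice, here is a small sketch using Python's standard html.parser module. It is only an illustration (AttrDump and meta_attrs are made-up names), and other parsers may split the tag slightly differently, but it shows how the same meta tag is read with straight versus curly quotes.

```python
from html.parser import HTMLParser


class AttrDump(HTMLParser):
    """Records the attributes of every <meta> tag as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metas.append(attrs)


def meta_attrs(html):
    """Return the list of (name, value) attribute pairs for each <meta>."""
    dump = AttrDump()
    dump.feed(html)
    return dump.metas


straight = '<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">'
# \u201d is the curly right double quote that was pasted into the page.
curly = "<META NAME=\u201dROBOTS\u201d CONTENT=\u201dNOINDEX, FOLLOW\u201d>"

print(meta_attrs(straight))
print(meta_attrs(curly))
```

With straight quotes the parser sees two attributes, name and content. With curly quotes the values are treated as unquoted, so the curly quotes become part of the name value, content is cut off at the space, and FOLLOW plus its trailing curly quote turns into a separate attribute, which lines up with the validator's "Attribute follow” not allowed" message.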

    domino66

    9:47 pm on Jan 22, 2015 (gmt 0)

    10+ Year Member Top Contributors Of The Month



    Thanks, that's what I suspected; changed it and the w3 markup validator doesn't give me errors anymore.

    BUT what's a good tool/service for checking whether it's truly implemented properly? I tried running the URL through [seoreviewtools.com...] and it's telling me that it isn't finding any ROBOTS code at all / no restriction on indexing.

    Would someone else mind running one of their pages (that they KNOW is noindexed properly) through that checker...? Maybe it's not functioning properly?
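One way to cross-check a third-party tool is to test the page source locally. Below is a minimal Python sketch, not an authoritative checker: the class and function names are hypothetical, and it assumes you have saved the page's HTML to a string. It looks only at robots meta tags, compares names and directives case-insensitively (so ALL CAPS is fine), and does not look at any X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Gathers the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").strip().lower() != "robots":
            return
        # content is a comma-separated list such as "NOINDEX, FOLLOW".
        for part in (attr.get("content") or "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def is_noindexed(html):
    """True if a robots meta tag in the HTML carries noindex (or none)."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return bool(finder.directives & {"noindex", "none"})
```

Calling is_noindexed on the straight-quoted tag returns True; a page with no robots meta tag, or one where the tag was written with curly quotes, returns False.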

    phranque

    10:26 pm on Jan 22, 2015 (gmt 0)

    WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



    <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

    fyi "FOLLOW" is default behavior and does nothing for you.

    lucy24

    12:07 am on Jan 23, 2015 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Maybe it's not functioning properly?

    For a given definition of "maybe". I fed it two of my noindex URLs-- in separate requests-- and each time it claimed the URL doesn't exist, though logs show it didn't even look. Further investigation suggests that it can't handle a line break at the end of a single item, though it's perfectly happy to process multiple-line requests. With that kind of persnicketiness, perhaps they can't handle your ALL CAPS?

    They do get points for managing to figure out that if a blank UA gets a 403, they should try a little honesty.

    domino66

    12:15 am on Jan 23, 2015 (gmt 0)

    10+ Year Member Top Contributors Of The Month



    LOL:)
    OK, thanks for the responses guys; I'm a little sensitive about it b/c of the importance of not screwing up a noindex.