|Common Mistakes With rel=canonical|
Google has described the five most common mistakes when using rel=canonical, so I thought it worth documenting.
|Including a rel=canonical link in your webpage is a strong hint to search engines your preferred version to index among duplicate pages on the web. It’s supported by several search engines, including Yahoo!, Bing, and Google. The rel=canonical link consolidates indexing properties from the duplicates, like their inbound links, as well as specifies which URL you’d like displayed in search results. However, rel=canonical can be a bit tricky because it’s not very obvious when there’s a misconfiguration. Source: Common Mistakes With rel=canonical [googlewebmastercentral.blogspot.co.uk] |
|We recommend the following best practices for using rel=canonical: |
A large portion of the duplicate page’s content should be present on the canonical version.
One test is to imagine you don’t understand the language of the content—if you placed the duplicate side-by-side with the canonical, does a very large percentage of the words of the duplicate page appear on the canonical page? If you need to speak the language to understand that the pages are similar; for example, if they’re only topically similar but not extremely close in exact words, the canonical designation might be disregarded by search engines.
Double-check that your rel=canonical target exists (it’s not an error or “soft 404”)
Verify the rel=canonical target doesn’t contain a noindex robots meta tag
Make sure you’d prefer the rel=canonical URL to be displayed in search results (rather than the duplicate URL)
Include the rel=canonical link in either the <head> of the page or the HTTP header
Specify no more than one rel=canonical for a page. When more than one is specified, all rel=canonicals will be ignored.
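A few of the checks in that list can be automated. What follows is only a rough sketch using Python's standard html.parser; the names CanonicalAudit, audit and has_noindex are made up for illustration, and it only inspects the HTML you hand it (it does not fetch anything or look at HTTP headers).

```python
from html.parser import HTMLParser


class CanonicalAudit(HTMLParser):
    """Collects rel=canonical links and robots noindex, noting where they appear."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_canonicals = []  # hrefs found inside <head>
        self.body_canonicals = []  # hrefs found anywhere else
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            bucket = self.head_canonicals if self.in_head else self.body_canonicals
            bucket.append(attrs.get("href"))
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit(html):
    """Return (canonical_or_None, list_of_problems) for one page's HTML."""
    parser = CanonicalAudit()
    parser.feed(html)
    problems = []
    if parser.body_canonicals:
        problems.append("rel=canonical outside <head>")
    if len(parser.head_canonicals) > 1:
        problems.append("more than one rel=canonical in <head>")
    if not parser.head_canonicals:
        problems.append("no rel=canonical in <head>")
    canonical = parser.head_canonicals[0] if len(parser.head_canonicals) == 1 else None
    return canonical, problems


def has_noindex(html):
    """Run this on the HTML of the canonical *target* page."""
    parser = CanonicalAudit()
    parser.feed(html)
    return parser.noindex
```

You would run audit() on the duplicate page, then fetch the URL it reports yourself and run has_noindex() on that. "No rel=canonical in <head>" is not necessarily a mistake, since the link may be sent in the HTTP header instead.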
I just generated a new sitemap and my sitemap reports a lot of 301 statuses for links to http://www.example.com/subdirectory/ without the index.html. So what's the best convention for linking to subdirectories? I always thought it was the format above, leaving the index.html out. This might explain a big problem on my site. Also it's possible I have the following...
(without the trailing slash)
can someone hit me with some sage advice?
or a tack hammer?
[edited by: engine at 2:33 pm (utc) on Apr 10, 2013]
[edit reason] example.com [/edit]
rel=canonical cleaned up a lot of issues for us but didn't seem to reduce the number of pages in the G directory, just their priority.
@backdraft7 I think the 301 is applied when you forget to add the trailing forward slash, most browsers demonstrate it if you try it without the /; even in a case where DirectoryIndex is naming a different page as index it still applies the / because technically you're inside the directory (Browser says to itself, "This folder is where he wanted to go, let's step inside.")
If lucy24 is around soon she'll probably lay down some technical reasons and advice.
With the trailing / is correct.
In the case of http://www.example.com/subdirectory, subdirectory is actually a location (page) within the example.com/ directory.
|I think the 301 is applied when you forget to add the trailing forward slash, most browsers demonstrate it if you try it without the / |
Not sure what you mean by this.
Servers redirect, not browsers.
Browsers simply request the location they're told to by the user or server.
|If lucy24 is around soon she'll probably lay down some technical reasons and advice. |
Funny you should say that...
Are you on Apache? If it's IIS I'll assume there is an equivalent function under a different name.
User, meet mod_dir. mod_dir, meet user. This module works silently in the background; most people never need to think about it. It does two jobs. By default:
When there is a request for an extensionless URL, it checks whether there's a directory by that name. If yes, it issues a redirect-- not a rewrite but a full-fledged 301 redirect-- to the identical URL with trailing slash. You can see this in your site logs if you deliberately request a slashless directory. Or just wait; some robots do this as a matter of course. (The name MJ12 comes to mind.)
Then, when there's a request for a directory with trailing slash, mod_dir quietly looks for a named index file such as index.html. (There's a list of specific names to look for. You can change this list.) If yes, it quietly serves up the file. If no, the server checks whether you've got auto-indexing enabled (strictly speaking, that part is mod_autoindex's job). If yes, user gets a pseudo-page that shows everything in the directory. If no, user gets a 403.
Exception: If the request is for your bare hostname, like www.example.com, then the slash is added by the browser itself before the request ever reaches your site. mod_dir then proceeds directly to step two: looking for an index.html file.
... and that's why you don't need "index.html". Any request for /directory/ will get you the content of /directory/index.html if it exists. If you now request /directory/index.html by name, you've got two ways of reaching the same page.
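If it helps to see the two jobs as logic, here is a toy model in Python. It is not Apache code, just a sketch of the behaviour described above; the function name resolve and the directories/files sets are invented for the example.

```python
def resolve(path, directories, files, index_names=("index.html",), autoindex=False):
    """Toy model of the slash redirect and index lookup; returns (status, detail).

    directories: set of directory paths, each ending in "/"
    files: set of file paths
    """
    # Job 1: a slashless request that names a directory gets a 301
    # to the same URL with a trailing slash.
    if not path.endswith("/") and path + "/" in directories:
        return 301, path + "/"

    # Job 2: a directory request is answered with the first index file found.
    if path.endswith("/") and path in directories:
        for name in index_names:
            if path + name in files:
                return 200, path + name
        # No index file: either an auto-generated listing or a 403.
        if autoindex:
            return 200, "auto-generated listing"
        return 403, "Forbidden"

    # Anything else is an ordinary file request.
    if path in files:
        return 200, path
    return 404, "Not Found"


dirs = {"/", "/subdirectory/"}
pages = {"/subdirectory/index.html"}
for request in ("/subdirectory", "/subdirectory/", "/subdirectory/index.html"):
    print(request, "->", resolve(request, dirs, pages))
```

The last two requests end up at the same file, which is the "two ways of reaching the same page" problem.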
Now, google says to put "rel=canonical" on any NONcanonical pages. Obviously this is meaningless in the case of index.html pages, because it's the same physical file. Same goes for some other canonical problems, like static html pages with garbage attached to the URL. You can only say "rel=canonical" if you don't mind having a page link to itself. (As a user, I find this seriously annoying and confusing. "Wasn't I here already?")
|Obviously this is meaningless in the case of index.html pages, because it's the same physical file. Same goes for some other canonical problems, like static html pages with garbage attached to the URL. You can only say "rel=canonical" if you don't mind having a page link to itself. (As a user, I find this seriously annoying and confusing. "Wasn't I here already?") |
That makes no sense.
If you put <link rel="canonical" href="http://www.example.com/dir/"> on dir/index.html Google will "get it" and consider dir/ the canonical version.
If you put <link rel="canonical" href="http://www.example.com/dir/html-page.html"> on /dir/html-page.html and somehow Google requests /dir/html-page.html?somevar=someval your user will never know and Google will "get it" and consider the parameterless page the canonical.
It's not at all useless to put it on there and you don't need to link a page to itself to use it, so there's no reason your users should even know it's there, unless they look.
|You can only say "rel=canonical" if you don't mind having a page link to itself. |
Where did you get that idea?
You don't even need it on the page, you can serve it in the header.
|Last week Google, Yahoo, and Microsoft announced support for a new link element to clean up duplicate urls on sites. The syntax is pretty simple: An ugly url such as http://www.example.com/page.html?sid=asdf314159265 can specify in the HEAD part of the document the following: |
<link rel="canonical" href="http://example.com/page.html"/>
That tells search engines that the preferred location of this url (the “canonical” location, in search engine speak) is http://example.com/page.html instead of http://www.example.com/page.html?sid=asdf314159265 .
|Dave, it’s totally fine for the preferred version of the page to point back to itself. I’d recommend using absolute urls just to prevent any potential problems from popping up, but that should be no problem at all. |
The effect can be seen (demonstrated) using most browsers; the server does the redirect but the browser shows it visually. Only said "most" because I believe one of the browsers might leave it off visually if you do
|Not sure what you mean by this. |
Servers redirect, not browsers.
To be completely open about the subject,
|you can serve it in the header. |
was news to me. I don't know why I didn't take notice of that before; I might be making a change soon to see if there's a difference afterward.
As for the frequency of rel=canonical @lucy24, we don't use it on the actual page it belongs to as they suggest so index pages don't include the tag/code in them, and those wouldn't point to just the .tld/ or directory/ either (don't know if you'd canonicalize index.html to "/" since that alone is index, no matter what you name the index page - hope that's phrased right for everyone ...does not compute....)
|...don't know if you'd canonicalize index.html to "/" since that alone is index, no matter what you name the index page - hope that's phrased right for everyone ...does not compute.... |
There's no reason not to if you have a preferred version to be displayed but can't use a 301 for some reason. Many sites can use a 301 and "strip" the index.ext via external redirect so there's only one location for the information physically available, but if you can't redirect, I would think that's a good use of the canonical to indicate which version (dir/ or dir/index.ext) should be shown in the results.
If you don't have a preferred version, then it's no big deal to not use it, because Google will just pick one themselves to treat as the canonical. Using the tag really just gives you more control over what URL is shown in the results when there are multiple choices for the same or essentially the same information.
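For anyone generating the tag from a script, picking the preferred URL might look something like this. It is only a sketch: preferred_url and the INDEX_NAMES list are assumptions for the example, and dropping the query string is only safe when the parameters don't change what the page shows.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of index file names; match it to your own server's settings.
INDEX_NAMES = ("index.html", "index.htm", "index.php")


def preferred_url(url, drop_query=True):
    """Map dir/index.ext to dir/ and optionally strip the query string."""
    parts = urlsplit(url)
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last in INDEX_NAMES:
        path = head + "/"
    query = "" if drop_query else parts.query
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


def canonical_tag(url):
    """Build the <link> element for the <head> of the page."""
    return '<link rel="canonical" href="%s">' % preferred_url(url)


print(canonical_tag("http://www.example.com/dir/index.html"))
print(canonical_tag("http://www.example.com/dir/html-page.html?somevar=someval"))
```

If you can issue a 301 instead, the same function gives you the redirect target.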
[edited by: TheOptimizationIdiot at 8:50 pm (utc) on Apr 10, 2013]
|Where did you get that idea? |
A lot of different concepts are getting tossed around in this thread. canonical is one thing. rel is another. <head> and HTTP header are different things.
|don't know if you'd canonicalize index.html to "/" since that alone is index, no matter what you name the index page - hope that's phrased right for everyone ...does not compute.... |
Well, you'd hope that links to "index.html" would never actually occur. But I guess if you've got outside links using this form, you might benefit from using "rel='canonical'" on your internal links.
|note that to use this option, you'll need to be able to configure your server |
Thanks, google, for that wildly misleading utterance. Honestly, doesn't that make it sound as if you can't issue a header unless you've got control of the config file? Depending on page, you may not even need htaccess.
|Adding rel="canonical" to the head section of a page is useful for HTML content, but it can't be used for PDFs and other file types indexed by Google Web Search. In these cases |
Again... Doesn't that seem to imply that the http header is only relevant for non-page documents? And how many versions of a pdf have you got, anyway? It's not something you'd ordinarily construct on the fly, except in response to an individual human request.
|canonical is one thing. rel is another. |
You're really confusing the issue for people, imo.
rel and canonical and href go together in a <link> in the head of the page or in the HTTP header.
Link Rel in the <head>
<link rel="canonical" href="http://www.example.com/page.html">
Link Rel in the HTTP Header
Link: <http://www.example.com/page.html>; rel="canonical"
The concept is canonicalization.
There's nothing different being "tossed around" here.
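To check what a server actually sends in the header form, you can pull the canonical out of the Link header value. A minimal sketch, assuming a simple header without commas or semicolons inside quoted parameter values; the function name is made up.

```python
import re


def canonical_from_link_header(value):
    """Return the URL marked rel="canonical" in a Link header value, or None."""
    # Each entry looks like: <url>; param=value; param="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)', value):
        url, params = match.group(1), match.group(2)
        rel = re.search(r';\s*rel\s*=\s*"?([^";,]+)"?', params, re.IGNORECASE)
        # rel may hold several space-separated relation types.
        if rel and "canonical" in rel.group(1).lower().split():
            return url
    return None


header = '<http://www.example.com/page.html>; rel="canonical"'
print(canonical_from_link_header(header))
```

Feed it the value of the Link header only, not the "Link:" name itself.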
|you might benefit from using "rel='canonical'" on your internal links. |
It would do no good to put rel="canonical" on links. It must go in the <head> section of the document or the HTTP header to be recognized. Anywhere else and it gets completely ignored.
|Doesn't that seem to imply that the http header is only relevant for non-page documents? |
No. You're reading too much into things and really not understanding rel=canonical.
It was designed from the start to be flexible and easy to use.
You can put it in the <head> or HTTP header. Either is fine. Putting it anywhere else will have no effect whatsoever, otherwise people could drop a link here (or anywhere that allows links) with rel=canonical pointing to their site and effectively hijack the page.
|And how many versions of a pdf have you got, anyway? |
It's not about how many versions of the pdf you have; it's about having an HTML page and a pdf (or plain text print version or xml/rss version or other HTML version) with the same information and being able to say which version is preferred for search engines to show in the results.
I have a question regarding use of the canonical tag on pagination pages that display real estate listings.
Each of my pages displays 15 results and pages are displayed by city. The problem is that throughout the year the number of listing results for a given city changes from as few as 20 listings to over 200 at times.
In the two years of my site's existence I've just let Google handle the indexing and had no tags/robots set up in the header.
About 4 months ago, while results per city were lowest, we started receiving hundreds of soft 404s in our WST account. But the URLs being considered soft 404s are the deeper pagination pages that were showing real estate listings when we had a large number in that city, but now are blank result pages.
For technical reasons we are unable to set up the recommended rel="prev" and rel="next" tags. And setting up a "view all" page wouldn't work because of the high number of listings we have at times.
We were trying to handle the issue by setting all pagination pages after "page 1" with a canonical tag pointing to "page 1" but the pagination pages are still showing up in the Google index and are competing with our "page 1" result page.
We monitored the pages for close to three months but we didn't see any improvements. Actually, we started seeing that several of our "page 1" results were having their title content truncated to short/vague text.
Given all of those issues, we decided to add a robots "noindex" tag in the header of the pagination pages, along with the canonical tag we set up pointing to the "page 1" result page.
After adding the "noindex" tag we are noticing fewer pages "competing" in results. But this is a poor long-term solution considering all of the results pages we have to prevent from being indexed.
Does anyone have experience with a challenge like this?
With regards to using canonical on pages 2 onwards to point to page 1 result page, Google says the following:
5 common mistakes with rel=canonical
|Mistake 1: rel=canonical to the first page of a paginated series |
We had a very recent discussion on pagination on this thread, maybe some views will help: