|Google Offers Canonical Support in the HTTP Header|
|Based on your feedback, we’re happy to announce that Google web search now supports link rel="canonical" relationships specified in HTTP headers as per the syntax described in section 5 of IETF RFC 5988. Webmasters can use rel="canonical" HTTP headers to signal the canonical URL for both HTML documents and other types of content such as PDF files. |
I don't get it. What's the difference between this and the current use of Canonicals?
You can't put rel=canonical in a PDF file, but you can now send it in the HTTP header that precedes the file content.
Interesting, I guess that means it could be used to cover canonical URLs for images too.
This is going to be a nightmare to implement for an ordinary webmaster.
It's kind of parallel to the X-Robots directive. I don't think the average webmaster even knows about that, either. Or etags, or a host of other technology tools that can be very useful.
True, but on the other hand I would think you would have to have some pretty specialized requirements in order to need this at all. I don't recall ever having had duplicate content problems with non-HTML doctypes. The vast majority of webmasters would probably never even notice if they had a problem that this might be a solution to, let alone attempt to implement it.
|It's kind of parallel to the X-Robots directive. I don't think the average webmaster even knows about that, either. Or etags, or a host of other technology tools that can be very useful. |
Yes, it is kind of, but a bit more complicated. With X-Robots you would usually return this for a certain resource type (e.g. all .doc resources or similar), and the number of valid combinations is limited. With an etag, you can go for an algorithm that is coded once and applied to many responses.
With canonical in HTTP headers for a PDF document, for example - every PDF document would have to carry, somewhere, the information on its canonical resource (e.g. its canonical HTML page). So somewhere there has to be a table of PDF documents (or other resources) with the associated canonical info for each, to be returned in the HTTP headers.
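A minimal sketch of what such a table could look like, assuming a hand-maintained mapping (the paths and hostname here are made up for illustration):

```python
# Hypothetical mapping from each PDF to its canonical HTML page.
CANONICAL_MAP = {
    "/downloads/white-paper.pdf": "/articles/white-paper.html",
    "/downloads/annual-report.pdf": "/articles/annual-report.html",
}

def canonical_link_header(path, host="www.example.com"):
    """Return the Link header value to send for this path, or None
    if the resource has no canonical mapping."""
    target = CANONICAL_MAP.get(path)
    if target is None:
        return None
    return '<http://%s%s>; rel="canonical"' % (host, target)
```

The server (or CMS plugin) would then attach the returned value as a `Link:` response header whenever one of the mapped PDFs is requested.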
Perhaps CMSs like WordPress will develop plugins for it.
I am wondering what would happen if the HTTP headers return one canonical and there is also a canonical link element in the head section of the document - which one takes precedence?
added: I am wondering if this works cross-domain? Because it could be a haven for hackers - it is bad enough searching for canonical insertion in your HTML via view source, but now it seems we would have to inspect headers too?
|I don't recall ever having had duplicate content problems with non-HTML doctypes. |
all it takes is an IIS server configured to be case-insensitive and you now have 2**N - 1 non-canonical URLs, where N is the number of alphabetic characters in the URL path and file name/extension.
it's very common - here's an example from a .gov site:
that's 131,071 non-canonical URLs.
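To sanity-check that arithmetic, here is a quick sketch (the example path is made up; the .gov URL above must have had 17 alphabetic characters, hence 2**17 - 1 = 131,071 duplicates):

```python
from itertools import product

def case_variants(path):
    """Every case variant a case-insensitive server (e.g. IIS) will
    happily serve for the same resource: 2**N strings, where N is the
    number of alphabetic characters in the path."""
    options = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in path]
    return {"".join(combo) for combo in product(*options)}

path = "/Docs/Report.pdf"              # hypothetical path with N = 13
n = sum(c.isalpha() for c in path)
assert len(case_variants(path)) == 2 ** n   # one canonical + (2**n - 1) duplicates
```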
added: the google blog post that announces this (linked above) gives an example of a pdf document that is a duplicate of an html document where the html doc is the canonical url.
[edited by: phranque at 2:27 am (utc) on Jun 18, 2011]
|I am wondering if this works cross-domain? Because it could be a haven for hackers - it is bad enough searching for canonical insertion in your HTML via view source, but now it seems we would have to inspect headers too? |
assuming it's implemented the same way as the link rel=canonical element, it works cross-domain.
About rel="canonical" - Webmaster Tools Help:
not sure how easy it would be to inject HTTP headers without access to the server.
also, the link rel=canonical only works in the <head>, not in the <body> of a text/html document, which reduces the possibility of canonical injection through UGC.
Anyone with access to the packets on the way from your site to the end user, browser, or bot can potentially change the headers.
Here's an easy way for proxies to "make documents their own", by pointing the HTTP rel=canonical header to their own proxy-based "virtual URL" for the document.
true that, but what are the chances a hacker will have such access between your server and googlebot?
and googlebot shouldn't be retrieving any headers from proxy servers that are not under your control.
To answer a couple of questions:
1. We currently support the rel="canonical" HTTP headers for web search only, and we're keeping an eye on how webmasters use them to see if we should support other Google properties.
2. The rel="canonical" HTTP header does work across domains, just like the link rel="canonical" tag in HTML.
Thanks Pierre. It's always appreciated when we can get input from a Google employee.
It looks useful for non-HTML content.
It should not be difficult to apply to dynamically generated content. It also looks possible to do it for static content with flexible servers like Apache or Lighttpd, but you might end up doing a lot of server configuration.
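As a rough sketch of what that configuration could look like on Apache (the hostname and file names are hypothetical; this assumes mod_headers is enabled, and each static document would need its own block or a generated config, which is exactly the configuration burden mentioned above):

```apache
# Hypothetical: point a static PDF at its canonical HTML page.
<Files "white-paper.pdf">
    Header add Link "<http://www.example.com/articles/white-paper.html>; rel=\"canonical\""
</Files>
```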
This new header entry is made of win.
I've got some reasonably image intensive websites. Some years ago I added support for "parallelization" and I was amazed at the performance boost.
Typically a web page with 20 images (thumbnails, etc) will all be served from a singular domain. Example:
A web browser will only open +/- 4 simultaneous connections to a single domain. Therefore, in the above example, 4 connections will each request +/- 5 files in succession. The time required to retrieve the entire page might be +/- 1.5 seconds.
With support for "parallelization", my example web page references the following files:
In this example, a web browser creates 4 simultaneous connections to each of the 5 domains. The entire web page and its required images are all downloaded at once. The time required to retrieve the entire page drops from +/- 1.5 seconds to +/- 0.5 seconds. A huge speed improvement.
It is not difficult in a parallel distribution of images to have the same file served from multiple domains:
This is obviously a problem for search engine spiders and - to a lesser extent - webmasters on metered bandwidth. In my example, if I have 3 GB of images, the spider potentially downloads 5 copies totaling 15 GB.
In my case, I would previously block indexing of my parallelized images via robots.txt files on i1.*, i2.*, etc. This worked fine. However, things have changed. Google's thumbnailing of pages in search plus the new instant preview thing makes blocking less than ideal. My instant preview thumbnails had missing images (anything loaded from i1.*, i2.* etc were just blank rectangles in the thumbnail) and didn't look quite right.
This new "Link:" header is AWESOME. For example, when any of the following images are requested:
The following header can be included in the HTTP Response:
Link: <http://www.blah.com/images/1.jpg>; rel="canonical"
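For a dynamically served setup, a minimal sketch of the same idea (the hostnames come from the example above; the handler class and port are assumptions, not a production configuration):

```python
from http.server import SimpleHTTPRequestHandler

CANONICAL_HOST = "www.blah.com"  # the single canonical image host

def canonical_image_header(path, canonical_host=CANONICAL_HOST):
    """Build the Link header value pointing a sharded copy (served from
    i1., i2., ...) back at the one canonical URL on the main host."""
    return '<http://%s%s>; rel="canonical"' % (canonical_host, path)

class ImageHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the canonical Link header to images."""
    def end_headers(self):
        if self.path.lower().endswith((".jpg", ".gif", ".png")):
            self.send_header("Link", canonical_image_header(self.path))
        super().end_headers()

# To serve: ThreadingHTTPServer(("", 8080), ImageHandler).serve_forever()
```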
I'm SO fecking excited about this, I'm having nerdgasms. Time was, the Google Page Speed plugin came out and told me how great image parallelization was and how all the cool kids are doing it. So of course I spent a few days and implemented support for it in my network. However, 3-4 weeks later I lost 15% of my traffic because I had effectively lost all of my image search referrals due to the relocation of images. I never regained these referrals. This made me sad. But my web pages were rendering on browsers so much faster that I wasn't willing to go back to the old school framework. I'm hoping that support for this header will get me back some image referrals. Image referrals are mostly empty calories, but they're good for the ego.
If there are any other uber geeks on WebmasterWorld interested in adding support for this header, here are some notes, some of which may be obvious but I'll include them anyway:
1. A lightweight web server (i.e. not Apache) should be used to serve up your static images.
2. I've previously been using thttpd, which I modified slightly to support my directory structure. However, I just couldn't get it to support this new "Link:" header with additional code without it segmentation faulting all over the place. And I was a pretty good C programmer back in the day.
3. I tried but couldn't get lighttpd to support it. It's fast as hell though. I might try again later.
4. After development and testing in my sandbox, I'm transitioning to Cherokee for image serving. Its code is well written, so it was not terribly difficult to sort out where and how to add support for this header. If you want the 6-10 lines of code I added to support my framework (shouldn't be hard to adapt to yours) then PM me.
Wow! I've never had a nerdgasm. Sounds good!
I moved to 5x parallelization last month to increase performance. Shaved 1.3 seconds. Was hoping that a bit more speed would get me some of my Panda II losses back (wishful thinking). This is a cool update. We already use canonicals to strip out PPC tracking parameters and to deal with case inconsistencies from inbound links created by stupid people who change the case of our URL. I'll now use this for my images. I wonder if headers can be injected into Akamai's CDN?
That's a very interesting approach, but as the rel="canonical" HTTP header is web-search only at the moment, it will not work as you imagine right now. The more technically correct approach is to parallelize the loading of the images on the page, not to duplicate the images themselves. To rephrase: the Page Speed recommendation is not to duplicate images across lots of subdomains, but to put the multiple (non-identical) images of an HTML page onto different hosts. This means that each image has only one URL, used throughout the site, but the different images on the page are at URLs on different hostnames.
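One way to get the setup described here - each image pinned to exactly one hostname, with the page still loading from several hosts in parallel - is a deterministic shard function. A sketch, using the hypothetical i1-i5 hosts from the earlier example:

```python
import zlib

HOSTS = ["i1.blah.com", "i2.blah.com", "i3.blah.com", "i4.blah.com", "i5.blah.com"]

def shard_host(path):
    """Pin each image path to one stable hostname, so every image has
    exactly one URL while the page still parallelizes across 5 hosts."""
    return HOSTS[zlib.crc32(path.encode("utf-8")) % len(HOSTS)]

# In page templates, emit: "http://%s%s" % (shard_host(img_path), img_path)
```

Because the hash is stable, the same image always gets the same URL, so there are no duplicates for spiders to crawl in the first place.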
To take your example, this kind of set up:
is better than having a situation of
In reality of course, i1, i2, i3, etc can be set up as CNAME or A records in the DNS to the same server or content distribution network or multiple IP addresses configured as per the Site Speed recommendations (e.g. to send browser caching headers).
Hope this helps,
@seanpecor, the easy way to do it on Lighttpd would be to use mod_magnet. It may be slightly slower if you are looking for extreme performance, but it's not going to make a difference for most sites.