Canonical URL Issues - including some new ones - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi


Canonical URL Issues - including some new ones

tedster

4:22 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

There's a potential canonical URL issue that we've not touched on often, if ever. It's the kind of thing that might cause indexing issues or split PageRank into different "piles" - and even, potentially, generate duplicate URL problems.

This canonical problem comes from adding a period to the end of a domain name - http://www.example.com. - and that can trigger a cascade of potential problems. If the trailing period is at the end of the domain name and the site's navigation uses relative urls, then the extra period gets carried forward, and forward, and forward, through succeeding links.

There's a new thread in our Apache Forum that touches on the issue, and it also shares a fix - [webmasterworld.com...] As moderator jdMorgan observes, even google.com. has this problem!

This kind of link can be generated innocently enough by forum software that automatically creates links for text strings that look like urls but are at the end of a sentence. And many servers will not have a problem resolving that url with an extra period.

So, for the sake of a complete reference, I'd like to collect the potential canonical url issues all in one place.

Canonical URL Issues
Different domain names serving the same content (302 redirects can make this kind of mess)
Different hostnames within one domain, such as "with-www" and "no-www" versions
With and without "index.html" for the domain root or a subdirectory root
Different protocols - https and http
Trailing period on the domain name
Double forward slash in the filepath - http://example.com//page.html
Swapping the order of query string parameters
URL rewrite that allows typos for the "keyworded" virtual directory name
Any forum software or CMS that generates alternate urls for the same content
URLs that include session parameters, clickpath tracking, etc.
Adding a port number to the domain name: example.com:443
URLs with unneeded query strings or extra parameters in the query string
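
As a rough illustration of how several of the items above collapse into one URL, here is a minimal Python sketch. The preferred scheme, the preferred hostname, the alias set and the "index.html" index name are all assumptions made for the example, not anything a particular server requires, and a real site would normally enforce this with server-side 301 redirects rather than application code.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions for illustration only.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com", "www.example.com"}
INDEX_NAME = "index.html"


def canonicalize(url):
    """Return one preferred spelling for several equivalent forms of a URL."""
    parts = urlsplit(url)
    # .hostname is already lower-cased and has any explicit port removed;
    # dropping the port assumes the site answers on default ports only.
    host = (parts.hostname or "").rstrip(".")  # trailing period on the domain
    if host in ALIAS_HOSTS:                    # with-www / no-www
        host = CANONICAL_HOST
    # Collapse double (or longer) runs of forward slashes in the path.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Treat ".../index.html" as the directory root.
    if path.endswith("/" + INDEX_NAME):
        path = path[: -len(INDEX_NAME)]
    # Put query string parameters in a fixed order.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The scheme is forced, so https and http map to the same URL here.
    return urlunsplit((CANONICAL_SCHEME, host, path, query, ""))


print(canonicalize("https://EXAMPLE.com:443//index.html"))
print(canonicalize("http://www.example.com./page.html?b=2&a=1"))
```

Session parameters, keyword typos and CMS-generated alternates are not handled here, since they depend on what each site considers meaningful.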

tedster

6:49 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Whew, that's twelve of them. Combine them in one big pile and you have "12 factorial" - that's 479,001,600 - possible URLs for the exact same content!

Have I missed any?
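
A quick check of that arithmetic, as a sketch only: 12 factorial is indeed 479,001,600, but a factorial counts the ways to order twelve items. If each issue is instead treated as an independent on/off variation (an assumption, since several of them allow far more than two variants), the count of combinations is 2 to the 12th power, which is best read as a lower bound.

```python
import math

issues = 12
orderings = math.factorial(issues)  # ways to arrange 12 items in a row
on_off_combinations = 2 ** issues   # each issue independently present or absent

print(orderings)            # 479001600
print(on_off_combinations)  # 4096
```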

Robert Charlton

7:29 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

I'd consider the IP address a special case to be noted.

[edited by: Robert_Charlton at 7:36 am (utc) on Aug. 8, 2008]

tedster

7:51 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Ah yes - factorial 13!

g1smd

10:46 pm on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

... and there's more
* Domain name versus IP address of server.
* Your main domain name versus a named subdomain/folder of your hosting company's domain name.
* Trailing slash versus no trailing slash on folder names.
* Infinite wildcard subdomains.
* Trailing quotes on requests that have been auto-linked from forum and blog posts, or from sites that have botched the HTML code in their link to you.
* Trailing question mark on end of URL, but no parameters present.
* Differing capitalisation within URLs (mostly affects IIS).
* 'Fake' parameters (that are not processed) on the end of a URL for a non-dynamic site, or that are ignored on a dynamic site.
* Extra parameters on sideways links, like the "nextoldest" and "nextnewest" links in forums such as vBulletin and phpBB.
* Differing drill-down paths within a website, or via internal search, where the content does not have a specific URL, but the URL is "built" using the path you took to get there.
* URLs where only part of the URL is needed and the rest is fluff. Affects blogs with SEF linking with keywords in the URL, but those are not used to pull a specific record from the database, hence yourdomain.com/blog/34567-blue-widgets.html and yourdomain.com/blog/34567-this-site-is-run-by-spammers-and-idiots-do-not-buy-this-junk.html will show the same content.
* URLs where there are extra parameters within that are used to "build" the navigational links on that page out to other related content.
* "Page one" problems. This is where a site has a sub-section with numbered pages, and where "page one" has a different URL depending on whether you get to it from a section index or from "page two".
* Moving pagination. This is where new content on &page=1 today, is moved to &page=2 tomorrow and new content appears on &page=1. The next day, content on &page=1 moves to &page=2 and content on &page=2 moves to &page=3 etc. The oldest content is forever appearing at a new URL each day, with a page number that is one greater than yesterday.

I use "daily" here as an example. It could be weekly, monthly, or random intervals, depending on how quickly new content is posted.
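
For the "fluff" URL case in the list above, one possible approach is to look the record up by its ID and redirect whenever the rest of the URL does not match the stored slug. The sketch below assumes a hypothetical POSTS table standing in for the blog database and a /blog/ID-slug.html URL shape taken from the example; neither comes from any particular blog package.

```python
import re

# Hypothetical stand-in for the blog database: post id -> canonical slug.
POSTS = {34567: "blue-widgets"}

SLUG_URL = re.compile(r"^/blog/(\d+)-([^/]*)\.html$")


def resolve(path):
    """Return (status, location); location is None unless redirecting."""
    match = SLUG_URL.match(path)
    if not match:
        return 404, None
    post_id, slug = int(match.group(1)), match.group(2)
    canonical_slug = POSTS.get(post_id)
    if canonical_slug is None:
        return 404, None
    if slug != canonical_slug:
        # Right record, wrong keywords: send a permanent redirect.
        return 301, f"/blog/{post_id}-{canonical_slug}.html"
    return 200, None


print(resolve("/blog/34567-blue-widgets.html"))
print(resolve("/blog/34567-some-other-words.html"))
```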

activeco

12:28 am on Aug 9, 2008 (gmt 0)

10+ Year Member

This canonical problem comes from adding a period to the end of a domain name

It is hard to believe that this can cause any problems, since in the DNS a dot at the end of a domain name simply designates a fully-qualified domain name.
So, some relative link ./filename called from example.com./dir should translate into example.com./dir/filename without any problems.

Do you have any negative experience with this?
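
A small sketch of the resolution being discussed, using Python's standard urljoin. The base URLs are assumptions for illustration, with a trailing slash added to the directory. It suggests both halves of the argument hold: the relative link resolves without trouble, and the trailing dot is carried into every resolved URL, so the result is a different string from the no-dot form.

```python
from urllib.parse import urljoin

dotted_base = "http://www.example.com./dir/"
plain_base = "http://www.example.com/dir/"

dotted = urljoin(dotted_base, "./filename")
plain = urljoin(plain_base, "./filename")
root_relative = urljoin(dotted_base, "/other.html")

print(dotted)           # http://www.example.com./dir/filename
print(plain)            # http://www.example.com/dir/filename
print(root_relative)    # http://www.example.com./other.html
print(dotted == plain)  # False
```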

Lorel

12:48 am on Aug 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Golly sakes! The redirects for all these must be enormous.

Marcia

1:11 am on Aug 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>Double forward slash in the filepath - http://example.com//page.html

Inbound links that look like this:

http://example.com//

CainIV

3:01 am on Aug 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Nice find Ted. I always just redirected anything behind the trailing slash of the root url to the exact root url to prevent this or any trailing character problem.

tedster

10:26 pm on Aug 10, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Here are some reference threads with more detail on some of these canonical url problems:

Duplicate Content [webmasterworld.com] - get it right or perish
Why "www" & "no-www" Are Different [webmasterworld.com] - the canonical duplicate issue
HTTPS versus HTTP [webmasterworld.com] - one more duplicate area
Domain Root vs. index.html [webmasterworld.com] - yet another kind of duplicate
Custom Error Pages [webmasterworld.com] - beware the server header status code
Vbulletin [webmasterworld.com] & Wordpress [webmasterworld.com] - duplicate content pitfalls

jdMorgan

5:48 pm on Aug 12, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Here's one more. The following URL is valid, but non-canonical:

http://www.WebmasterWorld.com/http://www.WebmasterWorld.com/

[added] The default server behaviour is to use the domain in the "local URL-path" part, as long as it's valid. [/added]

Jim

[edited by: jdMorgan at 5:51 pm (utc) on Aug. 12, 2008]

activeco

7:06 pm on Aug 12, 2008 (gmt 0)

10+ Year Member

Jim, that's really strange. Neither dot at the end nor that example should produce any problems.
Any working example?

jdMorgan

7:26 pm on Aug 12, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The problem is not whether FQDN URLs or any of these other variations function. The problem is that they create more than one URL for a unique resource. This thread only makes sense in that context, and this is the reason for the thread title.

Google and most of the other search engines use back-end processes to "de-duplicate" or "canonicalize" URLs, and often infer the "correct" single URL for a resource. But we have many cases posted here of webmasters who say that the "wrong URL" is showing up in search results. Or they post that their home page in "www" is PR4, while their non-www home page is PR3, indicating that PageRank has been 'split' across these two domains.

They're not understanding that preventative server-side measures are required.

Then there's the issue of exploits. A site could potentially respond on both HTTP and HTTPS, on www and non-www, with a trailing dot on the hostname, with a trailing port number on the hostname, and at "index.php" as well as "/". It might also use 'virtual subdirectories' for SEO-friendly keyword-in-URL URLs without enforcing a particular 'closed set' of values for that URL-path-part, and accept practically-infinite query string variations. Taken together, the number of URLs that could be used to reach a particular resource can indeed grow to be practically infinite -- limited only by the server settings which limit the length of the HTTP request header. A malicious competitor could potentially dilute the PageRank of your important pages with a bit of "creative linking."

In cases where large numbers of URLs resolve to a single resource, there are several dangers:

The "wrong" URL listed in search results.

PageRank/Link-popularity split among various URLs, reducing the rank of 'the' page.

So-called "duplicate-content penalties" --really a filter, IMO-- applied to resources with "too many" URLs.

User confusion (e.g. broken on-page visited-link highlighting).

So the whole point of this thread (and several others that Tedster cited above) is that one resource (e.g. one "page") should have one and only one URL by which it is accessible, and all other "valid" variants of that URL should be redirected to that single canonical URL.
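
To make the "redirect every variant" idea concrete, here is a minimal sketch written as Python WSGI middleware. The canonical hostname and the http scheme are assumptions for the example. It covers only the hostname variants (no-www, trailing dot, explicit port) and ignores SCRIPT_NAME, so treat it as an outline of the logic rather than a drop-in replacement for server-level rules.

```python
CANONICAL_HOST = "www.example.com"  # assumed preferred hostname


def canonical_redirect(app):
    """Wrap a WSGI app so any non-canonical Host gets a 301 to the canonical one."""
    def wrapper(environ, start_response):
        host = environ.get("HTTP_HOST", "")
        if host != CANONICAL_HOST:
            location = "http://" + CANONICAL_HOST + environ.get("PATH_INFO", "/")
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return wrapper


def hello(environ, start_response):
    # Trivial inner application used only to exercise the wrapper.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


seen = []
wrapped = canonical_redirect(hello)
wrapped({"HTTP_HOST": "example.com.", "PATH_INFO": "/page.html", "QUERY_STRING": ""},
        lambda status, headers: seen.append((status, headers)))
print(seen[0])
```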

Jim

activeco

8:38 am on Aug 13, 2008 (gmt 0)

10+ Year Member

OK, I got the point :), but I am still not convinced by the two cases mentioned. I have never seen any technical or canonical problem either with the TLD's trailing dot or with the domain name in the directory or file structure.
I have noticed a few listed links where someone tried to abuse the default handling of the query string, but I couldn't see any effects in the SERPs.
Something like [webmasterworld.com...]

g1smd

8:03 pm on Aug 13, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I have fixed three "trailing dot" issues in the last year or so. I see it rarely, but the trend is upwards. All have been caused by auto-linking of URLs in forum or blog posts, where the designer of the forum or blog was sloppy in their URL-parsing rules when selecting what to auto-link.
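
As a sketch of what a less sloppy auto-linking rule might look like, the Python below matches URL-like strings and then moves any sentence-ending punctuation outside the link. The pattern and the punctuation set are assumptions chosen for illustration; real forum software will have its own rules, and the surrounding text is not HTML-escaped here.

```python
import re
from html import escape

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
TRAILING_PUNCTUATION = ".,;:!?)"  # assumed set of sentence-ending characters


def autolink(text):
    """Wrap URLs in anchor tags, keeping trailing punctuation out of the href."""
    def replace(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING_PUNCTUATION)
        tail = raw[len(url):]
        return f'<a href="{escape(url)}">{escape(url)}</a>{tail}'
    return URL_PATTERN.sub(replace, text)


print(autolink("Something like this http://www.example.com. I see it all the time."))
```

Stripping a closing parenthesis is a trade-off: it avoids linking "(see http://www.example.com)" wrongly, but it would also truncate a URL that legitimately ends in one.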

pageoneresults

8:18 pm on Aug 13, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

http://www.example.com.

Hmmm, that can't be good. Off to do some digging on my own. I want to find out exactly what purpose that dot is serving in this instance. It appears to affect every site I've visited. You can append that dot to any URI and it will return a 200. At least the ones I've tested so far.

How many times do you see sentences ending with URI references and no trailing forward slash?

Something like this http://www.example.com. I see it all the time.

I always add a trailing forward slash to my written references. Whew, covers me there. But, that whole period thing is a concern. I need to dig up the RFCs and read, read, read. :)

activeco

8:34 pm on Aug 13, 2008 (gmt 0)

10+ Year Member

The trailing dot speeds up name resolution and it actually should be a good practice to link to it!
I don't think that can confuse bots in any way, including duplicate issues.

pageoneresults

8:42 pm on Aug 13, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I don't think that can confuse bots in any way, including duplicate issues.

I wouldn't be too certain about that just yet.

I do notice that a different set of cookies is set when adding the dot. That can't be a good thing in this scenario, can it?

activeco

8:48 pm on Aug 13, 2008 (gmt 0)

10+ Year Member

There was an exploit involving the trailing dot and cookies, which basically allowed reading all the cookies on the machine. It was not a failure of the dot itself, though, but rather of a browser. I think it is closed by now.

P.S. It was a triple trailing dot.

g1smd

8:52 pm on Aug 13, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

*** I don't think that can confuse bots in any way, including duplicate issues. ***

You would be wrong, as I have seen Google list the same URL twice - both with and without the dot.

activeco

9:16 pm on Aug 13, 2008 (gmt 0)

10+ Year Member

You would be wrong, as I have seen Google list the same URL twice - both with and without the dot.

Are you sure it was "tld./" and not "tld/."?

[edited by: Robert_Charlton at 7:18 am (utc) on Aug. 14, 2008]

jdMorgan

10:03 pm on Aug 13, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

That'd be nice to know from an academic standpoint. But the fact remains that unless we take steps on the server side to absolutely prevent multiple URLs from accessing content, we hand over the 'welfare' of our PageRank and link-popularity to an 'extra step' of back-end de-duplication processing by the search engines.

That de-duplication process might have a bug -- now or later. Or perhaps it can't always be run for all URLs before a new index is deployed, and the URLs from your site might not get processed.

If you care about your ranking or if it's important to your revenue, I advise putting the preventative measures in place on your server so that any particular piece of content can be accessed by one and only one URL, and all other variants --whether caused by human or machine error, and regardless of technical validity-- be 301-redirected to the single canonical URL. In this way, you control your own destiny, rather than relying on the search engines to "figure it out."

Jim

theBear

12:24 am on Aug 22, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

#1 Server Response: http://www.example.com./google/3718246-3-10.htm.
HTTP Status Code: HTTP/1.1 200 OK
Date: Fri, 22 Aug 2008 00:23:11 GMT

Oops !

[edited by: tedster at 1:15 am (utc) on Aug. 22, 2008]
[edit reason] switch to example.com - it can never be owned [/edit]

activeco

1:12 pm on Aug 22, 2008 (gmt 0)

10+ Year Member

#1 Server Response: http://www.example.com./google/3718246-3-10.htm.

Probably rewriting to QS which allows non-existent strings. The dot does not make the difference.

theBear

12:12 am on Aug 24, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

activeco,

It really doesn't make a difference. The end result is multiple names for the same information.

I can show the same thing for many sites. I just wanted to point out that even WebmasterWorld has issues in the area of canonical names.

keto

5:41 pm on Aug 27, 2008 (gmt 0)

10+ Year Member

Added the following rewrite rule and it seems to be doing the trick.

# Rewrite domains with a trailing period
RewriteCond %{HTTP_HOST} ^(.*)\.$
RewriteRule ^(.*)$ [%1...] [R=301,L]
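
To see what the RewriteCond pattern above captures, here is a small Python sketch that applies the same regular expression to a few Host values. The function name is made up for the example, and Python's regex engine is only standing in for Apache's. One thing it suggests checking: a Host header that also carries a port after the dot would not match this pattern.

```python
import re

TRAILING_DOT = re.compile(r"^(.*)\.$")  # same pattern as the RewriteCond


def strip_trailing_dot(host):
    """Return the host without one trailing period, or unchanged if there is none."""
    match = TRAILING_DOT.match(host)
    return match.group(1) if match else host


print(strip_trailing_dot("www.example.com."))       # www.example.com
print(strip_trailing_dot("www.example.com"))        # www.example.com
print(strip_trailing_dot("www.example.com.:8080"))  # www.example.com.:8080
```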