Canonical URL Issues - including some new ones

   
4:22 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There's a potential canonical URL issue that we've not touched on often, if ever. It's the kind of thing that might cause indexing issues or split PageRank into different "piles" - and even, potentially, generate duplicate URL problems.

This canonical problem comes from adding a period to the end of a domain name - http://www.example.com. - and that can trigger a cascade of potential problems. If the trailing period is at the end of the domain name and the site's navigation uses relative urls, then the extra period gets carried forward, and forward, and forward, through succeeding links.

There's a new thread in our Apache Forum that touches on the issue, and it also shares a fix - [webmasterworld.com...] As moderator jdMorgan observes, even google.com. has this problem!

This kind of link can be generated innocently enough by forum software that automatically creates links for text strings that look like urls but are at the end of a sentence. And many servers will not have a problem resolving that url with an extra period.

So, for the sake of a complete reference, I'd like to collect the potential canonical url issues all in one place.

Canonical URL Issues
  1. Different domain names serving the same content (302 redirects can make this kind of mess)
  2. Different hostnames within one domain, such as "with-www" and "no-www" versions
  3. With and without "index.html" for the domain root or a subdirectory root
  4. Different protocols - https and http
  5. Trailing period on the domain name
  6. Double forward slash in the filepath - http://example.com//page.html
  7. Swapping the order of query string parameters
  8. URL rewrite that allows typos for the "keyworded" virtual directory name
  9. Any forum software or CMS that generates alternate urls for the same content
  10. URLs that include session parameters, clickpath tracking, etc.
  11. Adding a port number to the domain name: example.com:443
  12. URLs with unneeded query strings or extra parameters in the query string
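
As a rough illustration of how several of the listed issues collapse back to one URL, here is a minimal Python sketch. The canonical scheme and hostname, the alias set, and the tracking parameter names are all assumptions made up for the example, not anything from a real site, and the function covers only issues 2 through 7 and 10 through 12.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical site policy: every value here is an assumption for the sketch.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"
HOST_ALIASES = {"example.com", "www.example.com"}
TRACKING_PARAMS = {"sessionid", "sid", "ref"}


def canonicalize(url: str) -> str:
    """Return the single preferred URL for a requested URL (illustrative only)."""
    parts = urlsplit(url)

    # Issues 2, 5, 11: hostname variants, trailing period, explicit port.
    # .hostname is already lowercased and excludes the port.
    host = (parts.hostname or "").rstrip(".")
    if host in HOST_ALIASES:
        host = CANONICAL_HOST

    # Issue 6: collapse repeated slashes in the path.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"

    # Issue 3: drop "index.html" at the root or a subdirectory root.
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]

    # Issues 7, 10, 12: drop tracking parameters and sort what remains.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(params))

    # Issue 4: one protocol only. The fragment never reaches the server, so drop it.
    return urlunsplit((CANONICAL_SCHEME, host, path, query, ""))


print(canonicalize("https://Example.com.:443//dir//index.html?b=2&sid=abc&a=1"))
# http://www.example.com/dir/?a=1&b=2
```

A server would compare the requested URL with this result and issue a 301 whenever the two differ.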
6:49 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Whew, that's twelve of them. Combine them in one big pile and you have "12 factorial" - that's 479,001,600 - possible URLs for the exact same content!

Have I missed any?
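
A quick check of the arithmetic: 12 factorial is indeed 479,001,600, but a factorial counts orderings of twelve items. If each issue is instead modelled as an independent on/off variation of one URL (an assumption, and a conservative one, since items such as extra query strings allow unlimited variants), the count is 2 to the power of 12. Either way, it is far more than one URL per page.

```python
import math

issues = 12

# The figure quoted above: orderings of 12 distinct items.
factorial_count = math.factorial(issues)
print(factorial_count)   # 479001600

# Assumption: each issue is an independent on/off toggle on a single URL.
toggle_count = 2 ** issues
print(toggle_count)      # 4096
```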

7:29 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I'd consider the IP address a special case to be noted.

[edited by: Robert_Charlton at 7:36 am (utc) on Aug. 8, 2008]

7:51 am on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Ah yes - factorial 13!
10:46 pm on Aug 8, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



... and there's more

* Domain name versus IP address of server.

* Your main domain name versus a named subdomain/folder of your hosting company's domain name.

* Trailing slash versus no trailing slash on folder names.

* Infinite wildcard subdomains.

* Trailing quotes on requests that have been auto-linked from forum and blog posts, or from sites that have botched the HTML code in their link to you.

* Trailing question mark on end of URL, but no parameters present.

* Differing capitalisation within URLs (mostly affects IIS).

* 'Fake' parameters (that are not processed) on the end of a URL for a non-dynamic site, or that are ignored on a dynamic site.

* Extra parameters on sideways links, like the "nextoldest" and "nextnewest" links in forums such as vBulletin and phpBB.

* Differing drill-down paths within a website, or via internal search, where the content does not have a specific URL, but the URL is "built" using the path you took to get there.

* URLs where only part of the URL is needed and the rest is fluff. Affects blogs with SEF linking with keywords in the URL, but those are not used to pull a specific record from the database, hence yourdomain.com/blog/34567-blue-widgets.html and yourdomain.com/blog/34567-this-site-is-run-by-spammers-and-idiots-do-not-buy-this-junk.html will show the same content.

* URLs where there are extra parameters within that are used to "build" the navigational links on that page out to other related content.

* "Page one" problems. This is where a site has a sub-section with numbered pages, and where "page one" has a different URL depending on whether you get to it from a section index or from "page two".

* Moving pagination. This is where new content on &page=1 today, is moved to &page=2 tomorrow and new content appears on &page=1. The next day, content on &page=1 moves to &page=2 and content on &page=2 moves to &page=3 etc. The oldest content is forever appearing at a new URL each day, with a page number that is one greater than yesterday.


I use "daily" here as an example. It could be weekly, monthly, or random intervals, depending on how quickly new content is posted.
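
For the "only part of the URL is needed" case above, one possible approach is to look up the record by its ID and redirect whenever the keyword part does not match the stored one. This is a minimal sketch: the lookup table, the URL pattern, and the function name are hypothetical stand-ins for whatever the blog software really uses.

```python
import re
from typing import Optional

# Hypothetical lookup table standing in for the blog's database.
CANONICAL_SLUGS = {34567: "blue-widgets"}

# Assumed URL layout: /blog/<numeric id>-<keywords>.html
BLOG_PATH = re.compile(r"^/blog/(\d+)-([^/]*)\.html$")


def redirect_target(path: str) -> Optional[str]:
    """Return the canonical path to 301 to, or None if no redirect applies."""
    match = BLOG_PATH.match(path)
    if match is None:
        return None                  # not a blog URL; handled elsewhere
    record_id, slug = int(match.group(1)), match.group(2)
    canonical_slug = CANONICAL_SLUGS.get(record_id)
    if canonical_slug is None:
        return None                  # unknown record; the caller should serve a 404
    if slug == canonical_slug:
        return None                  # already the canonical URL
    return f"/blog/{record_id}-{canonical_slug}.html"


print(redirect_target("/blog/34567-do-not-buy-this-junk.html"))
# /blog/34567-blue-widgets.html
```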
12:28 am on Aug 9, 2008 (gmt 0)

10+ Year Member



>>This canonical problem comes from adding a period to the end of a domain name

It is hard to believe that this can cause any problems, because in the DNS a dot at the end of a domain name designates a fully qualified domain name.
So, some relative link ./filename called from example.com./dir should translate into example.com./dir/filename without any problems.

Do you have any negative experience with this?
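
For what it's worth, the resolution itself can be checked with Python's standard urljoin. The paths are made up. The relative links resolve fine, but every resolved URL keeps the dotted hostname, and only an absolute link drops it.

```python
from urllib.parse import urljoin

# Page requested on the hostname with a trailing dot (paths are illustrative).
base = "http://www.example.com./dir/"

relative = urljoin(base, "./filename")
root_relative = urljoin(base, "/other/page.html")
absolute = urljoin(base, "http://www.example.com/other/page.html")

print(relative)        # http://www.example.com./dir/filename
print(root_relative)   # http://www.example.com./other/page.html
print(absolute)        # http://www.example.com/other/page.html
```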

12:48 am on Aug 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Golly sakes! The redirects for all these must be enormous.
1:11 am on Aug 9, 2008 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>>Double foward slash in the filepath - http://example.com//page.html

Inbound links that look like this:

http://example.com//

3:01 am on Aug 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nice find Ted. I always just redirected anything behind the trailing slash of the root url to the exact root url to prevent this or any trailing character problem.
10:26 pm on Aug 10, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Here are some reference threads with more detail on some of these canonical url problems:

Duplicate Content [webmasterworld.com] - get it right or perish
Why "www" & "no-www" Are Different [webmasterworld.com] - the canonical duplicate issue
HTTPS versus HTTP [webmasterworld.com] - one more duplicate area
Domain Root vs. index.html [webmasterworld.com] - yet another kind of duplicate
Custom Error Pages [webmasterworld.com] - beware the server header status code
Vbulletin [webmasterworld.com] & Wordpress [webmasterworld.com] - duplicate content pitfalls
5:48 pm on Aug 12, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Here's one more. The following URL is valid, but non-canonical:

http://www.WebmasterWorld.com/http://www.WebmasterWorld.com/

[added] The default server behaviour is to use the domain in the "local URL-path" part, as long as it's valid. [/added]

Jim

[edited by: jdMorgan at 5:51 pm (utc) on Aug. 12, 2008]

7:06 pm on Aug 12, 2008 (gmt 0)

10+ Year Member



Jim, that's really strange. Neither dot at the end nor that example should produce any problems.
Any working example?
7:26 pm on Aug 12, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The problem is not whether FQDN URLs or any of these other variations function. The problem is that they create more than one URL for a unique resource. This thread only makes sense in that context, and this is the reason for the thread title.

Google and most of the other search engines use back-end processes to "de-duplicate" or "canonicalize" URLs, and often infer the "correct" single URL for a resource. But we have many cases posted here of webmasters who say that the "wrong URL" is showing up in search results. Or they post that their home page in "www" is PR4, while their non-www home page is PR3, indicating that PageRank has been 'split' across these two domains.

They're not understanding that preventative server-side measures are required.

Then there's the issue of exploits. A site could potentially respond on both HTTP and HTTPS, on www and non-www, with a trailing dot on the hostname, with a trailing port number on the hostname, and at "index.php" as well as "/". It might also use 'virtual subdirectories' for SEO-friendly keyword-in-URL URLs without enforcing a particular 'closed set' of values for that URL-path-part, and accept practically-infinite query string variations. Given all that, the number of URLs that could be used to reach a particular resource can indeed grow to be practically infinite -- limited only by the server settings which limit the length of the HTTP request header. A malicious competitor could potentially dilute the PageRank of your important pages with a bit of "creative linking."
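
To make the multiplication concrete, here is a small Python sketch that enumerates just a handful of those variations for a home page. The hostnames, the "index.php" path, and the ignored "x=1" parameter are illustrative assumptions. A real site may accept far more variants, or fewer.

```python
from itertools import product

schemes = ["http", "https"]
hosts = ["example.com", "www.example.com"]
trailing_dots = ["", "."]               # trailing dot on the hostname
explicit_port = [False, True]           # default port spelled out or not
paths = ["/", "/index.php"]
queries = ["", "?", "?x=1"]             # bare "?" and one ignored parameter

default_port = {"http": ":80", "https": ":443"}

variants = set()
for scheme, host, dot, with_port, path, query in product(
    schemes, hosts, trailing_dots, explicit_port, paths, queries
):
    port = default_port[scheme] if with_port else ""
    variants.add(f"{scheme}://{host}{dot}{port}{path}{query}")

print(len(variants))   # 96
```

Six small choices already give 96 URLs for one page, before any keyword or query string games.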

In cases where large numbers of URLs resolve to a single resource, there are several dangers:

  • The "wrong" URL listed in search results.
  • PageRank/Link-popularity split among various URLs, reducing the rank of 'the' page.
  • So-called "duplicate-content penalties" --really a filter, IMO-- applied to resources with "too many" URLs.
  • User confusion (e.g. broken on-page visited-link highlighting).

    So the whole point of this thread (and several others that Tedster cited above) is that one resource (e.g. one "page") should have one and only one URL by which it is accessible, and all other "valid" variants of that URL should be redirected to that single canonical URL.

    Jim

    8:38 am on Aug 13, 2008 (gmt 0)

    10+ Year Member



    OK, I got the point :), but I am still not convinced by the two cases mentioned. I have never seen any technical or canonical problem either with the TLD's trailing dot or with the domain name in the directory or file structure.
    I have noticed a few listed links where someone tried to abuse the default handling of the query string, but I couldn't see any effect in the SERPs.
    Something like [webmasterworld.com...]
    8:03 pm on Aug 13, 2008 (gmt 0)

    WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



    I have fixed three "trailing dot" issues in the last year or so. I see it rarely, but the trend is upwards. All have been caused by auto-linking of URLs in forum or blog posts, where the designer of the forum or blog was sloppy in their URL-parsing rules when selecting what to auto-link.
    8:18 pm on Aug 13, 2008 (gmt 0)

    WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



    http://www.example.com.

    Hmmm, that can't be good. Off to do some digging on my own. I want to find out exactly what purpose that dot is serving in this instance. It appears to affect every site I've visited. You can append that dot to any URI and it will return a 200. At least the ones I've tested so far.

    How many times do you see sentences ending with URI references and no trailing forward slash?

    Something like this http://www.example.com. I see it all the time.

    I always add a trailing forward slash to my written references. Whew, covers me there. But, that whole period thing is a concern. I need to dig up the RFCs and read, read, read. :)

    8:34 pm on Aug 13, 2008 (gmt 0)

    10+ Year Member



    The trailing dot speeds up name resolution and it actually should be a good practice to link to it!
    I don't think that can confuse bots in any way, including duplicate issues.
    8:42 pm on Aug 13, 2008 (gmt 0)

    WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



    >>I don't think that can confuse bots in any way, including duplicate issues.

    I wouldn't be too certain about that just yet.

    I do notice that a different set of cookies is set when adding the dot. That can't be a good thing in this scenario, can it?

    8:48 pm on Aug 13, 2008 (gmt 0)

    10+ Year Member



    There was an exploit involving a trailing dot and cookies, which basically allowed reading all the cookies on the machine. But that was a failure of the browser rather than of the dot. I think it is closed by now.

    P.S. It was a triple trailing dot.

    8:52 pm on Aug 13, 2008 (gmt 0)

    WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



    *** I don't think that can confuse bots in any way, including duplicate issues. ***

    You would be wrong, as I have seen Google list the same URL twice - both with and without the dot.

    9:16 pm on Aug 13, 2008 (gmt 0)

    10+ Year Member



    >>You would be wrong, as I have seen Google list the same URL twice - both with and without the dot.

    Are you sure it was "tld./" and not "tld/."?

    [edited by: Robert_Charlton at 7:18 am (utc) on Aug. 14, 2008]

    10:03 pm on Aug 13, 2008 (gmt 0)

    WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



    That'd be nice to know from an academic standpoint. But the fact remains that unless we take steps on the server side to absolutely prevent multiple URLs from accessing content, we hand over the 'welfare' of our PageRank and link-popularity to an 'extra step' of back-end de-duplication processing by the search engines.

    That de-duplication process might have a bug -- now or later. Or perhaps it can't always be run for all URLs before a new index is deployed, and the URLs from your site might not get processed.

    If you care about your ranking or if it's important to your revenue, I advise putting the preventative measures in place on your server so that any particular piece of content can be accessed by one and only one URL, and all other variants --whether caused by human or machine error, and regardless of technical validity-- be 301-redirected to the single canonical URL. In this way, you control your own destiny, rather than relying on the search engines to "figure it out."
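
As one way to picture that server-side measure outside of Apache, here is a minimal WSGI sketch in Python. The canonical scheme and hostname are assumed values, the middleware name is hypothetical, and it assumes the application is mounted at the site root. It handles only the scheme and hostname variants (non-www, trailing dot, explicit port), not path or query string variants.

```python
from urllib.parse import quote

CANONICAL_SCHEME = "http"             # assumed site policy
CANONICAL_HOST = "www.example.com"    # assumed canonical hostname


def canonical_host_middleware(app):
    """Wrap a WSGI app so that any other scheme or Host value gets a 301."""

    def wrapper(environ, start_response):
        scheme = environ.get("wsgi.url_scheme", "http")
        host = environ.get("HTTP_HOST", "")
        # Exact match only: "www.example.com." or "www.example.com:80" fall through.
        if scheme == CANONICAL_SCHEME and host == CANONICAL_HOST:
            return app(environ, start_response)

        # Rebuild the requested path and query string on the canonical host.
        path = quote(environ.get("PATH_INFO", "") or "/")
        location = f"{CANONICAL_SCHEME}://{CANONICAL_HOST}{path}"
        if environ.get("QUERY_STRING"):
            location += "?" + environ["QUERY_STRING"]
        start_response(
            "301 Moved Permanently",
            [("Location", location), ("Content-Length", "0")],
        )
        return [b""]

    return wrapper
```

Matching the Host header exactly, rather than listing known bad variants, means a variant nobody has thought of yet is redirected as well.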

    Jim

    12:24 am on Aug 22, 2008 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    #1 Server Response: http://www.example.com./google/3718246-3-10.htm.
    HTTP Status Code: HTTP/1.1 200 OK
    Date: Fri, 22 Aug 2008 00:23:11 GMT

    Oops !

    [edited by: tedster at 1:15 am (utc) on Aug. 22, 2008]
    [edit reason] switch to example.com - it can never be owned [/edit]

    1:12 pm on Aug 22, 2008 (gmt 0)

    10+ Year Member



    >>#1 Server Response: http://www.example.com./google/3718246-3-10.htm.

    It is probably rewriting to a query string, which allows non-existent strings. The dot does not make a difference.

    12:12 am on Aug 24, 2008 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    activeco,

    It really doesn't make a difference. The end result is multiple names for the same information.

    I can show the same thing for many sites. I just wanted to point out that even WebmasterWorld has issues in the area of canonical names.

    5:41 pm on Aug 27, 2008 (gmt 0)

    5+ Year Member



    Added the following rewrite rule and it seems to be doing the trick.

    # Rewrite domains with a trailing period
    RewriteCond %{HTTP_HOST} ^(.*)\.$
    RewriteRule ^(.*)$ [%1...] [R=301,L]

     
